* [PATCH -mm 00/12] kmemcg reparenting
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Hi,

During my recent attempt to push kmemcg shrinkers, it was pointed out
to me that the current kmemcg implementation has a serious design flaw -
it lacks reparenting. Currently, each memcg cache holds a css reference
to its memcg and does not release it until the cache is emptied.
Although this approach is simple, it leads to memcgs hanging around for
quite a long time after death, which is ugly, and building anything on
top of that is unacceptable. So this patch set aims to implement
reparenting for kmemcg charges.

[ for more details see the discussion thread:
  https://lkml.org/lkml/2014/2/11/623 ]

It is based on top of 3.14.0-rc4-mmotm and organized as follows:
 - Patches 1-3 fix some nasty races in the kmemcg implementation. I
   could not let them live any longer, because they touch the code I'm
   going to modify.
 - Patches 4-6 prepare memcg_cache_params for reparenting.
 - Patch 7 reworks slab charging, making it easier to track and
   therefore reparent kmem charges, and patches 8-10 kill the old
   charging code.
 - Patch 11 introduces kmemcg reparenting.
 - Patch 12 is for SLUB. It fixes sysfs naming clashes that can arise
   due to reparented caches.

Please note that this patch set does not resolve all kmemcg-related
issues - there are still plenty of them (e.g. "dangling" caches) - but
it is already big enough, so I guess I'll address them later, once this
one is committed (if it is committed at all, of course).

Many thanks to Johannes Weiner, who proposed the idea and kindly
outlined basic design principles.

Thanks,

Vladimir Davydov (12):
  memcg: flush cache creation works before memcg cache destruction
  memcg: fix race in memcg cache destruction path
  memcg: fix root vs memcg cache destruction race
  memcg: move slab caches list/mutex init to memcg creation
  memcg: add pointer from memcg_cache_params to cache
  memcg: keep all children of each root cache on a list
  memcg: rework slab charging
  memcg: do not charge kmalloc_large allocations
  fork: do not charge thread_info to kmemcg
  memcg: kill GFP_KMEMCG and stuff
  memcg: reparent slab on css offline
  slub: make sure all memcg caches have unique names on sysfs

 include/linux/gfp.h             |    5 -
 include/linux/memcontrol.h      |  133 ++-------
 include/linux/slab.h            |   15 +-
 include/linux/thread_info.h     |    2 -
 include/trace/events/gfpflags.h |    1 -
 kernel/fork.c                   |    4 +-
 mm/memcontrol.c                 |  587 +++++++++++++++++----------------------
 mm/page_alloc.c                 |   35 ---
 mm/slab.c                       |   47 ++--
 mm/slab.h                       |   17 +-
 mm/slab_common.c                |  100 +++++--
 mm/slub.c                       |   88 ++++--
 12 files changed, 470 insertions(+), 564 deletions(-)

-- 
1.7.10.4


* [PATCH -mm 01/12] memcg: flush cache creation works before memcg cache destruction
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

When we get to memcg cache destruction, either from the root cache
destruction path or when turning a memcg offline, there still might be
memcg cache creation works pending that were scheduled before we
initiated the destruction. We need to flush them before starting to
destroy memcg caches, otherwise we can end up with a leaked kmem cache
or, even worse, a use-after-free.
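
The fix below funnels all creation works through a dedicated workqueue
so that both destruction paths can flush it first. A minimal sketch of
that pattern, with placeholder names (cache_create_wq, create_work_fn,
destroy_caches are illustrations, not the actual kernel symbols):

  #include <linux/workqueue.h>

  static struct workqueue_struct *cache_create_wq;  /* allocated at init */

  static void create_work_fn(struct work_struct *w)
  {
          /* ... actually create the memcg cache ... */
  }

  static void schedule_cache_creation(struct work_struct *work)
  {
          INIT_WORK(work, create_work_fn);
          queue_work(cache_create_wq, work);  /* instead of schedule_work() */
  }

  static void destroy_caches(void)
  {
          /*
           * No new creation works can be queued at this point, so flushing
           * the workqueue only waits for the already pending ones; the
           * destruction that follows then cannot race with a late creation.
           */
          flush_workqueue(cache_create_wq);
          /* ... destroy the now-stable set of caches ... */
  }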

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 mm/memcontrol.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8a87614b6238..b61b6e9381e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2966,6 +2966,7 @@ static DEFINE_MUTEX(set_limit_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
 static DEFINE_MUTEX(activate_kmem_mutex);
+static struct workqueue_struct *memcg_cache_create_wq;
 
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -3392,6 +3393,15 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	int i, failed = 0;
 
 	/*
+	 * Since the cache is being destroyed, it shouldn't be allocated from
+	 * any more, and therefore no new memcg cache creation works could be
+	 * scheduled. However, there still might be pending works scheduled
+	 * before the cache destruction was initiated. Flush them before
+	 * destroying child caches to avoid nasty races.
+	 */
+	flush_workqueue(memcg_cache_create_wq);
+
+	/*
 	 * If the cache is being destroyed, we trust that there is no one else
 	 * requesting objects from it. Even if there are, the sanity checks in
 	 * kmem_cache_destroy should caught this ill-case.
@@ -3439,6 +3449,15 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 	if (!memcg_kmem_is_active(memcg))
 		return;
 
+	/*
+	 * By the time we get here, the cgroup must be empty. That said no new
+	 * allocations can happen from its caches, and therefore no new memcg
+	 * cache creation works can be scheduled. However, there still might be
+	 * pending works scheduled before the cgroup was turned offline. Flush
+	 * them before destroying memcg caches to avoid nasty races.
+	 */
+	flush_workqueue(memcg_cache_create_wq);
+
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
@@ -3483,7 +3502,7 @@ static void __memcg_create_cache_enqueue(struct mem_cgroup *memcg,
 	cw->cachep = cachep;
 
 	INIT_WORK(&cw->work, memcg_create_cache_work_func);
-	schedule_work(&cw->work);
+	queue_work(memcg_cache_create_wq, &cw->work);
 }
 
 static void memcg_create_cache_enqueue(struct mem_cgroup *memcg,
@@ -3694,10 +3713,20 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
 }
+
+static void __init memcg_kmem_init(void)
+{
+	memcg_cache_create_wq = alloc_workqueue("memcg_cache_create", 0, 1);
+	BUG_ON(!memcg_cache_create_wq);
+}
 #else
 static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 {
 }
+
+static void __init memcg_kmem_init(void)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -7261,6 +7290,7 @@ static int __init mem_cgroup_init(void)
 	enable_swap_cgroup();
 	mem_cgroup_soft_limit_tree_init();
 	memcg_stock_init();
+	memcg_kmem_init();
 	return 0;
 }
 subsys_initcall(mem_cgroup_init);
-- 
1.7.10.4


* [PATCH -mm 02/12] memcg: fix race in memcg cache destruction path
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

We schedule the memcg cache shrink+destruction work
(memcg_params::destroy) from two places: when we turn a memcg offline
(mem_cgroup_destroy_all_caches) and when the last page of the cache is
freed (memcg_params::nr_pages reaches zero, see memcg_release_pages,
mem_cgroup_destroy_cache). Since the latter can happen while the work
scheduled from mem_cgroup_destroy_all_caches is in progress or still
pending, we need to be careful to avoid races there - each of those
functions should bail out if it sees that the other is in progress.
Currently, the only check we do is in the destruction work handler: we
look at memcg_params::nr_pages and do not destroy the cache unless it is
zero. But that's not enough. An example of a race we can hit is shown
below:

  CPU0					CPU1
  ----					----
  kmem_cache_destroy_work_func:		memcg_release_pages:
					  atomic_sub_and_test(1<<order, &s->
							memcg_params->nr_pages)
					  /* reached 0 => schedule destroy */

    atomic_read(&cachep->memcg_params->nr_pages)
    /* 0 => going to destroy the cache */
    kmem_cache_destroy(cachep);

					  mem_cgroup_destroy_cache(s):
					    /* the cache was destroyed on CPU0
					       - use after free */

An obvious way to fix this is to replace the nr_pages counter with a
reference counter and have the memcg take a reference. Cache destruction
is then scheduled by whichever thread decrements the refcount to 0.
Basically, this is what this patch does, but there is one subtlety: the
work handler is used not only for cache destruction, it also shrinks the
cache if it's still in use (we can't call shrink directly from
mem_cgroup_destroy_all_caches due to locking dependencies). We handle
this by noting that a shrink is only needed when the work is scheduled
from mem_cgroup_destroy_all_caches, because the cache is already empty
when we release its last page. And if we drop the reference taken by the
memcg in the work handler, we can tell who exactly scheduled the worker
- mem_cgroup_destroy_all_caches or memcg_release_pages.
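
A simplified sketch of the resulting lifetime rules (the helper names
are placeholders and the cache lookup is abbreviated; see the diff below
for the real code):

  /*
   * The cache starts with refcount 1, held by its memcg; every slab page
   * adds 1 << order more. Whoever brings the counter to zero schedules
   * the destruction work exactly once.
   */
  static void release_pages(struct memcg_cache_params *params, int order)
  {
          if (atomic_sub_and_test(1 << order, &params->refcount))
                  schedule_work(&params->destroy);  /* last page is gone */
  }

  static void memcg_offline(struct memcg_cache_params *params)
  {
          /* the memcg still holds its ref, so the work cannot be pending */
          schedule_work(&params->destroy);
  }

  static void destroy_work_fn(struct work_struct *w)
  {
          struct memcg_cache_params *params =
                  container_of(w, struct memcg_cache_params, destroy);
          struct kmem_cache *cachep = params_to_cache(params);  /* placeholder */

          if (atomic_read(&params->refcount) != 0) {
                  /* scheduled from memcg_offline(): shrink, drop memcg's ref */
                  kmem_cache_shrink(cachep);
                  if (!atomic_dec_and_test(&params->refcount))
                          return;  /* pages remain; release_pages() finishes */
          }
          /* the refcount reached zero here or in release_pages() */
          kmem_cache_destroy(cachep);
  }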

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/memcontrol.h |    1 -
 include/linux/slab.h       |    7 ++--
 mm/memcontrol.c            |   86 +++++++++++++-------------------------------
 mm/slab.h                  |   17 +++++++--
 4 files changed, 42 insertions(+), 69 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index af63e6004c62..e54fb469a908 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -512,7 +512,6 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
-void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 int __kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3dd389aa91c7..3ed53de256ea 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -522,8 +522,8 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
- * @dead: set to true after the memcg dies; the cache may still be around.
- * @nr_pages: number of pages that belongs to this cache.
+ * @refcount: the reference counter; cache destruction will be scheduled when
+ *            it reaches zero
  * @destroy: worker to be called whenever we are ready, or believe we may be
  *           ready, to destroy this cache.
  */
@@ -538,8 +538,7 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
-			bool dead;
-			atomic_t nr_pages;
+			atomic_t refcount;
 			struct work_struct destroy;
 		};
 	};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b61b6e9381e8..416da5dc3d2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3206,6 +3206,7 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 		s->memcg_params->root_cache = root_cache;
 		INIT_WORK(&s->memcg_params->destroy,
 				kmem_cache_destroy_work_func);
+		atomic_set(&s->memcg_params->refcount, 1);
 		css_get(&memcg->css);
 	} else
 		s->memcg_params->is_root_cache = true;
@@ -3327,64 +3328,24 @@ static inline void memcg_resume_kmem_account(void)
 static void kmem_cache_destroy_work_func(struct work_struct *w)
 {
 	struct kmem_cache *cachep;
-	struct memcg_cache_params *p;
-
-	p = container_of(w, struct memcg_cache_params, destroy);
+	struct memcg_cache_params *params;
 
-	cachep = memcg_params_to_cache(p);
+	params = container_of(w, struct memcg_cache_params, destroy);
+	cachep = memcg_params_to_cache(params);
 
-	/*
-	 * If we get down to 0 after shrink, we could delete right away.
-	 * However, memcg_release_pages() already puts us back in the workqueue
-	 * in that case. If we proceed deleting, we'll get a dangling
-	 * reference, and removing the object from the workqueue in that case
-	 * is unnecessary complication. We are not a fast path.
-	 *
-	 * Note that this case is fundamentally different from racing with
-	 * shrink_slab(): if memcg_cgroup_destroy_cache() is called in
-	 * kmem_cache_shrink, not only we would be reinserting a dead cache
-	 * into the queue, but doing so from inside the worker racing to
-	 * destroy it.
-	 *
-	 * So if we aren't down to zero, we'll just schedule a worker and try
-	 * again
-	 */
-	if (atomic_read(&cachep->memcg_params->nr_pages) != 0)
+	if (atomic_read(&params->refcount) != 0) {
+		/*
+		 * We were scheduled from mem_cgroup_destroy_all_caches().
+		 * Shrink the cache and drop the reference taken by memcg.
+		 */
 		kmem_cache_shrink(cachep);
-	else
-		kmem_cache_destroy(cachep);
-}
 
-void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
-{
-	if (!cachep->memcg_params->dead)
-		return;
+		/* cache is still in use? */
+		if (!atomic_dec_and_test(&params->refcount))
+			return;
+	}
 
-	/*
-	 * There are many ways in which we can get here.
-	 *
-	 * We can get to a memory-pressure situation while the delayed work is
-	 * still pending to run. The vmscan shrinkers can then release all
-	 * cache memory and get us to destruction. If this is the case, we'll
-	 * be executed twice, which is a bug (the second time will execute over
-	 * bogus data). In this case, cancelling the work should be fine.
-	 *
-	 * But we can also get here from the worker itself, if
-	 * kmem_cache_shrink is enough to shake all the remaining objects and
-	 * get the page count to 0. In this case, we'll deadlock if we try to
-	 * cancel the work (the worker runs with an internal lock held, which
-	 * is the same lock we would hold for cancel_work_sync().)
-	 *
-	 * Since we can't possibly know who got us here, just refrain from
-	 * running if there is already work pending
-	 */
-	if (work_pending(&cachep->memcg_params->destroy))
-		return;
-	/*
-	 * We have to defer the actual destroying to a workqueue, because
-	 * we might currently be in a context that cannot sleep.
-	 */
-	schedule_work(&cachep->memcg_params->destroy);
+	kmem_cache_destroy(cachep);
 }
 
 int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
@@ -3425,12 +3386,12 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 		 * kmem_cache_destroy() will call kmem_cache_shrink internally,
 		 * and that could spawn the workers again: it is likely that
 		 * the cache still have active pages until this very moment.
-		 * This would lead us back to mem_cgroup_destroy_cache.
+		 * This would lead us back to memcg_release_pages().
 		 *
-		 * But that will not execute at all if the "dead" flag is not
-		 * set, so flip it down to guarantee we are in control.
+		 * But that will not execute at all if the refcount is > 0, so
+		 * increment it to guarantee we are in control.
 		 */
-		c->memcg_params->dead = false;
+		atomic_inc(&c->memcg_params->refcount);
 		cancel_work_sync(&c->memcg_params->destroy);
 		kmem_cache_destroy(c);
 
@@ -3443,7 +3404,6 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 
 static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 {
-	struct kmem_cache *cachep;
 	struct memcg_cache_params *params;
 
 	if (!memcg_kmem_is_active(memcg))
@@ -3460,9 +3420,13 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
-		cachep = memcg_params_to_cache(params);
-		cachep->memcg_params->dead = true;
-		schedule_work(&cachep->memcg_params->destroy);
+		/*
+		 * Since we still hold the reference to the cache params from
+		 * the memcg, the work could not have been scheduled from
+		 * memcg_release_pages(), and this cannot fail.
+		 */
+		if (!schedule_work(&params->destroy))
+			BUG();
 	}
 	mutex_unlock(&memcg->slab_caches_mutex);
 }
diff --git a/mm/slab.h b/mm/slab.h
index 3045316b7c9d..b8caee243b88 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -122,7 +122,7 @@ static inline bool is_root_cache(struct kmem_cache *s)
 static inline void memcg_bind_pages(struct kmem_cache *s, int order)
 {
 	if (!is_root_cache(s))
-		atomic_add(1 << order, &s->memcg_params->nr_pages);
+		atomic_add(1 << order, &s->memcg_params->refcount);
 }
 
 static inline void memcg_release_pages(struct kmem_cache *s, int order)
@@ -130,8 +130,19 @@ static inline void memcg_release_pages(struct kmem_cache *s, int order)
 	if (is_root_cache(s))
 		return;
 
-	if (atomic_sub_and_test((1 << order), &s->memcg_params->nr_pages))
-		mem_cgroup_destroy_cache(s);
+	if (atomic_sub_and_test((1 << order), &s->memcg_params->refcount)) {
+		/*
+		 * We have to defer the actual destroying to a workqueue,
+		 * because we might currently be in a context that cannot
+		 * sleep.
+		 *
+		 * Note we cannot fail here, because if the work scheduled from
+		 * mem_cgroup_destroy_all_caches() were still pending, the
+		 * cache refcount wouldn't reach zero.
+		 */
+		if (!schedule_work(&s->memcg_params->destroy))
+			BUG();
+	}
 }
 
 static inline bool slab_equal_or_root(struct kmem_cache *s,
-- 
1.7.10.4


* [PATCH -mm 03/12] memcg: fix root vs memcg cache destruction race
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

When a root cache is being destroyed and we are about to initiate
destruction of its child caches (see kmem_cache_destroy_memcg_children),
we have to handle races with pending memcg cache destruction works.
Currently, we simply cancel the memcg_params::destroy work before
calling kmem_cache_destroy for a memcg cache, but that's totally wrong,
because the work handler may destroy and free the cache, resulting in a
use-after-free in cancel_work_sync. Furthermore, we do not handle the
case where memcg cache destruction is scheduled after we have started to
destroy the root cache - which is possible, because nothing prevents
memcgs from going offline while we are destroying a root cache. In that
case we are likely to get a use-after-free in the work handler.

This patch resolves the race as follows (a simplified sketch of the
resulting code flow follows the list):

1) It makes kmem_cache_destroy silently exit if it is called for a memcg
   cache while the corresponding root cache is being destroyed, leaving
   the destruction to kmem_cache_destroy_memcg_children. That makes the
   call to cancel_work_sync safe when it is made from the root cache
   destruction path.

2) It moves cancel_work_sync so that it is called after we have
   unregistered a child cache from its memcg. That prevents the work
   from being rescheduled from the memcg offline path.
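
A minimal sketch of the resulting destruction path for a child cache;
the helper names stand in for the corresponding routines in
mm/slab_common.c and are only meant to show the ordering, not the exact
code:

  static void destroy_cache(struct kmem_cache *s, bool destroying_root)
  {
          /*
           * (1) A memcg cache whose root is already being destroyed is
           *     left to kmem_cache_destroy_memcg_children(), so the
           *     cancel_work_sync() issued there cannot race with the
           *     cache being freed here.
           */
          if (!destroying_root && !memcg_root_cache(s)->refcount) {
                  s->refcount++;  /* keep it alive for the root path */
                  return;
          }

          list_del(&s->list);
          memcg_unregister_cache(s);  /* (2) unlink from the memcg first... */

          if (destroying_root && !is_root_cache(s)) {
                  /* ...so the work cannot be re-queued once cancelled */
                  cancel_work_sync(&s->memcg_params->destroy);
                  /* and shrinking during shutdown cannot re-arm it */
                  atomic_inc(&s->memcg_params->refcount);
          }

          /* ...proceed with the usual shutdown and freeing... */
  }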

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/memcontrol.h |    2 +-
 include/linux/slab.h       |    1 +
 mm/memcontrol.c            |   22 ++-----------
 mm/slab_common.c           |   77 ++++++++++++++++++++++++++++++++++++--------
 4 files changed, 69 insertions(+), 33 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e54fb469a908..6128323ea453 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -512,7 +512,7 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
-int __kmem_cache_destroy_memcg_children(struct kmem_cache *s);
+int kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3ed53de256ea..ee9f1b0382ac 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -117,6 +117,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
 			void (*)(void *));
 #ifdef CONFIG_MEMCG_KMEM
 void kmem_cache_create_memcg(struct mem_cgroup *, struct kmem_cache *);
+void kmem_cache_destroy_memcg(struct kmem_cache *, bool);
 #endif
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 416da5dc3d2a..1b9634090454 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3345,10 +3345,10 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 			return;
 	}
 
-	kmem_cache_destroy(cachep);
+	kmem_cache_destroy_memcg(cachep, false);
 }
 
-int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 	struct kmem_cache *c;
 	int i, failed = 0;
@@ -3378,23 +3378,7 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 		if (!c)
 			continue;
 
-		/*
-		 * We will now manually delete the caches, so to avoid races
-		 * we need to cancel all pending destruction workers and
-		 * proceed with destruction ourselves.
-		 *
-		 * kmem_cache_destroy() will call kmem_cache_shrink internally,
-		 * and that could spawn the workers again: it is likely that
-		 * the cache still have active pages until this very moment.
-		 * This would lead us back to memcg_release_pages().
-		 *
-		 * But that will not execute at all if the refcount is > 0, so
-		 * increment it to guarantee we are in control.
-		 */
-		atomic_inc(&c->memcg_params->refcount);
-		cancel_work_sync(&c->memcg_params->destroy);
-		kmem_cache_destroy(c);
-
+		kmem_cache_destroy_memcg(c, true);
 		if (cache_from_memcg_idx(s, i))
 			failed++;
 	}
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f3cfccf76dda..05ba3cd1b507 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -302,28 +302,70 @@ out_unlock:
 	put_online_cpus();
 }
 
-static int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+static void __kmem_cache_destroy(struct kmem_cache *, bool);
+
+void kmem_cache_destroy_memcg(struct kmem_cache *s, bool destroying_root)
+{
+	BUG_ON(is_root_cache(s));
+	__kmem_cache_destroy(s, destroying_root);
+}
+
+static int __kmem_cache_shutdown_memcg(struct kmem_cache *s,
+				       bool destroying_root)
 {
-	int rc;
+	int rc = 0;
 
-	if (!s->memcg_params ||
-	    !s->memcg_params->is_root_cache)
+	if (!destroying_root) {
+		struct kmem_cache *root;
+
+		root = memcg_root_cache(s);
+		BUG_ON(root == s);
+		/*
+		 * If we are racing with the root cache destruction, let
+		 * kmem_cache_destroy_memcg_children() finish with this cache.
+		 */
+		if (!root->refcount) {
+			s->refcount++;
+			return 1;
+		}
+	}
+
+	if (!s->memcg_params)
 		return 0;
 
 	mutex_unlock(&slab_mutex);
-	rc = __kmem_cache_destroy_memcg_children(s);
+	if (s->memcg_params->is_root_cache) {
+		rc = kmem_cache_destroy_memcg_children(s);
+	} else {
+		/*
+		 * There might be a destruction work pending, which needs to be
+		 * cancelled before we start to destroy the cache.
+		 *
+		 * This should be done after we deleted all the references to
+		 * this cache, otherwise the work could be rescheduled.
+		 *
+		 * __kmem_cache_shutdown() will call kmem_cache_shrink()
+		 * internally, and that could spawn the worker again. We
+		 * increment the refcount to avoid that.
+		 */
+		if (destroying_root) {
+			cancel_work_sync(&s->memcg_params->destroy);
+			atomic_inc(&s->memcg_params->refcount);
+		}
+	}
 	mutex_lock(&slab_mutex);
 
 	return rc;
 }
 #else
-static int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+static int __kmem_cache_shutdown_memcg(struct kmem_cache *s,
+				       bool destroying_root)
 {
 	return 0;
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
-void kmem_cache_destroy(struct kmem_cache *s)
+static void __kmem_cache_destroy(struct kmem_cache *s, bool destroying_root)
 {
 	get_online_cpus();
 	mutex_lock(&slab_mutex);
@@ -332,19 +374,17 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (s->refcount)
 		goto out_unlock;
 
-	if (kmem_cache_destroy_memcg_children(s) != 0)
-		goto out_unlock;
-
 	list_del(&s->list);
 	memcg_unregister_cache(s);
 
+	if (__kmem_cache_shutdown_memcg(s, destroying_root) != 0)
+		goto out_undelete;
+
 	if (__kmem_cache_shutdown(s) != 0) {
-		list_add(&s->list, &slab_caches);
-		memcg_register_cache(s);
 		printk(KERN_ERR "kmem_cache_destroy %s: "
 		       "Slab cache still has objects\n", s->name);
 		dump_stack();
-		goto out_unlock;
+		goto out_undelete;
 	}
 
 	mutex_unlock(&slab_mutex);
@@ -360,6 +400,17 @@ out_unlock:
 	mutex_unlock(&slab_mutex);
 out_put_cpus:
 	put_online_cpus();
+	return;
+out_undelete:
+	list_add(&s->list, &slab_caches);
+	memcg_register_cache(s);
+	goto out_unlock;
+}
+
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+	BUG_ON(!is_root_cache(s));
+	__kmem_cache_destroy(s, true);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
-- 
1.7.10.4


* [PATCH -mm 04/12] memcg: move slab caches list/mutex init to memcg creation
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

I need them initialized even for cgroups that have not had kmem
accounting activated.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 mm/memcontrol.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1b9634090454..69431f5285cc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5122,8 +5122,6 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 		goto out_rmid;
 
 	memcg->kmemcg_id = memcg_id;
-	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
-	mutex_init(&memcg->slab_caches_mutex);
 
 	/*
 	 * We couldn't have accounted to this cgroup, because it hasn't got the
@@ -5870,6 +5868,9 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 	int ret;
 
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	mutex_init(&memcg->slab_caches_mutex);
+
 	ret = memcg_propagate_kmem(memcg);
 	if (ret)
 		return ret;
-- 
1.7.10.4


* [PATCH -mm 05/12] memcg: add pointer from memcg_cache_params to cache
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

While iterating over the memcg_slab_caches list we need to obtain the
cache a particular memcg_params corresponds to, because it is the
memcg_params structures that are linked into the list, not the caches
themselves. Currently, to achieve that we exploit the fact that each
root cache tracks all its child caches in its memcg_caches array, so
that given a memcg_params we can find the memcg cache by doing something
like this:

  root = memcg_params->root_cache;
  memcg = memcg_params->memcg;
  memcg_cache = root->memcg_params->memcg_caches[memcg->kmemcg_id];

Apart from being cumbersome, this is not going to work when reparenting
of memcg caches is introduced. So let's embed a pointer back to the
memcg cache into the memcg_cache_params struct.
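
With the back pointer in place (memcg_cache_params::cachep in the hunk
below), the lookup above collapses to a single dereference:

  memcg_cache = memcg_params->cachep;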

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/slab.h |    2 ++
 mm/memcontrol.c      |   28 +++-------------------------
 2 files changed, 5 insertions(+), 25 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index ee9f1b0382ac..f2fd4212976e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -520,6 +520,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  *
  * Child caches will hold extra metadata needed for its operation. Fields are:
  *
+ * @cachep: back pointer to the memcg cache
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
@@ -536,6 +537,7 @@ struct memcg_cache_params {
 			struct kmem_cache *memcg_caches[0];
 		};
 		struct {
+			struct kmem_cache *cachep;
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 69431f5285cc..5eb629ed28d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2974,19 +2974,6 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 		memcg_kmem_is_active(memcg);
 }
 
-/*
- * This is a bit cumbersome, but it is rarely used and avoids a backpointer
- * in the memcg_cache_params struct.
- */
-static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
-{
-	struct kmem_cache *cachep;
-
-	VM_BUG_ON(p->is_root_cache);
-	cachep = p->root_cache;
-	return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
-}
-
 #ifdef CONFIG_SLABINFO
 static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 {
@@ -3000,7 +2987,7 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_for_each_entry(params, &memcg->memcg_slab_caches, list)
-		cache_show(memcg_params_to_cache(params), m);
+		cache_show(params->cachep, m);
 	mutex_unlock(&memcg->slab_caches_mutex);
 
 	return 0;
@@ -3202,6 +3189,7 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 		return -ENOMEM;
 
 	if (memcg) {
+		s->memcg_params->cachep = s;
 		s->memcg_params->memcg = memcg;
 		s->memcg_params->root_cache = root_cache;
 		INIT_WORK(&s->memcg_params->destroy,
@@ -3249,11 +3237,6 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	smp_wmb();
 
-	/*
-	 * Initialize the pointer to this cache in its parent's memcg_params
-	 * before adding it to the memcg_slab_caches list, otherwise we can
-	 * fail to convert memcg_params_to_cache() while traversing the list.
-	 */
 	VM_BUG_ON(root->memcg_params->memcg_caches[id]);
 	root->memcg_params->memcg_caches[id] = s;
 
@@ -3285,11 +3268,6 @@ void memcg_unregister_cache(struct kmem_cache *s)
 	list_del(&s->memcg_params->list);
 	mutex_unlock(&memcg->slab_caches_mutex);
 
-	/*
-	 * Clear the pointer to this cache in its parent's memcg_params only
-	 * after removing it from the memcg_slab_caches list, otherwise we can
-	 * fail to convert memcg_params_to_cache() while traversing the list.
-	 */
 	VM_BUG_ON(root->memcg_params->memcg_caches[id] != s);
 	root->memcg_params->memcg_caches[id] = NULL;
 }
@@ -3331,7 +3309,7 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 	struct memcg_cache_params *params;
 
 	params = container_of(w, struct memcg_cache_params, destroy);
-	cachep = memcg_params_to_cache(params);
+	cachep = params->cachep;
 
 	if (atomic_read(&params->refcount) != 0) {
 		/*
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread


* [PATCH -mm 06/12] memcg: keep all children of each root cache on a list
  2014-02-26 15:05 ` Vladimir Davydov
@ 2014-02-26 15:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Sometimes we need to iterate over all child caches of a particular root
cache, e.g. when we are destroying it. Currently, each root cache keeps
pointers to its children in its memcg_cache_params::memcg_caches array,
so we can enumerate all active kmemcg ids and dereference the
corresponding array slots to get the child caches. However, this is
going to change when memcg cache reparenting is introduced - only
active (not dead) caches will reside in this array. So let's organize
all child caches of the same root cache into a list on
memcg_cache_params.
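
With the list in place, walking a root cache's children becomes a plain
list traversal. A minimal sketch (root and do_something() are
placeholders; the caller is assumed to hold slab_mutex or otherwise keep
the list stable, as the converted sites below do):

  struct memcg_cache_params *params;

  list_for_each_entry(params, &root->memcg_params->children, siblings)
  	do_something(params->cachep);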

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/memcontrol.h |    2 +-
 include/linux/slab.h       |    3 +++
 mm/memcontrol.c            |   36 +++++++++++++++++++-----------------
 mm/slab.c                  |   38 ++++++++++++++++++++++----------------
 mm/slab_common.c           |   19 +++++++++----------
 mm/slub.c                  |   41 +++++++++++++++++++++++++----------------
 6 files changed, 79 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6128323ea453..b38b52ce59fb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -512,7 +512,7 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
-int kmem_cache_destroy_memcg_children(struct kmem_cache *s);
+void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index f2fd4212976e..8091d009cd72 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -524,6 +524,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
+ * @siblings: list_head for the list of all child caches of the root_cache
  * @refcount: the reference counter; cache destruction will be scheduled when
  *            it reaches zero
  * @destroy: worker to be called whenever we are ready, or believe we may be
@@ -533,6 +534,7 @@ struct memcg_cache_params {
 	bool is_root_cache;
 	union {
 		struct {
+			struct list_head children;
 			struct rcu_head rcu_head;
 			struct kmem_cache *memcg_caches[0];
 		};
@@ -541,6 +543,7 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
+			struct list_head siblings;
 			atomic_t refcount;
 			struct work_struct destroy;
 		};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5eb629ed28d6..6af3c062dfb1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3114,6 +3114,10 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 			return -ENOMEM;
 
 		new_params->is_root_cache = true;
+		INIT_LIST_HEAD(&new_params->children);
+		if (cur_params)
+			list_splice(&cur_params->children,
+				    &new_params->children);
 
 		/*
 		 * There is the chance it will be bigger than
@@ -3196,8 +3200,10 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 				kmem_cache_destroy_work_func);
 		atomic_set(&s->memcg_params->refcount, 1);
 		css_get(&memcg->css);
-	} else
+	} else {
 		s->memcg_params->is_root_cache = true;
+		INIT_LIST_HEAD(&s->memcg_params->children);
+	}
 
 	return 0;
 }
@@ -3237,6 +3243,8 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	smp_wmb();
 
+	list_add(&s->memcg_params->siblings, &root->memcg_params->children);
+
 	VM_BUG_ON(root->memcg_params->memcg_caches[id]);
 	root->memcg_params->memcg_caches[id] = s;
 
@@ -3264,6 +3272,8 @@ void memcg_unregister_cache(struct kmem_cache *s)
 	memcg = s->memcg_params->memcg;
 	id = memcg_cache_id(memcg);
 
+	list_del(&s->memcg_params->siblings);
+
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_del(&s->memcg_params->list);
 	mutex_unlock(&memcg->slab_caches_mutex);
@@ -3326,10 +3336,9 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 	kmem_cache_destroy_memcg(cachep, false);
 }
 
-int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
-	struct kmem_cache *c;
-	int i, failed = 0;
+	struct memcg_cache_params *params, *tmp;
 
 	/*
 	 * Since the cache is being destroyed, it shouldn't be allocated from
@@ -3341,9 +3350,9 @@ int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	flush_workqueue(memcg_cache_create_wq);
 
 	/*
-	 * If the cache is being destroyed, we trust that there is no one else
-	 * requesting objects from it. Even if there are, the sanity checks in
-	 * kmem_cache_destroy should caught this ill-case.
+	 * At this point nobody except us is allowed to create or destroy child
+	 * caches so we don't need to take the slab_mutex for iterating over
+	 * the children list.
 	 *
 	 * Still, we don't want anyone else freeing memcg_caches under our
 	 * noses, which can happen if a new memcg comes to life. As usual,
@@ -3351,17 +3360,10 @@ int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	 * this.
 	 */
 	mutex_lock(&activate_kmem_mutex);
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(s, i);
-		if (!c)
-			continue;
-
-		kmem_cache_destroy_memcg(c, true);
-		if (cache_from_memcg_idx(s, i))
-			failed++;
-	}
+	list_for_each_entry_safe(params, tmp,
+			&s->memcg_params->children, siblings)
+		kmem_cache_destroy_memcg(params->cachep, true);
 	mutex_unlock(&activate_kmem_mutex);
-	return failed;
 }
 
 static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
diff --git a/mm/slab.c b/mm/slab.c
index 9b8950cb64dd..eae95847cdef 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3816,29 +3816,35 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
 	return alloc_kmemlist(cachep, gfp);
 }
 
+static void __do_tune_cpucache_memcg(struct kmem_cache *cachep, int limit,
+				     int batchcount, int shared, gfp_t gfp)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	struct memcg_cache_params *params;
+
+	if (!cachep->memcg_params ||
+	    !cachep->memcg_params->is_root_cache)
+		return;
+
+	lockdep_assert_held(&slab_mutex);
+	list_for_each_entry(params,
+			&cachep->memcg_params->children, siblings)
+		__do_tune_cpucache(params->cachep, limit,
+				   batchcount, shared, gfp);
+#endif
+}
+
 static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 				int batchcount, int shared, gfp_t gfp)
 {
 	int ret;
-	struct kmem_cache *c = NULL;
-	int i = 0;
 
 	ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
-
-	if (slab_state < FULL)
-		return ret;
-
-	if ((ret < 0) || !is_root_cache(cachep))
-		return ret;
-
-	VM_BUG_ON(!mutex_is_locked(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(cachep, i);
-		if (c)
-			/* return value determined by the parent cache only */
-			__do_tune_cpucache(c, limit, batchcount, shared, gfp);
+	if (!ret) {
+		/* return value determined by the parent cache only */
+		__do_tune_cpucache_memcg(cachep, limit,
+					 batchcount, shared, gfp);
 	}
-
 	return ret;
 }
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 05ba3cd1b507..48e472894511 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -335,7 +335,8 @@ static int __kmem_cache_shutdown_memcg(struct kmem_cache *s,
 
 	mutex_unlock(&slab_mutex);
 	if (s->memcg_params->is_root_cache) {
-		rc = kmem_cache_destroy_memcg_children(s);
+		kmem_cache_destroy_memcg_children(s);
+		rc = !list_empty(&s->memcg_params->children);
 	} else {
 		/*
 		 * There might be a destruction work pending, which needs to be
@@ -693,20 +694,17 @@ void slab_stop(struct seq_file *m, void *p)
 static void
 memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 {
-	struct kmem_cache *c;
+#ifdef CONFIG_MEMCG_KMEM
+	struct memcg_cache_params *params;
 	struct slabinfo sinfo;
-	int i;
 
-	if (!is_root_cache(s))
+	if (!s->memcg_params ||
+	    !s->memcg_params->is_root_cache)
 		return;
 
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(s, i);
-		if (!c)
-			continue;
-
+	list_for_each_entry(params, &s->memcg_params->children, siblings) {
 		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
+		get_slabinfo(params->cachep, &sinfo);
 
 		info->active_slabs += sinfo.active_slabs;
 		info->num_slabs += sinfo.num_slabs;
@@ -714,6 +712,7 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 		info->active_objs += sinfo.active_objs;
 		info->num_objs += sinfo.num_objs;
 	}
+#endif
 }
 
 int cache_show(struct kmem_cache *s, struct seq_file *m)
diff --git a/mm/slub.c b/mm/slub.c
index 4a10126fec3a..52864a6cb681 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3740,6 +3740,25 @@ static struct kmem_cache *find_mergeable(size_t size, size_t align,
 	return NULL;
 }
 
+static void memcg_slab_merge(struct kmem_cache *s, size_t size)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	struct kmem_cache *cachep;
+	struct memcg_cache_params *params;
+
+	if (!s->memcg_params)
+		return;
+	BUG_ON(!s->memcg_params->is_root_cache);
+
+	list_for_each_entry(params, &s->memcg_params->children, siblings) {
+		cachep = params->cachep;
+		cachep->object_size = s->object_size;
+		cachep->inuse = max_t(int, cachep->inuse,
+				      ALIGN(size, sizeof(void *)));
+	}
+#endif
+}
+
 struct kmem_cache *
 __kmem_cache_alias(const char *name, size_t size, size_t align,
 		   unsigned long flags, void (*ctor)(void *))
@@ -3748,9 +3767,6 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,
 
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int i;
-		struct kmem_cache *c;
-
 		s->refcount++;
 
 		/*
@@ -3760,14 +3776,7 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,
 		s->object_size = max(s->object_size, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 
-		for_each_memcg_cache_index(i) {
-			c = cache_from_memcg_idx(s, i);
-			if (!c)
-				continue;
-			c->object_size = s->object_size;
-			c->inuse = max_t(int, c->inuse,
-					 ALIGN(size, sizeof(void *)));
-		}
+		memcg_slab_merge(s, size);
 
 		if (sysfs_slab_alias(s, name)) {
 			s->refcount--;
@@ -5027,7 +5036,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 	err = attribute->store(s, buf, len);
 #ifdef CONFIG_MEMCG_KMEM
 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
-		int i;
+		struct memcg_cache_params *params;
 
 		mutex_lock(&slab_mutex);
 		if (s->max_attr_size < len)
@@ -5050,10 +5059,10 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
 		 */
-		for_each_memcg_cache_index(i) {
-			struct kmem_cache *c = cache_from_memcg_idx(s, i);
-			if (c)
-				attribute->store(c, buf, len);
+		if (s->memcg_params) {
+			list_for_each_entry(params,
+					&s->memcg_params->children, siblings)
+				attribute->store(params->cachep, buf, len);
 		}
 		mutex_unlock(&slab_mutex);
 	}
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread


* [PATCH -mm 07/12] memcg: rework slab charging
  2014-02-26 15:05 ` Vladimir Davydov
@ 2014-02-26 15:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Christoph Lameter, Pekka Enberg

Currently kmemcg charging is embedded in the alloc_pages path - if we
get there with the __GFP_KMEMCG bit set, we charge the new page to the
cgroup of the caller. All per-memcg caches have this bit set in their
allocflags, so kmalloc and friends are charged properly.
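
Roughly, the current scheme looks like this from the page allocator's
point of view (a simplified sketch for illustration only, built around
the memcg_kmem_newpage_charge/commit_charge hooks declared in
memcontrol.h):

  /* in the page allocator, only taken if __GFP_KMEMCG is set */
  struct mem_cgroup *memcg;

  if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
  	return NULL;		/* the caller's memcg refused the charge */
  page = ...;			/* the usual page allocation */
  memcg_kmem_commit_charge(page, memcg, order);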

So, what's wrong with it, why should it be reworked?

First, we do some extra work due to this design. We look up the memcg
from the mm, although we already know the cache we are allocating a
page for, so why not simply get it from there? We also record the memcg
a page is charged to in its page_cgroup in order to uncharge it
properly, but again, each slab page holds a reference to its kmem cache
in page->slab_cache, so we could use that instead.

Second, it's racy. If a task changes its cgroup between selecting a
cache to allocate from (memcg_kmem_get_cache) and charging, an object
allocated from one cgroup's cache will be accounted to another cgroup.

And last but not least, we don't have a reliable way to track all kmem
pages accounted to a particular memcg, which makes reparenting
impossible. As a result, each memcg cache has to hold a reference to
its memcg until its death, which is bad.

Since we have only a couple of places where we actually need to charge
kmem pages, why not just call the kmemcg charge/uncharge helpers there,
passing them the slab cache we are allocating from, instead of hooking
into the generic allocation path? That's what this patch does.

Note that this patch does not remove the old charging code - that will
be handled by the following patches.
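
With this patch the charging pattern at the slab page allocation sites
boils down to the following (a condensed sketch of the mm/slab.c hunk
below):

  nr_pages = 1 << cachep->gfporder;
  if (memcg_kmem_charge_slab(cachep, flags, nr_pages))
  	return NULL;

  page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK,
  				cachep->gfporder);
  if (!page) {
  	memcg_kmem_uncharge_slab(cachep, nr_pages);
  	return NULL;
  }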

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
---
 include/linux/memcontrol.h |   44 +++++++++++++++++++++++++++++---------------
 mm/memcontrol.c            |   15 +++++++++++++++
 mm/slab.c                  |    9 +++++++--
 mm/slab_common.c           |    6 +-----
 mm/slub.c                  |   30 +++++++++++++++++++++---------
 5 files changed, 73 insertions(+), 31 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b38b52ce59fb..3f0ff043ba94 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -496,6 +496,10 @@ void __memcg_kmem_commit_charge(struct page *page,
 				       struct mem_cgroup *memcg, int order);
 void __memcg_kmem_uncharge_pages(struct page *page, int order);
 
+struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
+int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages);
+void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages);
+
 int memcg_cache_id(struct mem_cgroup *memcg);
 
 char *memcg_create_cache_name(struct mem_cgroup *memcg,
@@ -509,9 +513,6 @@ void memcg_unregister_cache(struct kmem_cache *s);
 int memcg_update_cache_size(struct kmem_cache *s, int num_groups);
 void memcg_update_array_size(int num_groups);
 
-struct kmem_cache *
-__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
-
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
@@ -587,18 +588,6 @@ memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
  * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
  * @gfp: allocation flags.
- *
- * This function assumes that the task allocating, which determines the memcg
- * in the page allocator, belongs to the same cgroup throughout the whole
- * process.  Misacounting can happen if the task calls memcg_kmem_get_cache()
- * while belonging to a cgroup, and later on changes. This is considered
- * acceptable, and should only happen upon task migration.
- *
- * Before the cache is created by the memcg core, there is also a possible
- * imbalance: the task belongs to a memcg, but the cache being allocated from
- * is the global cache, since the child cache is not yet guaranteed to be
- * ready. This case is also fine, since in this case the GFP_KMEMCG will not be
- * passed and the page allocator will not attempt any cgroup accounting.
  */
 static __always_inline struct kmem_cache *
 memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
@@ -614,6 +603,21 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
+
+static __always_inline int memcg_kmem_charge_slab(struct kmem_cache *s,
+						  gfp_t gfp, int nr_pages)
+{
+	if (memcg_kmem_enabled())
+		return __memcg_kmem_charge_slab(s, gfp, nr_pages);
+	return 0;
+}
+
+static __always_inline void memcg_kmem_uncharge_slab(struct kmem_cache *s,
+						     int nr_pages)
+{
+	if (memcg_kmem_enabled())
+		__memcg_kmem_uncharge_slab(s, nr_pages);
+}
 #else
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
@@ -666,6 +670,16 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 {
 	return cachep;
 }
+
+static inline int memcg_kmem_charge_slab(struct kmem_cache *s,
+					 gfp_t gfp, int nr_pages)
+{
+	return 0;
+}
+
+static inline void memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6af3c062dfb1..d60080812060 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3057,6 +3057,21 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 		css_put(&memcg->css);
 }
 
+int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages)
+{
+	if (is_root_cache(s))
+		return 0;
+	return memcg_charge_kmem(s->memcg_params->memcg,
+				 gfp, nr_pages << PAGE_SHIFT);
+}
+
+void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages)
+{
+	if (is_root_cache(s))
+		return;
+	memcg_uncharge_kmem(s->memcg_params->memcg, nr_pages << PAGE_SHIFT);
+}
+
 /*
  * helper for acessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/slab.c b/mm/slab.c
index eae95847cdef..ef372599eb8b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1664,10 +1664,15 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		flags |= __GFP_RECLAIMABLE;
 
+	nr_pages = (1 << cachep->gfporder);
+	if (memcg_kmem_charge_slab(cachep, flags, nr_pages))
+		return NULL;
+
 	page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
 	if (!page) {
 		if (!(flags & __GFP_NOWARN) && printk_ratelimit())
 			slab_out_of_memory(cachep, flags, nodeid);
+		memcg_kmem_uncharge_slab(cachep, nr_pages);
 		return NULL;
 	}
 
@@ -1675,7 +1680,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 	if (unlikely(page->pfmemalloc))
 		pfmemalloc_active = true;
 
-	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_RECLAIMABLE, nr_pages);
@@ -1724,7 +1728,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 	memcg_release_pages(cachep, cachep->gfporder);
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += nr_freed;
-	__free_memcg_kmem_pages(page, cachep->gfporder);
+	memcg_kmem_uncharge_slab(cachep, nr_freed);
+	__free_pages(page, cachep->gfporder);
 }
 
 static void kmem_rcu_free(struct rcu_head *head)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 48e472894511..22e48d000b1d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -290,12 +290,8 @@ void kmem_cache_create_memcg(struct mem_cgroup *memcg, struct kmem_cache *root_c
 				 root_cache->size, root_cache->align,
 				 root_cache->flags, root_cache->ctor,
 				 memcg, root_cache);
-	if (IS_ERR(s)) {
+	if (IS_ERR(s))
 		kfree(cache_name);
-		goto out_unlock;
-	}
-
-	s->allocflags |= __GFP_KMEMCG;
 
 out_unlock:
 	mutex_unlock(&slab_mutex);
diff --git a/mm/slub.c b/mm/slub.c
index 52864a6cb681..fa995823de60 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1328,8 +1328,9 @@ static inline struct page *alloc_slab_page(gfp_t flags, int node,
 
 static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
-	struct page *page;
+	struct page *page = NULL;
 	struct kmem_cache_order_objects oo = s->oo;
+	int pages = 1 << oo_order(oo);
 	gfp_t alloc_gfp;
 
 	flags &= gfp_allowed_mask;
@@ -1345,23 +1346,33 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 */
 	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
 
+	if (memcg_kmem_charge_slab(s, alloc_gfp, pages))
+		goto out;
+
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
+		int charged = pages;
+
 		oo = s->min;
+		pages = 1 << oo_order(oo);
 		/*
 		 * Allocation may have failed due to fragmentation.
 		 * Try a lower order alloc if possible
 		 */
 		page = alloc_slab_page(flags, node, oo);
+		if (!page) {
+			memcg_kmem_uncharge_slab(s, charged);
+			goto out;
+		}
 
-		if (page)
-			stat(s, ORDER_FALLBACK);
-	}
+		VM_BUG_ON(charged <= pages);
+		memcg_kmem_uncharge_slab(s, charged - pages);
 
-	if (kmemcheck_enabled && page
-		&& !(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
-		int pages = 1 << oo_order(oo);
+		stat(s, ORDER_FALLBACK);
+	}
 
+	if (kmemcheck_enabled &&
+	    !(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
 		kmemcheck_alloc_shadow(page, oo_order(oo), flags, node);
 
 		/*
@@ -1373,7 +1384,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 		else
 			kmemcheck_mark_unallocated_pages(page, pages);
 	}
-
+out:
 	if (flags & __GFP_WAIT)
 		local_irq_disable();
 	if (!page)
@@ -1468,7 +1479,8 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 	page_mapcount_reset(page);
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += pages;
-	__free_memcg_kmem_pages(page, order);
+	memcg_kmem_uncharge_slab(s, pages);
+	__free_pages(page, order);
 }
 
 #define need_reserve_slab_rcu						\
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread


* [PATCH -mm 08/12] memcg: do not charge kmalloc_large allocations
  2014-02-26 15:05 ` Vladimir Davydov
@ 2014-02-26 15:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Christoph Lameter, Pekka Enberg

We don't have a way to track kmalloc_large allocations, so charging
them makes kmemcg reparenting impossible. Since such allocations are
rare and can't be massively triggered from userspace, let's just ignore
them.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
---
 include/linux/slab.h |    2 +-
 mm/slub.c            |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 8091d009cd72..29dbf6f2fd3a 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -364,7 +364,7 @@ kmalloc_order(size_t size, gfp_t flags, unsigned int order)
 {
 	void *ret;
 
-	flags |= (__GFP_COMP | __GFP_KMEMCG);
+	flags |= __GFP_COMP;
 	ret = (void *) __get_free_pages(flags, order);
 	kmemleak_alloc(ret, size, 1, flags);
 	return ret;
diff --git a/mm/slub.c b/mm/slub.c
index fa995823de60..d8b8659bfa64 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3332,7 +3332,7 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 	struct page *page;
 	void *ptr = NULL;
 
-	flags |= __GFP_COMP | __GFP_NOTRACK | __GFP_KMEMCG;
+	flags |= __GFP_COMP | __GFP_NOTRACK;
 	page = alloc_pages_node(node, flags, get_order(size));
 	if (page)
 		ptr = page_address(page);
@@ -3402,7 +3402,7 @@ void kfree(const void *x)
 	if (unlikely(!PageSlab(page))) {
 		BUG_ON(!PageCompound(page));
 		kfree_hook(x);
-		__free_memcg_kmem_pages(page, compound_order(page));
+		__free_pages(page, compound_order(page));
 		return;
 	}
 	slab_free(page->slab_cache, page, object, _RET_IP_);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread


* [PATCH -mm 09/12] fork: do not charge thread_info to kmemcg
  2014-02-26 15:05 ` Vladimir Davydov
@ 2014-02-26 15:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Frederic Weisbecker

This patch reverts 2ad306b17c0a ("fork: protect architectures where
THREAD_SIZE >= PAGE_SIZE against fork bombs").

The reasoning behind this is that charging thread_info is the last piece
that prevents us from reparenting kmemcg on css offline. We can't
reliably track all thread_info pages accounted to a particular cgroup,
because (a) they are freed in __put_task_struct() and (b) on exit tasks
are moved to the root cgroup. In other words, given a cgroup there is no
sane way to find all tasks (including zombies) that charged thread_info
to it. Of course, we could uncharge thread_info on task exit, but that
wouldn't help us against fork bombs. So revert and forget about this.
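
For illustration, a heavily trimmed sketch of the free path in question
(the real chain lives in kernel/fork.c; the function name here is made
up, only the shape is real):

static void release_dead_task_sketch(struct task_struct *tsk)
{
	/*
	 * (b) exit has already moved the task to the root cgroup, so tsk
	 * no longer tells us which memcg the thread_info was charged to.
	 */
	free_thread_info(tsk->stack);	/* (a) final free, reached from __put_task_struct() */
}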

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
---
 include/linux/thread_info.h |    2 --
 kernel/fork.c               |    4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index fddbe2023a5d..1807bb194816 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -61,8 +61,6 @@ extern long do_no_restart_syscall(struct restart_block *parm);
 # define THREADINFO_GFP		(GFP_KERNEL | __GFP_NOTRACK)
 #endif
 
-#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
-
 /*
  * flag set/clear/test wrappers
  * - pass TIF_xxxx constants to these functions
diff --git a/kernel/fork.c b/kernel/fork.c
index b41022f5d307..2431645d4c91 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -148,7 +148,7 @@ void __weak arch_release_thread_info(struct thread_info *ti)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
-	struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
+	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 					     THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
@@ -156,7 +156,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 
 static inline void free_thread_info(struct thread_info *ti)
 {
-	free_memcg_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+	free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH -mm 09/12] fork: do not charge thread_info to kmemcg
@ 2014-02-26 15:05   ` Vladimir Davydov
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Frederic Weisbecker

This patch reverts 2ad306b17c0a ("fork: protect architectures where
THREAD_SIZE >= PAGE_SIZE against fork bombs").

The reasoning behind this is that charging thread_info is the last piece
that prevents us from reparenting kmemcg on css offline. We can't
reliably track all thread_info pages accounted to a particular cgroup,
because (a) they are freed in __put_task_struct() and (b) on exit tasks
are moved to the root cgroup. In other words, given a cgroup there is no
sane way to find all tasks (including zombies) that charged thread_info
to it. Of course, we could uncharge thread_info on task exit, but that
wouldn't help us against fork bombs. So revert and forget about this.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
---
 include/linux/thread_info.h |    2 --
 kernel/fork.c               |    4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index fddbe2023a5d..1807bb194816 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -61,8 +61,6 @@ extern long do_no_restart_syscall(struct restart_block *parm);
 # define THREADINFO_GFP		(GFP_KERNEL | __GFP_NOTRACK)
 #endif
 
-#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
-
 /*
  * flag set/clear/test wrappers
  * - pass TIF_xxxx constants to these functions
diff --git a/kernel/fork.c b/kernel/fork.c
index b41022f5d307..2431645d4c91 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -148,7 +148,7 @@ void __weak arch_release_thread_info(struct thread_info *ti)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
-	struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
+	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 					     THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
@@ -156,7 +156,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 
 static inline void free_thread_info(struct thread_info *ti)
 {
-	free_memcg_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+	free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH -mm 10/12] memcg: kill GFP_KMEMCG and stuff
  2014-02-26 15:05 ` Vladimir Davydov
@ 2014-02-26 15:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Nothing uses __GFP_KMEMCG or the page charging helpers built around it
any more. Just get rid of them.
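
As a rough caller-side sketch of what accounting looks like with the
flag gone (the function name is made up; it simply wraps the per-cache
hook introduced earlier in the series), charging is now driven by the
cache that grows rather than by a GFP flag passed at allocation time:

static int charge_new_slab_pages_sketch(struct kmem_cache *s, gfp_t gfp, int order)
{
	if (!memcg_kmem_enabled())
		return 0;

	/* per-cache hook; paired with __memcg_kmem_uncharge_slab() on release */
	return __memcg_kmem_charge_slab(s, gfp, 1 << order);
}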

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/gfp.h             |    5 --
 include/linux/memcontrol.h      |   90 ----------------------------
 include/trace/events/gfpflags.h |    1 -
 mm/memcontrol.c                 |  125 ---------------------------------------
 mm/page_alloc.c                 |   35 -----------
 5 files changed, 256 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 39b81dc7d01a..e37b662cd869 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -31,7 +31,6 @@ struct vm_area_struct;
 #define ___GFP_HARDWALL		0x20000u
 #define ___GFP_THISNODE		0x40000u
 #define ___GFP_RECLAIMABLE	0x80000u
-#define ___GFP_KMEMCG		0x100000u
 #define ___GFP_NOTRACK		0x200000u
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
@@ -91,7 +90,6 @@ struct vm_area_struct;
 
 #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
-#define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 
 /*
@@ -372,9 +370,6 @@ extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_hot_cold_page(struct page *page, int cold);
 extern void free_hot_cold_page_list(struct list_head *list, int cold);
 
-extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
-extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
-
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3f0ff043ba94..f6fa5726b1a7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -490,12 +490,6 @@ static inline bool memcg_kmem_enabled(void)
  * conditions, but because they are pretty simple, they are expected to be
  * fast.
  */
-bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
-					int order);
-void __memcg_kmem_commit_charge(struct page *page,
-				       struct mem_cgroup *memcg, int order);
-void __memcg_kmem_uncharge_pages(struct page *page, int order);
-
 struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages);
 void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages);
@@ -516,75 +510,6 @@ void memcg_update_array_size(int num_groups);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
- * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
- * @gfp: the gfp allocation flags.
- * @memcg: a pointer to the memcg this was charged against.
- * @order: allocation order.
- *
- * returns true if the memcg where the current task belongs can hold this
- * allocation.
- *
- * We return true automatically if this allocation is not to be accounted to
- * any memcg.
- */
-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
-{
-	if (!memcg_kmem_enabled())
-		return true;
-
-	/*
-	 * __GFP_NOFAIL allocations will move on even if charging is not
-	 * possible. Therefore we don't even try, and have this allocation
-	 * unaccounted. We could in theory charge it with
-	 * res_counter_charge_nofail, but we hope those allocations are rare,
-	 * and won't be worth the trouble.
-	 */
-	if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
-		return true;
-	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
-		return true;
-
-	/* If the test is dying, just let it go. */
-	if (unlikely(fatal_signal_pending(current)))
-		return true;
-
-	return __memcg_kmem_newpage_charge(gfp, memcg, order);
-}
-
-/**
- * memcg_kmem_uncharge_pages: uncharge pages from memcg
- * @page: pointer to struct page being freed
- * @order: allocation order.
- *
- * there is no need to specify memcg here, since it is embedded in page_cgroup
- */
-static inline void
-memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge_pages(page, order);
-}
-
-/**
- * memcg_kmem_commit_charge: embeds correct memcg in a page
- * @page: pointer to struct page recently allocated
- * @memcg: the memcg structure we charged against
- * @order: allocation order.
- *
- * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
- * failure of the allocation. if @page is NULL, this function will revert the
- * charges. Otherwise, it will commit the memcg given by @memcg to the
- * corresponding page_cgroup.
- */
-static inline void
-memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
-{
-	if (memcg_kmem_enabled() && memcg)
-		__memcg_kmem_commit_charge(page, memcg, order);
-}
-
-/**
  * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
  * @gfp: allocation flags.
@@ -627,21 +552,6 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
-{
-	return true;
-}
-
-static inline void memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-}
-
-static inline void
-memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
-{
-}
-
 static inline int memcg_cache_id(struct mem_cgroup *memcg)
 {
 	return -1;
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 1eddbf1557f2..d6fd8e5b14b7 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -34,7 +34,6 @@
 	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
 	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_KMEMCG,		"GFP_KMEMCG"},		\
 	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"},		\
 	{(unsigned long)__GFP_NOTRACK,		"GFP_NOTRACK"},		\
 	{(unsigned long)__GFP_NO_KSWAPD,	"GFP_NO_KSWAPD"},	\
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d60080812060..dad455a911d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3531,131 +3531,6 @@ out:
 	rcu_read_unlock();
 	return cachep;
 }
-EXPORT_SYMBOL(__memcg_kmem_get_cache);
-
-/*
- * We need to verify if the allocation against current->mm->owner's memcg is
- * possible for the given order. But the page is not allocated yet, so we'll
- * need a further commit step to do the final arrangements.
- *
- * It is possible for the task to switch cgroups in this mean time, so at
- * commit time, we can't rely on task conversion any longer.  We'll then use
- * the handle argument to return to the caller which cgroup we should commit
- * against. We could also return the memcg directly and avoid the pointer
- * passing, but a boolean return value gives better semantics considering
- * the compiled-out case as well.
- *
- * Returning true means the allocation is possible.
- */
-bool
-__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
-{
-	struct mem_cgroup *memcg;
-	int ret;
-
-	*_memcg = NULL;
-
-	/*
-	 * Disabling accounting is only relevant for some specific memcg
-	 * internal allocations. Therefore we would initially not have such
-	 * check here, since direct calls to the page allocator that are marked
-	 * with GFP_KMEMCG only happen outside memcg core. We are mostly
-	 * concerned with cache allocations, and by having this test at
-	 * memcg_kmem_get_cache, we are already able to relay the allocation to
-	 * the root cache and bypass the memcg cache altogether.
-	 *
-	 * There is one exception, though: the SLUB allocator does not create
-	 * large order caches, but rather service large kmallocs directly from
-	 * the page allocator. Therefore, the following sequence when backed by
-	 * the SLUB allocator:
-	 *
-	 *	memcg_stop_kmem_account();
-	 *	kmalloc(<large_number>)
-	 *	memcg_resume_kmem_account();
-	 *
-	 * would effectively ignore the fact that we should skip accounting,
-	 * since it will drive us directly to this function without passing
-	 * through the cache selector memcg_kmem_get_cache. Such large
-	 * allocations are extremely rare but can happen, for instance, for the
-	 * cache arrays. We bring this test here.
-	 */
-	if (!current->mm || current->memcg_kmem_skip_account)
-		return true;
-
-	memcg = try_get_mem_cgroup_from_mm(current->mm);
-
-	/*
-	 * very rare case described in mem_cgroup_from_task. Unfortunately there
-	 * isn't much we can do without complicating this too much, and it would
-	 * be gfp-dependent anyway. Just let it go
-	 */
-	if (unlikely(!memcg))
-		return true;
-
-	if (!memcg_can_account_kmem(memcg)) {
-		css_put(&memcg->css);
-		return true;
-	}
-
-	ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
-	if (!ret)
-		*_memcg = memcg;
-
-	css_put(&memcg->css);
-	return (ret == 0);
-}
-
-void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      int order)
-{
-	struct page_cgroup *pc;
-
-	VM_BUG_ON(mem_cgroup_is_root(memcg));
-
-	/* The page allocation failed. Revert */
-	if (!page) {
-		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
-		return;
-	}
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
-	unlock_page_cgroup(pc);
-}
-
-void __memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Fast unlocked return. Theoretically might have changed, have to
-	 * check again after locking.
-	 */
-	if (!PageCgroupUsed(pc))
-		return;
-
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * We trust that only if there is a memcg associated with the page, it
-	 * is a valid allocation
-	 */
-	if (!memcg)
-		return;
-
-	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
-}
 
 static void __init memcg_kmem_init(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d6892c3527d..a7acceb00770 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2717,7 +2717,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
-	struct mem_cgroup *memcg = NULL;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2736,13 +2735,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
-	/*
-	 * Will only have any effect when __GFP_KMEMCG is set.  This is
-	 * verified in the (always inline) callee
-	 */
-	if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
-		return NULL;
-
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
@@ -2785,8 +2777,6 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
-	memcg_kmem_commit_charge(page, memcg, order);
-
 	if (page)
 		set_page_owner(page, order, gfp_mask);
 
@@ -2842,31 +2832,6 @@ void free_pages(unsigned long addr, unsigned int order)
 
 EXPORT_SYMBOL(free_pages);
 
-/*
- * __free_memcg_kmem_pages and free_memcg_kmem_pages will free
- * pages allocated with __GFP_KMEMCG.
- *
- * Those pages are accounted to a particular memcg, embedded in the
- * corresponding page_cgroup. To avoid adding a hit in the allocator to search
- * for that information only to find out that it is NULL for users who have no
- * interest in that whatsoever, we provide these functions.
- *
- * The caller knows better which flags it relies on.
- */
-void __free_memcg_kmem_pages(struct page *page, unsigned int order)
-{
-	memcg_kmem_uncharge_pages(page, order);
-	__free_pages(page, order);
-}
-
-void free_memcg_kmem_pages(unsigned long addr, unsigned int order)
-{
-	if (addr != 0) {
-		VM_BUG_ON(!virt_addr_valid((void *)addr));
-		__free_memcg_kmem_pages(virt_to_page((void *)addr), order);
-	}
-}
-
 static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
 {
 	if (addr) {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH -mm 10/12] memcg: kill GFP_KMEMCG and stuff
@ 2014-02-26 15:05   ` Vladimir Davydov
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Nothing uses __GFP_KMEMCG or the page charging helpers built around it
any more. Just get rid of them.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/gfp.h             |    5 --
 include/linux/memcontrol.h      |   90 ----------------------------
 include/trace/events/gfpflags.h |    1 -
 mm/memcontrol.c                 |  125 ---------------------------------------
 mm/page_alloc.c                 |   35 -----------
 5 files changed, 256 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 39b81dc7d01a..e37b662cd869 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -31,7 +31,6 @@ struct vm_area_struct;
 #define ___GFP_HARDWALL		0x20000u
 #define ___GFP_THISNODE		0x40000u
 #define ___GFP_RECLAIMABLE	0x80000u
-#define ___GFP_KMEMCG		0x100000u
 #define ___GFP_NOTRACK		0x200000u
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
@@ -91,7 +90,6 @@ struct vm_area_struct;
 
 #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
-#define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 
 /*
@@ -372,9 +370,6 @@ extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_hot_cold_page(struct page *page, int cold);
 extern void free_hot_cold_page_list(struct list_head *list, int cold);
 
-extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
-extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
-
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3f0ff043ba94..f6fa5726b1a7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -490,12 +490,6 @@ static inline bool memcg_kmem_enabled(void)
  * conditions, but because they are pretty simple, they are expected to be
  * fast.
  */
-bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
-					int order);
-void __memcg_kmem_commit_charge(struct page *page,
-				       struct mem_cgroup *memcg, int order);
-void __memcg_kmem_uncharge_pages(struct page *page, int order);
-
 struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages);
 void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages);
@@ -516,75 +510,6 @@ void memcg_update_array_size(int num_groups);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
- * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
- * @gfp: the gfp allocation flags.
- * @memcg: a pointer to the memcg this was charged against.
- * @order: allocation order.
- *
- * returns true if the memcg where the current task belongs can hold this
- * allocation.
- *
- * We return true automatically if this allocation is not to be accounted to
- * any memcg.
- */
-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
-{
-	if (!memcg_kmem_enabled())
-		return true;
-
-	/*
-	 * __GFP_NOFAIL allocations will move on even if charging is not
-	 * possible. Therefore we don't even try, and have this allocation
-	 * unaccounted. We could in theory charge it with
-	 * res_counter_charge_nofail, but we hope those allocations are rare,
-	 * and won't be worth the trouble.
-	 */
-	if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
-		return true;
-	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
-		return true;
-
-	/* If the test is dying, just let it go. */
-	if (unlikely(fatal_signal_pending(current)))
-		return true;
-
-	return __memcg_kmem_newpage_charge(gfp, memcg, order);
-}
-
-/**
- * memcg_kmem_uncharge_pages: uncharge pages from memcg
- * @page: pointer to struct page being freed
- * @order: allocation order.
- *
- * there is no need to specify memcg here, since it is embedded in page_cgroup
- */
-static inline void
-memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge_pages(page, order);
-}
-
-/**
- * memcg_kmem_commit_charge: embeds correct memcg in a page
- * @page: pointer to struct page recently allocated
- * @memcg: the memcg structure we charged against
- * @order: allocation order.
- *
- * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
- * failure of the allocation. if @page is NULL, this function will revert the
- * charges. Otherwise, it will commit the memcg given by @memcg to the
- * corresponding page_cgroup.
- */
-static inline void
-memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
-{
-	if (memcg_kmem_enabled() && memcg)
-		__memcg_kmem_commit_charge(page, memcg, order);
-}
-
-/**
  * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
  * @gfp: allocation flags.
@@ -627,21 +552,6 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
-{
-	return true;
-}
-
-static inline void memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-}
-
-static inline void
-memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
-{
-}
-
 static inline int memcg_cache_id(struct mem_cgroup *memcg)
 {
 	return -1;
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 1eddbf1557f2..d6fd8e5b14b7 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -34,7 +34,6 @@
 	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
 	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_KMEMCG,		"GFP_KMEMCG"},		\
 	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"},		\
 	{(unsigned long)__GFP_NOTRACK,		"GFP_NOTRACK"},		\
 	{(unsigned long)__GFP_NO_KSWAPD,	"GFP_NO_KSWAPD"},	\
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d60080812060..dad455a911d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3531,131 +3531,6 @@ out:
 	rcu_read_unlock();
 	return cachep;
 }
-EXPORT_SYMBOL(__memcg_kmem_get_cache);
-
-/*
- * We need to verify if the allocation against current->mm->owner's memcg is
- * possible for the given order. But the page is not allocated yet, so we'll
- * need a further commit step to do the final arrangements.
- *
- * It is possible for the task to switch cgroups in this mean time, so at
- * commit time, we can't rely on task conversion any longer.  We'll then use
- * the handle argument to return to the caller which cgroup we should commit
- * against. We could also return the memcg directly and avoid the pointer
- * passing, but a boolean return value gives better semantics considering
- * the compiled-out case as well.
- *
- * Returning true means the allocation is possible.
- */
-bool
-__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
-{
-	struct mem_cgroup *memcg;
-	int ret;
-
-	*_memcg = NULL;
-
-	/*
-	 * Disabling accounting is only relevant for some specific memcg
-	 * internal allocations. Therefore we would initially not have such
-	 * check here, since direct calls to the page allocator that are marked
-	 * with GFP_KMEMCG only happen outside memcg core. We are mostly
-	 * concerned with cache allocations, and by having this test at
-	 * memcg_kmem_get_cache, we are already able to relay the allocation to
-	 * the root cache and bypass the memcg cache altogether.
-	 *
-	 * There is one exception, though: the SLUB allocator does not create
-	 * large order caches, but rather service large kmallocs directly from
-	 * the page allocator. Therefore, the following sequence when backed by
-	 * the SLUB allocator:
-	 *
-	 *	memcg_stop_kmem_account();
-	 *	kmalloc(<large_number>)
-	 *	memcg_resume_kmem_account();
-	 *
-	 * would effectively ignore the fact that we should skip accounting,
-	 * since it will drive us directly to this function without passing
-	 * through the cache selector memcg_kmem_get_cache. Such large
-	 * allocations are extremely rare but can happen, for instance, for the
-	 * cache arrays. We bring this test here.
-	 */
-	if (!current->mm || current->memcg_kmem_skip_account)
-		return true;
-
-	memcg = try_get_mem_cgroup_from_mm(current->mm);
-
-	/*
-	 * very rare case described in mem_cgroup_from_task. Unfortunately there
-	 * isn't much we can do without complicating this too much, and it would
-	 * be gfp-dependent anyway. Just let it go
-	 */
-	if (unlikely(!memcg))
-		return true;
-
-	if (!memcg_can_account_kmem(memcg)) {
-		css_put(&memcg->css);
-		return true;
-	}
-
-	ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
-	if (!ret)
-		*_memcg = memcg;
-
-	css_put(&memcg->css);
-	return (ret == 0);
-}
-
-void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      int order)
-{
-	struct page_cgroup *pc;
-
-	VM_BUG_ON(mem_cgroup_is_root(memcg));
-
-	/* The page allocation failed. Revert */
-	if (!page) {
-		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
-		return;
-	}
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
-	unlock_page_cgroup(pc);
-}
-
-void __memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Fast unlocked return. Theoretically might have changed, have to
-	 * check again after locking.
-	 */
-	if (!PageCgroupUsed(pc))
-		return;
-
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * We trust that only if there is a memcg associated with the page, it
-	 * is a valid allocation
-	 */
-	if (!memcg)
-		return;
-
-	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
-}
 
 static void __init memcg_kmem_init(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d6892c3527d..a7acceb00770 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2717,7 +2717,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
-	struct mem_cgroup *memcg = NULL;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2736,13 +2735,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
-	/*
-	 * Will only have any effect when __GFP_KMEMCG is set.  This is
-	 * verified in the (always inline) callee
-	 */
-	if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
-		return NULL;
-
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
@@ -2785,8 +2777,6 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
-	memcg_kmem_commit_charge(page, memcg, order);
-
 	if (page)
 		set_page_owner(page, order, gfp_mask);
 
@@ -2842,31 +2832,6 @@ void free_pages(unsigned long addr, unsigned int order)
 
 EXPORT_SYMBOL(free_pages);
 
-/*
- * __free_memcg_kmem_pages and free_memcg_kmem_pages will free
- * pages allocated with __GFP_KMEMCG.
- *
- * Those pages are accounted to a particular memcg, embedded in the
- * corresponding page_cgroup. To avoid adding a hit in the allocator to search
- * for that information only to find out that it is NULL for users who have no
- * interest in that whatsoever, we provide these functions.
- *
- * The caller knows better which flags it relies on.
- */
-void __free_memcg_kmem_pages(struct page *page, unsigned int order)
-{
-	memcg_kmem_uncharge_pages(page, order);
-	__free_pages(page, order);
-}
-
-void free_memcg_kmem_pages(unsigned long addr, unsigned int order)
-{
-	if (addr != 0) {
-		VM_BUG_ON(!virt_addr_valid((void *)addr));
-		__free_memcg_kmem_pages(virt_to_page((void *)addr), order);
-	}
-}
-
 static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
 {
 	if (addr) {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH -mm 11/12] memcg: reparent slab on css offline
  2014-02-26 15:05 ` Vladimir Davydov
@ 2014-02-26 15:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Currently we take a css reference for each memcg cache. This is simple,
but extremely ugly - a memcg will hang around after its death for as
long as it has any charges. Moreover, this clashes with the design
principles the cgroup subsystem implies.

However, nothing prevents us from reparenting kmem charges just like we
do with user pages. It is also much easier to implement: we already keep
all memcg caches on a list, so all we have to do is walk over the list
and move the caches to the parent cgroup's list, changing the memcg ptr
along the way. If somebody frees an object to a cache that is being
reparented, they might see a pointer to the old memcg, but that's OK -
we only need RCU to protect against use-after-free. Let's just do it.
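
The only subtle part is the reader side, sketched below (a simplified
copy of the try_get_mem_cgroup_from_slab() hunk further down; the
function name is made up): since the memcg pointer may be switched to
the parent at any moment, readers pin it under rcu_read_lock() and
retry css_tryget() until they either get a reference or see the pointer
cleared.

static struct mem_cgroup *pin_slab_memcg_sketch(struct kmem_cache *s)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	do {
		memcg = s->memcg_params->memcg;
		if (!memcg)
			break;	/* cache now belongs to the root cgroup */
	} while (!css_tryget(&memcg->css));
	rcu_read_unlock();

	return memcg;
}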

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 mm/memcontrol.c |  306 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 189 insertions(+), 117 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dad455a911d5..05bde78e14f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -383,11 +383,20 @@ struct mem_cgroup {
 /* internal only representation about the status of kmem accounting. */
 enum {
 	KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
-	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_ACCOUNTED_REPARENTED, /* has reparented caches */
 };
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
+/*
+ * mem_cgroup::slab_caches_mutex nesting subclasses:
+ */
+enum memcg_slab_mutex_class
+{
+	MEMCG_SLAB_MUTEX_PARENT,
+	MEMCG_SLAB_MUTEX_CHILD,
+};
+
+static void memcg_kmem_set_active(struct mem_cgroup *memcg)
 {
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -397,21 +406,14 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
+static void memcg_kmem_set_reparented(struct mem_cgroup *memcg)
 {
-	/*
-	 * Our caller must use css_get() first, because memcg_uncharge_kmem()
-	 * will call css_put() if it sees the memcg is dead.
-	 */
-	smp_wmb();
-	if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
-		set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
+	set_bit(KMEM_ACCOUNTED_REPARENTED, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
+static bool memcg_kmem_is_reparented(struct mem_cgroup *memcg)
 {
-	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
-				  &memcg->kmem_account_flags);
+	return test_bit(KMEM_ACCOUNTED_REPARENTED, &memcg->kmem_account_flags);
 }
 #endif
 
@@ -656,11 +658,6 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 		static_key_slow_dec(&memcg_kmem_enabled_key);
 		ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id);
 	}
-	/*
-	 * This check can't live in kmem destruction function,
-	 * since the charges will outlive the cgroup
-	 */
-	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -3040,36 +3037,48 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 	res_counter_uncharge(&memcg->res, size);
 	if (do_swap_account)
 		res_counter_uncharge(&memcg->memsw, size);
+	res_counter_uncharge(&memcg->kmem, size);
+}
 
-	/* Not down to 0 */
-	if (res_counter_uncharge(&memcg->kmem, size))
-		return;
+static struct mem_cgroup *try_get_mem_cgroup_from_slab(struct kmem_cache *s)
+{
+	struct mem_cgroup *memcg;
 
-	/*
-	 * Releases a reference taken in kmem_cgroup_css_offline in case
-	 * this last uncharge is racing with the offlining code or it is
-	 * outliving the memcg existence.
-	 *
-	 * The memory barrier imposed by test&clear is paired with the
-	 * explicit one in memcg_kmem_mark_dead().
-	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
+	if (is_root_cache(s))
+		return NULL;
+
+	rcu_read_lock();
+	do {
+		memcg = s->memcg_params->memcg;
+		if (!memcg)
+			break;
+	} while (!css_tryget(&memcg->css));
+	rcu_read_unlock();
+	return memcg;
 }
 
 int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages)
 {
-	if (is_root_cache(s))
-		return 0;
-	return memcg_charge_kmem(s->memcg_params->memcg,
-				 gfp, nr_pages << PAGE_SHIFT);
+	struct mem_cgroup *memcg;
+	int ret = 0;
+
+	memcg = try_get_mem_cgroup_from_slab(s);
+	if (memcg) {
+		ret = memcg_charge_kmem(memcg, gfp, nr_pages << PAGE_SHIFT);
+		css_put(&memcg->css);
+	}
+	return ret;
 }
 
 void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages)
 {
-	if (is_root_cache(s))
-		return;
-	memcg_uncharge_kmem(s->memcg_params->memcg, nr_pages << PAGE_SHIFT);
+	struct mem_cgroup *memcg;
+
+	memcg = try_get_mem_cgroup_from_slab(s);
+	if (memcg) {
+		memcg_uncharge_kmem(memcg, nr_pages << PAGE_SHIFT);
+		css_put(&memcg->css);
+	}
 }
 
 /*
@@ -3214,7 +3223,6 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 		INIT_WORK(&s->memcg_params->destroy,
 				kmem_cache_destroy_work_func);
 		atomic_set(&s->memcg_params->refcount, 1);
-		css_get(&memcg->css);
 	} else {
 		s->memcg_params->is_root_cache = true;
 		INIT_LIST_HEAD(&s->memcg_params->children);
@@ -3225,19 +3233,36 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 
 void memcg_free_cache_params(struct kmem_cache *s)
 {
-	if (!s->memcg_params)
-		return;
-	if (!s->memcg_params->is_root_cache)
-		css_put(&s->memcg_params->memcg->css);
 	kfree(s->memcg_params);
 }
 
-void memcg_register_cache(struct kmem_cache *s)
+static void __memcg_register_cache(struct kmem_cache *s, struct kmem_cache *root)
 {
-	struct kmem_cache *root;
 	struct mem_cgroup *memcg;
 	int id;
 
+	memcg = s->memcg_params->memcg;
+	/*
+	 * Special case: re-registering the cache on __kmem_cache_shutdown()
+	 * failure (see __kmem_cache_destroy()).
+	 */
+	if (!memcg)
+		return;
+
+	id = memcg_cache_id(memcg);
+	BUG_ON(id < 0);
+
+	mutex_lock(&memcg->slab_caches_mutex);
+	BUG_ON(root->memcg_params->memcg_caches[id]);
+	root->memcg_params->memcg_caches[id] = s;
+	list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
+	mutex_unlock(&memcg->slab_caches_mutex);
+}
+
+void memcg_register_cache(struct kmem_cache *s)
+{
+	struct kmem_cache *root;
+
 	if (is_root_cache(s))
 		return;
 
@@ -3247,10 +3272,6 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	lockdep_assert_held(&slab_mutex);
 
-	root = s->memcg_params->root_cache;
-	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
-
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
@@ -3258,21 +3279,61 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	smp_wmb();
 
+	root = s->memcg_params->root_cache;
 	list_add(&s->memcg_params->siblings, &root->memcg_params->children);
+	__memcg_register_cache(s, root);
 
-	VM_BUG_ON(root->memcg_params->memcg_caches[id]);
-	root->memcg_params->memcg_caches[id] = s;
+	static_key_slow_inc(&memcg_kmem_enabled_key);
+}
+
+static void __memcg_unregister_cache(struct kmem_cache *s, struct kmem_cache *root)
+{
+	struct mem_cgroup *memcg;
+	int id;
+
+retry:
+	memcg = try_get_mem_cgroup_from_slab(s);
+
+	/*
+	 * This can happen if the cache's memcg was taken offline and the cache
+	 * was reparented to the root cgroup. In this case the cache must have
+	 * already been properly unregistered, so we have nothing to do.
+	 */
+	if (!memcg)
+		return;
 
+	id = memcg_cache_id(memcg);
+
+	/*
+	 * To delete a cache from memcg_slab_caches list, we need to take the
+	 * corresponding slab_caches_mutex. Since nothing prevents the cache
+	 * from being reparented while we are here, we recheck the cache's
+	 * memcg after taking the mutex and retry if it changed.
+	 */
 	mutex_lock(&memcg->slab_caches_mutex);
-	list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
+	if (memcg != s->memcg_params->memcg) {
+		mutex_unlock(&memcg->slab_caches_mutex);
+		css_put(&memcg->css);
+		goto retry;
+	}
+
+	list_del(&s->memcg_params->list);
+	s->memcg_params->memcg = NULL;
+
+	/*
+	 * Clear the slot in the memcg_caches array only if the cache hasn't
+	 * been reparented before.
+	 */
+	if (id >= 0 && root->memcg_params->memcg_caches[id] == s)
+		root->memcg_params->memcg_caches[id] = NULL;
+
 	mutex_unlock(&memcg->slab_caches_mutex);
+	css_put(&memcg->css);
 }
 
 void memcg_unregister_cache(struct kmem_cache *s)
 {
 	struct kmem_cache *root;
-	struct mem_cgroup *memcg;
-	int id;
 
 	if (is_root_cache(s))
 		return;
@@ -3284,17 +3345,10 @@ void memcg_unregister_cache(struct kmem_cache *s)
 	lockdep_assert_held(&slab_mutex);
 
 	root = s->memcg_params->root_cache;
-	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
-
 	list_del(&s->memcg_params->siblings);
+	__memcg_unregister_cache(s, root);
 
-	mutex_lock(&memcg->slab_caches_mutex);
-	list_del(&s->memcg_params->list);
-	mutex_unlock(&memcg->slab_caches_mutex);
-
-	VM_BUG_ON(root->memcg_params->memcg_caches[id] != s);
-	root->memcg_params->memcg_caches[id] = NULL;
+	static_key_slow_dec(&memcg_kmem_enabled_key);
 }
 
 /*
@@ -3338,7 +3392,7 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 
 	if (atomic_read(&params->refcount) != 0) {
 		/*
-		 * We were scheduled from mem_cgroup_destroy_all_caches().
+		 * We were scheduled from mem_cgroup_reparent_slab().
 		 * Shrink the cache and drop the reference taken by memcg.
 		 */
 		kmem_cache_shrink(cachep);
@@ -3381,11 +3435,56 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	mutex_unlock(&activate_kmem_mutex);
 }
 
-static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_one_slab(struct kmem_cache *cachep,
+		struct mem_cgroup *from, struct mem_cgroup *to)
 {
+	struct kmem_cache *root;
+	int id;
+
+	root = cachep->memcg_params->root_cache;
+
+	if (to)
+		list_move(&cachep->memcg_params->list, &to->memcg_slab_caches);
+	else
+		list_del(&cachep->memcg_params->list);
+
+	BUG_ON(cachep->memcg_params->memcg != from);
+	cachep->memcg_params->memcg = to;
+
+	/*
+	 * We may access the cachep->memcg_params->memcg ptr lock-free so we
+	 * have to make sure readers will see the new value before the final
+	 * css put.
+	 */
+	smp_wmb();
+
+	/*
+	 * If the cache has already been reparented we are done here. Otherwise
+	 * we clear the reference to it in the memcg_caches array and schedule
+	 * shrink work.
+	 */
+	id = memcg_cache_id(from);
+	if (id < 0 || root->memcg_params->memcg_caches[id] != cachep)
+		return;
+
+	root->memcg_params->memcg_caches[id] = NULL;
+
+	/*
+	 * The work cannot have been scheduled from memcg_release_pages() yet,
+	 * because we haven't dropped cachep->memcg_params->refcount, so
+	 * schedule_work() cannot fail here.
+	 */
+	if (!schedule_work(&cachep->memcg_params->destroy))
+		BUG();
+}
+
+static void mem_cgroup_reparent_slab(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent;
 	struct memcg_cache_params *params;
 
-	if (!memcg_kmem_is_active(memcg))
+	if (!memcg_kmem_is_active(memcg) &&
+	    !memcg_kmem_is_reparented(memcg))
 		return;
 
 	/*
@@ -3397,17 +3496,30 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 	 */
 	flush_workqueue(memcg_cache_create_wq);
 
-	mutex_lock(&memcg->slab_caches_mutex);
-	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
-		/*
-		 * Since we still hold the reference to the cache params from
-		 * the memcg, the work could not have been scheduled from
-		 * memcg_release_pages(), and this cannot fail.
-		 */
-		if (!schedule_work(&params->destroy))
-			BUG();
+	/*
+	 * We are going to modify memcg_caches arrays, so we have to protect
+	 * them from being relocated.
+	 */
+	mutex_lock(&activate_kmem_mutex);
+
+	parent = parent_mem_cgroup(memcg);
+	if (parent)
+		mutex_lock_nested(&parent->slab_caches_mutex,
+				  MEMCG_SLAB_MUTEX_PARENT);
+	mutex_lock_nested(&memcg->slab_caches_mutex, MEMCG_SLAB_MUTEX_CHILD);
+	while (!list_empty(&memcg->memcg_slab_caches)) {
+		params = list_first_entry(&memcg->memcg_slab_caches,
+					  struct memcg_cache_params, list);
+		mem_cgroup_reparent_one_slab(params->cachep, memcg, parent);
 	}
 	mutex_unlock(&memcg->slab_caches_mutex);
+	if (parent)
+		mutex_unlock(&parent->slab_caches_mutex);
+
+	mutex_unlock(&activate_kmem_mutex);
+
+	if (parent)
+		memcg_kmem_set_reparented(parent);
 }
 
 struct create_work {
@@ -3538,7 +3650,7 @@ static void __init memcg_kmem_init(void)
 	BUG_ON(!memcg_cache_create_wq);
 }
 #else
-static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_slab(struct mem_cgroup *memcg)
 {
 }
 
@@ -5752,40 +5864,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-	if (!memcg_kmem_is_active(memcg))
-		return;
-
-	/*
-	 * kmem charges can outlive the cgroup. In the case of slab
-	 * pages, for instance, a page contain objects from various
-	 * processes. As we prevent from taking a reference for every
-	 * such allocation we have to be careful when doing uncharge
-	 * (see memcg_uncharge_kmem) and here during offlining.
-	 *
-	 * The idea is that that only the _last_ uncharge which sees
-	 * the dead memcg will drop the last reference. An additional
-	 * reference is taken here before the group is marked dead
-	 * which is then paired with css_put during uncharge resp. here.
-	 *
-	 * Although this might sound strange as this path is called from
-	 * css_offline() when the referencemight have dropped down to 0
-	 * and shouldn't be incremented anymore (css_tryget would fail)
-	 * we do not have other options because of the kmem allocations
-	 * lifetime.
-	 */
-	css_get(&memcg->css);
-
-	memcg_kmem_mark_dead(memcg);
-
-	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
-		return;
-
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
-}
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
@@ -5795,10 +5873,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-}
 #endif
 
 /*
@@ -6406,8 +6480,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock(&memcg->event_list_lock);
 
-	kmem_cgroup_css_offline(memcg);
-
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 
 	/*
@@ -6417,7 +6489,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	css_for_each_descendant_post(iter, css)
 		mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
 
-	mem_cgroup_destroy_all_caches(memcg);
+	mem_cgroup_reparent_slab(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH -mm 11/12] memcg: reparent slab on css offline
@ 2014-02-26 15:05   ` Vladimir Davydov
  0 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Currently we take a css reference for each memcg cache. This is simple,
but extremely ugly - a memcg will hang around after its death for as
long as it has any charges. Moreover, this clashes with the design
principles the cgroup subsystem implies.

However, nothing prevents us from reparenting kmem charges just like we
do with user pages. It is also much easier to implement: we already keep
all memcg caches on a list, so all we have to do is walk over the list
and move the caches to the parent cgroup's list, changing the memcg ptr
along the way. If somebody frees an object to a cache that is being
reparented, they might see a pointer to the old memcg, but that's OK -
we only need RCU to protect against use-after-free. Let's just do it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 mm/memcontrol.c |  306 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 189 insertions(+), 117 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dad455a911d5..05bde78e14f0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -383,11 +383,20 @@ struct mem_cgroup {
 /* internal only representation about the status of kmem accounting. */
 enum {
 	KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
-	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_ACCOUNTED_REPARENTED, /* has reparented caches */
 };
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
+/*
+ * mem_cgroup::slab_caches_mutex nesting subclasses:
+ */
+enum memcg_slab_mutex_class
+{
+	MEMCG_SLAB_MUTEX_PARENT,
+	MEMCG_SLAB_MUTEX_CHILD,
+};
+
+static void memcg_kmem_set_active(struct mem_cgroup *memcg)
 {
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -397,21 +406,14 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
+static void memcg_kmem_set_reparented(struct mem_cgroup *memcg)
 {
-	/*
-	 * Our caller must use css_get() first, because memcg_uncharge_kmem()
-	 * will call css_put() if it sees the memcg is dead.
-	 */
-	smp_wmb();
-	if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
-		set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
+	set_bit(KMEM_ACCOUNTED_REPARENTED, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
+static bool memcg_kmem_is_reparented(struct mem_cgroup *memcg)
 {
-	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
-				  &memcg->kmem_account_flags);
+	return test_bit(KMEM_ACCOUNTED_REPARENTED, &memcg->kmem_account_flags);
 }
 #endif
 
@@ -656,11 +658,6 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 		static_key_slow_dec(&memcg_kmem_enabled_key);
 		ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id);
 	}
-	/*
-	 * This check can't live in kmem destruction function,
-	 * since the charges will outlive the cgroup
-	 */
-	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -3040,36 +3037,48 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 	res_counter_uncharge(&memcg->res, size);
 	if (do_swap_account)
 		res_counter_uncharge(&memcg->memsw, size);
+	res_counter_uncharge(&memcg->kmem, size);
+}
 
-	/* Not down to 0 */
-	if (res_counter_uncharge(&memcg->kmem, size))
-		return;
+static struct mem_cgroup *try_get_mem_cgroup_from_slab(struct kmem_cache *s)
+{
+	struct mem_cgroup *memcg;
 
-	/*
-	 * Releases a reference taken in kmem_cgroup_css_offline in case
-	 * this last uncharge is racing with the offlining code or it is
-	 * outliving the memcg existence.
-	 *
-	 * The memory barrier imposed by test&clear is paired with the
-	 * explicit one in memcg_kmem_mark_dead().
-	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
+	if (is_root_cache(s))
+		return NULL;
+
+	rcu_read_lock();
+	do {
+		memcg = s->memcg_params->memcg;
+		if (!memcg)
+			break;
+	} while (!css_tryget(&memcg->css));
+	rcu_read_unlock();
+	return memcg;
 }
 
 int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages)
 {
-	if (is_root_cache(s))
-		return 0;
-	return memcg_charge_kmem(s->memcg_params->memcg,
-				 gfp, nr_pages << PAGE_SHIFT);
+	struct mem_cgroup *memcg;
+	int ret = 0;
+
+	memcg = try_get_mem_cgroup_from_slab(s);
+	if (memcg) {
+		ret = memcg_charge_kmem(memcg, gfp, nr_pages << PAGE_SHIFT);
+		css_put(&memcg->css);
+	}
+	return ret;
 }
 
 void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages)
 {
-	if (is_root_cache(s))
-		return;
-	memcg_uncharge_kmem(s->memcg_params->memcg, nr_pages << PAGE_SHIFT);
+	struct mem_cgroup *memcg;
+
+	memcg = try_get_mem_cgroup_from_slab(s);
+	if (memcg) {
+		memcg_uncharge_kmem(memcg, nr_pages << PAGE_SHIFT);
+		css_put(&memcg->css);
+	}
 }
 
 /*
@@ -3214,7 +3223,6 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 		INIT_WORK(&s->memcg_params->destroy,
 				kmem_cache_destroy_work_func);
 		atomic_set(&s->memcg_params->refcount, 1);
-		css_get(&memcg->css);
 	} else {
 		s->memcg_params->is_root_cache = true;
 		INIT_LIST_HEAD(&s->memcg_params->children);
@@ -3225,19 +3233,36 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 
 void memcg_free_cache_params(struct kmem_cache *s)
 {
-	if (!s->memcg_params)
-		return;
-	if (!s->memcg_params->is_root_cache)
-		css_put(&s->memcg_params->memcg->css);
 	kfree(s->memcg_params);
 }
 
-void memcg_register_cache(struct kmem_cache *s)
+static void __memcg_register_cache(struct kmem_cache *s, struct kmem_cache *root)
 {
-	struct kmem_cache *root;
 	struct mem_cgroup *memcg;
 	int id;
 
+	memcg = s->memcg_params->memcg;
+	/*
+	 * Special case: re-registering the cache on __kmem_cache_shutdown()
+	 * failure (see __kmem_cache_destroy()).
+	 */
+	if (!memcg)
+		return;
+
+	id = memcg_cache_id(memcg);
+	BUG_ON(id < 0);
+
+	mutex_lock(&memcg->slab_caches_mutex);
+	BUG_ON(root->memcg_params->memcg_caches[id]);
+	root->memcg_params->memcg_caches[id] = s;
+	list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
+	mutex_unlock(&memcg->slab_caches_mutex);
+}
+
+void memcg_register_cache(struct kmem_cache *s)
+{
+	struct kmem_cache *root;
+
 	if (is_root_cache(s))
 		return;
 
@@ -3247,10 +3272,6 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	lockdep_assert_held(&slab_mutex);
 
-	root = s->memcg_params->root_cache;
-	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
-
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
@@ -3258,21 +3279,61 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	smp_wmb();
 
+	root = s->memcg_params->root_cache;
 	list_add(&s->memcg_params->siblings, &root->memcg_params->children);
+	__memcg_register_cache(s, root);
 
-	VM_BUG_ON(root->memcg_params->memcg_caches[id]);
-	root->memcg_params->memcg_caches[id] = s;
+	static_key_slow_inc(&memcg_kmem_enabled_key);
+}
+
+static void __memcg_unregister_cache(struct kmem_cache *s, struct kmem_cache *root)
+{
+	struct mem_cgroup *memcg;
+	int id;
+
+retry:
+	memcg = try_get_mem_cgroup_from_slab(s);
+
+	/*
+	 * This can happen if the cache's memcg was taken offline and the cache
+	 * was reparented to the root cgroup. In this case the cache must have
+	 * already been properly unregistered, so we have nothing to do.
+	 */
+	if (!memcg)
+		return;
 
+	id = memcg_cache_id(memcg);
+
+	/*
+	 * To delete a cache from memcg_slab_caches list, we need to take the
+	 * corresponding slab_caches_mutex. Since nothing prevents the cache
+	 * from being reparented while we are here, we recheck the cache's
+	 * memcg after taking the mutex and retry if it changed.
+	 */
 	mutex_lock(&memcg->slab_caches_mutex);
-	list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
+	if (memcg != s->memcg_params->memcg) {
+		mutex_unlock(&memcg->slab_caches_mutex);
+		css_put(&memcg->css);
+		goto retry;
+	}
+
+	list_del(&s->memcg_params->list);
+	s->memcg_params->memcg = NULL;
+
+	/*
+	 * Clear the slot in the memcg_caches array only if the cache hasn't
+	 * been reparented before.
+	 */
+	if (id >= 0 && root->memcg_params->memcg_caches[id] == s)
+		root->memcg_params->memcg_caches[id] = NULL;
+
 	mutex_unlock(&memcg->slab_caches_mutex);
+	css_put(&memcg->css);
 }
 
 void memcg_unregister_cache(struct kmem_cache *s)
 {
 	struct kmem_cache *root;
-	struct mem_cgroup *memcg;
-	int id;
 
 	if (is_root_cache(s))
 		return;
@@ -3284,17 +3345,10 @@ void memcg_unregister_cache(struct kmem_cache *s)
 	lockdep_assert_held(&slab_mutex);
 
 	root = s->memcg_params->root_cache;
-	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
-
 	list_del(&s->memcg_params->siblings);
+	__memcg_unregister_cache(s, root);
 
-	mutex_lock(&memcg->slab_caches_mutex);
-	list_del(&s->memcg_params->list);
-	mutex_unlock(&memcg->slab_caches_mutex);
-
-	VM_BUG_ON(root->memcg_params->memcg_caches[id] != s);
-	root->memcg_params->memcg_caches[id] = NULL;
+	static_key_slow_dec(&memcg_kmem_enabled_key);
 }
 
 /*
@@ -3338,7 +3392,7 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 
 	if (atomic_read(&params->refcount) != 0) {
 		/*
-		 * We were scheduled from mem_cgroup_destroy_all_caches().
+		 * We were scheduled from mem_cgroup_reparent_slab().
 		 * Shrink the cache and drop the reference taken by memcg.
 		 */
 		kmem_cache_shrink(cachep);
@@ -3381,11 +3435,56 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	mutex_unlock(&activate_kmem_mutex);
 }
 
-static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_one_slab(struct kmem_cache *cachep,
+		struct mem_cgroup *from, struct mem_cgroup *to)
 {
+	struct kmem_cache *root;
+	int id;
+
+	root = cachep->memcg_params->root_cache;
+
+	if (to)
+		list_move(&cachep->memcg_params->list, &to->memcg_slab_caches);
+	else
+		list_del(&cachep->memcg_params->list);
+
+	BUG_ON(cachep->memcg_params->memcg != from);
+	cachep->memcg_params->memcg = to;
+
+	/*
+	 * We may access the cachep->memcg_params->memcg ptr lock-free so we
+	 * have to make sure readers will see the new value before the final
+	 * css put.
+	 */
+	smp_wmb();
+
+	/*
+	 * If the cache has already been reparented we are done here. Otherwise
+	 * we clear the reference to it in the memcg_caches array and schedule
+	 * shrink work.
+	 */
+	id = memcg_cache_id(from);
+	if (id < 0 || root->memcg_params->memcg_caches[id] != cachep)
+		return;
+
+	root->memcg_params->memcg_caches[id] = NULL;
+
+	/*
+	 * The work could not have been scheduled from memcg_release_pages(),
+	 * because we still hold cachep->memcg_params->refcount, so the
+	 * schedule_work() below cannot fail.
+	 */
+	if (!schedule_work(&cachep->memcg_params->destroy))
+		BUG();
+}
+
+static void mem_cgroup_reparent_slab(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent;
 	struct memcg_cache_params *params;
 
-	if (!memcg_kmem_is_active(memcg))
+	if (!memcg_kmem_is_active(memcg) &&
+	    !memcg_kmem_is_reparented(memcg))
 		return;
 
 	/*
@@ -3397,17 +3496,30 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 	 */
 	flush_workqueue(memcg_cache_create_wq);
 
-	mutex_lock(&memcg->slab_caches_mutex);
-	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
-		/*
-		 * Since we still hold the reference to the cache params from
-		 * the memcg, the work could not have been scheduled from
-		 * memcg_release_pages(), and this cannot fail.
-		 */
-		if (!schedule_work(&params->destroy))
-			BUG();
+	/*
+	 * We are going to modify memcg_caches arrays, so we have to protect
+	 * them against being relocated.
+	 */
+	mutex_lock(&activate_kmem_mutex);
+
+	parent = parent_mem_cgroup(memcg);
+	if (parent)
+		mutex_lock_nested(&parent->slab_caches_mutex,
+				  MEMCG_SLAB_MUTEX_PARENT);
+	mutex_lock_nested(&memcg->slab_caches_mutex, MEMCG_SLAB_MUTEX_CHILD);
+	while (!list_empty(&memcg->memcg_slab_caches)) {
+		params = list_first_entry(&memcg->memcg_slab_caches,
+					  struct memcg_cache_params, list);
+		mem_cgroup_reparent_one_slab(params->cachep, memcg, parent);
 	}
 	mutex_unlock(&memcg->slab_caches_mutex);
+	if (parent)
+		mutex_unlock(&parent->slab_caches_mutex);
+
+	mutex_unlock(&activate_kmem_mutex);
+
+	if (parent)
+		memcg_kmem_set_reparented(parent);
 }
 
 struct create_work {
@@ -3538,7 +3650,7 @@ static void __init memcg_kmem_init(void)
 	BUG_ON(!memcg_cache_create_wq);
 }
 #else
-static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_slab(struct mem_cgroup *memcg)
 {
 }
 
@@ -5752,40 +5864,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-	if (!memcg_kmem_is_active(memcg))
-		return;
-
-	/*
-	 * kmem charges can outlive the cgroup. In the case of slab
-	 * pages, for instance, a page contain objects from various
-	 * processes. As we prevent from taking a reference for every
-	 * such allocation we have to be careful when doing uncharge
-	 * (see memcg_uncharge_kmem) and here during offlining.
-	 *
-	 * The idea is that that only the _last_ uncharge which sees
-	 * the dead memcg will drop the last reference. An additional
-	 * reference is taken here before the group is marked dead
-	 * which is then paired with css_put during uncharge resp. here.
-	 *
-	 * Although this might sound strange as this path is called from
-	 * css_offline() when the referencemight have dropped down to 0
-	 * and shouldn't be incremented anymore (css_tryget would fail)
-	 * we do not have other options because of the kmem allocations
-	 * lifetime.
-	 */
-	css_get(&memcg->css);
-
-	memcg_kmem_mark_dead(memcg);
-
-	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
-		return;
-
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
-}
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
@@ -5795,10 +5873,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-}
 #endif
 
 /*
@@ -6406,8 +6480,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock(&memcg->event_list_lock);
 
-	kmem_cgroup_css_offline(memcg);
-
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 
 	/*
@@ -6417,7 +6489,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	css_for_each_descendant_post(iter, css)
 		mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
 
-	mem_cgroup_destroy_all_caches(memcg);
+	mem_cgroup_reparent_slab(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
-- 
1.7.10.4
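
A note for readers on the retry loop in __memcg_unregister_cache() above:
because a cache can be reparented at any moment, the code looks up the
cache's memcg, takes that memcg's slab_caches_mutex, and then re-checks
under the lock that the cache still belongs to the same memcg, retrying if
it does not. The standalone userspace sketch below illustrates only that
lock-and-revalidate pattern; it is not kernel code. The struct owner /
struct object types and detach_from_owner() are invented for illustration,
pthread mutexes stand in for the kernel mutex, and the GCC/Clang
__atomic_* builtins stand in for the kernel's barrier/annotation helpers.

/*
 * Minimal userspace sketch of the lock-and-revalidate pattern used by
 * __memcg_unregister_cache(): the object's owner can change while we are
 * taking the owner's lock, so we re-check ownership after acquiring it
 * and retry if it changed.  All names here are hypothetical.
 */
#include <pthread.h>
#include <stdio.h>

struct owner {
	pthread_mutex_t lock;
	const char *name;
};

struct object {
	struct owner *owner;	/* may be re-pointed (reparented) by another thread */
};

static void detach_from_owner(struct object *obj)
{
	struct owner *o;

retry:
	o = __atomic_load_n(&obj->owner, __ATOMIC_ACQUIRE);
	if (!o)
		return;		/* already detached by someone else */

	pthread_mutex_lock(&o->lock);
	/*
	 * The owner may have changed between the load above and taking the
	 * lock; revalidate under the lock and retry with the new owner.
	 */
	if (o != __atomic_load_n(&obj->owner, __ATOMIC_ACQUIRE)) {
		pthread_mutex_unlock(&o->lock);
		goto retry;
	}

	printf("detaching object from owner \"%s\"\n", o->name);
	__atomic_store_n(&obj->owner, (struct owner *)NULL, __ATOMIC_RELEASE);
	pthread_mutex_unlock(&o->lock);
}

int main(void)
{
	struct owner parent = { PTHREAD_MUTEX_INITIALIZER, "parent" };
	struct object obj = { &parent };

	detach_from_owner(&obj);
	return 0;
}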

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH -mm 12/12] slub: make sure all memcg caches have unique names on sysfs
  2014-02-26 15:05 ` Vladimir Davydov
@ 2014-02-26 15:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-02-26 15:05 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Christoph Lameter, Pekka Enberg

Since memcg caches are now reparented on memcg offline, a memcg cache
can outlive its cgroup. If the memcg id is then reused for a new cgroup
with the same name, we can get a cache name collision, which will result
in failures when trying to add a sysfs entry for the new cgroup's cache.
Fix this by appending the cache address to the sysfs names of all memcg
caches so that they are guaranteed to be unique.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
---
 mm/slub.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index d8b8659bfa64..cdeea794bba5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5234,7 +5234,18 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	}
 
 	s->kobj.kset = cache_kset(s);
-	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
+	/*
+	 * A memcg cache can outlive its cgroup. If the memcg id is then reused
+	 * for a new cgroup with the same name, we can get a cache name
+	 * collision. To make sure all memcg caches have unique names on sysfs,
+	 * we append the cache address to its name.
+	 */
+	if (is_root_cache(s))
+		err = kobject_init_and_add(&s->kobj, &slab_ktype,
+					   NULL, "%s", name);
+	else
+		err = kobject_init_and_add(&s->kobj, &slab_ktype,
+					   NULL, "%s-%p", name, s);
 	if (err) {
 		kobject_put(&s->kobj);
 		return err;
-- 
1.7.10.4
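
To make the effect of the "%s" versus "%s-%p" naming in sysfs_slab_add()
concrete, here is a small standalone userspace sketch. It is illustration
only, not kernel code: struct fake_cache, make_sysfs_name() and the example
cache names are invented, and only the two format strings mirror the actual
change in the patch above. The cost of the change is that memcg cache names
on sysfs become less predictable, which is the price for guaranteed
uniqueness once caches can outlive their cgroup.

/*
 * Userspace illustration of the naming scheme: a root cache keeps its
 * plain name, while a memcg clone gets the cache address appended.
 * All identifiers here are hypothetical.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct fake_cache {
	const char *name;
	bool is_root;
};

static void make_sysfs_name(struct fake_cache *s, char *buf, size_t len)
{
	if (s->is_root)
		snprintf(buf, len, "%s", s->name);
	else
		/*
		 * Appending the cache address keeps the name unique even if
		 * a new cgroup reuses the dead cgroup's name (and memcg id).
		 */
		snprintf(buf, len, "%s-%p", s->name, (void *)s);
}

int main(void)
{
	struct fake_cache root = { "dentry", true };
	struct fake_cache memcg_clone = { "dentry(2:mygroup)", false };
	char buf[128];

	make_sysfs_name(&root, buf, sizeof(buf));
	printf("root cache : %s\n", buf);
	make_sysfs_name(&memcg_clone, buf, sizeof(buf));
	printf("memcg cache: %s\n", buf);
	return 0;
}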


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH -mm 00/12] kmemcg reparenting
  2014-02-26 15:05 ` Vladimir Davydov
@ 2014-03-04 14:56   ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-03-04 14:56 UTC (permalink / raw)
  To: hannes, mhocko; +Cc: akpm, glommer, linux-kernel, linux-mm, devel

Hi Johannes, Michal

Could you please take a look at this set when you have time?

Thank you.

On 02/26/2014 07:05 PM, Vladimir Davydov wrote:
> Hi,
>
> During my recent attempt to push kmemcg shrinkers, I was pointed out
> that current kmemcg implementation has a serious design flaw - it lacks
> reparenting. Currently each memcg cache holds a css ref to its memcg and
> does not let it go until the cache is emptied. Although this approach is
> simple, it leads to memcgs hanging around for quite a long time after
> the death, which is ugly. Building something on top of that is
> unacceptable. So this patch set targets on implementing reparenting for
> kmemcg charges.
>
> [ for more details see the discussion thread:
>   https://lkml.org/lkml/2014/2/11/623 ]
>
> It is based on top of 3.14.0-rc4-mmotm and organized as follows:
>  - Patches 1-3 fix some nasty races in kmemcg implementation. I could
>    not let them live any longer, because they touch the code I'm going
>    to modify.
>  - Patches 4-6 prepare memcg_cache_params for reparenting.
>  - Patch 7 rework slab charging making it easier to track and therefore
>    reparent kmem charges, and patches 8-10 kill the old charging code.
>  - Patch 11 introduces kmemcg reparenting.
>  - Patch 12 is for slub. It fixes sysfs naming clashes that can arise
>    due to reparented caches.
>
> Please note that this patch set does not resolve all kmemcg-related
> issues - there are still plenty of them (e.g. "dangling" caches), but it
> is already big enough so I guess I'll address them later when this one
> is committed (if it will be committed at all, of course).
>
> Many thanks to Johannes Weiner, who proposed the idea and kindly
> outlined basic design principles.
>
> Thanks,
>
> Vladimir Davydov (12):
>   memcg: flush cache creation works before memcg cache destruction
>   memcg: fix race in memcg cache destruction path
>   memcg: fix root vs memcg cache destruction race
>   memcg: move slab caches list/mutex init to memcg creation
>   memcg: add pointer from memcg_cache_params to cache
>   memcg: keep all children of each root cache on a list
>   memcg: rework slab charging
>   memcg: do not charge kmalloc_large allocations
>   fork: do not charge thread_info to kmemcg
>   memcg: kill GFP_KMEMCG and stuff
>   memcg: reparent slab on css offline
>   slub: make sure all memcg caches have unique names on sysfs
>
>  include/linux/gfp.h             |    5 -
>  include/linux/memcontrol.h      |  133 ++-------
>  include/linux/slab.h            |   15 +-
>  include/linux/thread_info.h     |    2 -
>  include/trace/events/gfpflags.h |    1 -
>  kernel/fork.c                   |    4 +-
>  mm/memcontrol.c                 |  587 +++++++++++++++++----------------------
>  mm/page_alloc.c                 |   35 ---
>  mm/slab.c                       |   47 ++--
>  mm/slab.h                       |   17 +-
>  mm/slab_common.c                |  100 +++++--
>  mm/slub.c                       |   88 ++++--
>  12 files changed, 470 insertions(+), 564 deletions(-)
>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -mm 00/12] kmemcg reparenting
  2014-03-04 14:56   ` Vladimir Davydov
@ 2014-03-04 15:21     ` Michal Hocko
  -1 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2014-03-04 15:21 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: hannes, akpm, glommer, linux-kernel, linux-mm, devel

On Tue 04-03-14 18:56:06, Vladimir Davydov wrote:
> Hi Johannes, Michal
> 
> Could you please take a look at this set when you have time?

I plan to catch up with others as well. I was on vacation last week and
am now catching up with other stuff. I do understand that this review
"speed" might be really frustrating for you, but there are a lot of things
on my agenda right now (and have been for the last few weeks). Sorry about that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -mm 00/12] kmemcg reparenting
  2014-03-04 15:21     ` Michal Hocko
@ 2014-03-05  5:54       ` Vladimir Davydov
  -1 siblings, 0 replies; 32+ messages in thread
From: Vladimir Davydov @ 2014-03-05  5:54 UTC (permalink / raw)
  To: Michal Hocko; +Cc: hannes, akpm, glommer, linux-kernel, linux-mm, devel

On 03/04/2014 07:21 PM, Michal Hocko wrote:
> On Tue 04-03-14 18:56:06, Vladimir Davydov wrote:
>> Hi Johannes, Michal
>>
>> Could you please take a look at this set when you have time?
> I plan to catch up with others as well. I was on vacation last week and
> now catching up with other stuff. I do understand that this review
> "speed" might be really frustrating for you but there is a lot of things
> on my agenda now (and last few weeks). Sorry about that.

Never mind. I just wanted to make sure you hadn't missed it.

Thank you.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2014-03-05  5:54 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-26 15:05 [PATCH -mm 00/12] kmemcg reparenting Vladimir Davydov
2014-02-26 15:05 ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 01/12] memcg: flush cache creation works before memcg cache destruction Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 02/12] memcg: fix race in memcg cache destruction path Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 03/12] memcg: fix root vs memcg cache destruction race Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 04/12] memcg: move slab caches list/mutex init to memcg creation Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 05/12] memcg: add pointer from memcg_cache_params to cache Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 06/12] memcg: keep all children of each root cache on a list Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 07/12] memcg: rework slab charging Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 08/12] memcg: do not charge kmalloc_large allocations Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 09/12] fork: do not charge thread_info to kmemcg Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 10/12] memcg: kill GFP_KMEMCG and stuff Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 11/12] memcg: reparent slab on css offline Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-02-26 15:05 ` [PATCH -mm 12/12] slub: make sure all memcg caches have unique names on sysfs Vladimir Davydov
2014-02-26 15:05   ` Vladimir Davydov
2014-03-04 14:56 ` [PATCH -mm 00/12] kmemcg reparenting Vladimir Davydov
2014-03-04 14:56   ` Vladimir Davydov
2014-03-04 15:21   ` Michal Hocko
2014-03-04 15:21     ` Michal Hocko
2014-03-05  5:54     ` Vladimir Davydov
2014-03-05  5:54       ` Vladimir Davydov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.