linux-mm.kvack.org archive mirror
* [PATCH v3 0/7] mm: reparent slab memory on cgroup removal
@ 2019-05-08 20:24 Roman Gushchin
  2019-05-08 20:24 ` [PATCH v3 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-05-08 20:24 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	cgroups, Roman Gushchin

# Why do we need this?

We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production. The following investigation revealed an issue
in the userspace memory reclaim code [1], in the accounting of kernel
stacks [2], and also the main reason: slab objects.

The underlying problem is quite simple: any page charged to a cgroup
holds a reference to it, so the cgroup can't be released until all charged
pages are gone. If a slab object is actively used by other cgroups,
it won't be reclaimed and will prevent the originating cgroup from being released.

Slab objects, and first of all the vfs cache, are shared between cgroups
that use the same underlying filesystem and, what's even more important,
between multiple generations of the same workload. So if something runs
periodically, each time in a new cgroup (which is how systemd works), we
accumulate multiple dying cgroups.

Strictly speaking the pagecache isn't different here, but there is a key distinction:
we disable protection and apply some extra pressure on the LRUs of dying cgroups,
and these LRUs contain all of their charged pages.
My experiments show that with kernel memory accounting disabled the number
of dying cgroups stabilizes at a relatively small value (~100, depending on
memory pressure and the cgroup creation rate), while with kernel memory
accounting enabled it grows pretty steadily up to several thousand.

Memory cgroups are quite complex and big objects (mostly due to percpu stats),
so this leads to noticeable memory losses. The memory occupied by dying cgroups
is measured in hundreds of megabytes; I've even seen a host with more than
100GB of memory wasted on dying cgroups. This degrades performance over
uptime and generally limits the usage of cgroups.

My previous attempt [3] to fix the problem by applying extra pressure on the
slab shrinker lists caused regressions with xfs and ext4 and has been
reverted [4]. The following attempts to find the right balance [5, 6] were
not successful.

So instead of trying to find a balance that may not exist, let's reparent
the accounted slab caches to the parent cgroup on cgroup removal.


# Implementation approach

There is, however, a significant problem with reparenting slab memory:
there is no list of charged pages. Some of them are on shrinker lists,
but not all. Introducing a new list is really not an option.

But fortunately there is a way forward: every slab page has a stable pointer
to the corresponding kmem_cache. So the idea is to reparent kmem_caches
instead of slab pages.

It's actually simpler and cheaper, but requires some underlying changes
(see the sketch after this list):
1) Make kmem_caches hold a single reference to the memory cgroup,
   instead of a separate reference per slab page.
2) Stop setting the page->mem_cgroup pointer for memcg slab pages and use
   the page->kmem_cache->memcg indirection instead. It's used only on
   slab page release, so it shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
   be able to destroy kmem_caches when they become empty and to release
   the associated memory cgroup.
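
Here is a minimal, self-contained C sketch of the inverted ownership scheme
(an illustration only: the names below are simplified stand-ins, the real
fields live in struct page, struct kmem_cache and struct memcg_cache_params):

struct memcg;

struct slab_cache {
	struct memcg *memcg;		/* single reference, taken when the
					 * cache is linked to the memcg */
	long refcnt;			/* stands in for the percpu refcounter,
					 * bumped per charged page/allocation */
};

struct slab_page {
	struct slab_cache *cache;	/* stable pointer, set at allocation */
	/* note: no direct mem_cgroup pointer anymore */
};

/* On slab page release the memcg is reached via the kmem_cache: */
static struct memcg *memcg_of_slab_page(struct slab_page *page)
{
	return page->cache->memcg;
}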

There is a bonus: currently we release empty kmem_caches on cgroup
removal; however, all others wait for the release of the memory cgroup.
These refactorings allow kmem_caches to be released as soon as they
become inactive and empty.

Some additional implementation details are provided in corresponding
commit messages.


# Results

Below is the average number of dying cgroups on two groups of our production
hosts. They run a web frontend workload under moderate memory pressure.
As we can see, with kernel memory reparenting the number stabilizes in the
50s range, while with the original version it grows almost linearly and
shows no signs of plateauing. The difference in slab and percpu usage
between the patched and unpatched versions also grows linearly; in 6 days
it reached about 200MB.

day           0    1    2    3    4    5    6
original     39  338  580  827 1098 1349 1574
patched      23   44   45   47   50   46   55
mem diff(MB) 53   73   99  137  148  182  209


# History

v3:
  1) reworked memcg kmem_cache search on allocation path
  2) fixed /proc/kpagecgroup interface

v2:
  1) switched to a percpu kmem_cache refcounter
  2) a reference to the kmem_cache is held during the allocation
  3) slab stats are fixed for the !MEMCG case (and the refactoring
     is separated into a standalone patch)
  4) kmem_cache reparenting is performed from the deactivation context

v1:
  https://lkml.org/lkml/2019/4/17/1095


# Links

[1]: commit 68600f623d69 ("mm: don't miss the last page because of
round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively
small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs
with a relatively small number of objects")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2


Roman Gushchin (7):
  mm: postpone kmem_cache memcg pointer initialization to
    memcg_link_cache()
  mm: generalize postponed non-root kmem_cache deactivation
  mm: introduce __memcg_kmem_uncharge_memcg()
  mm: unify SLAB and SLUB page accounting
  mm: rework non-root kmem_cache lifecycle management
  mm: reparent slab memory on cgroup removal
  mm: fix /proc/kpagecgroup interface for slab pages

 include/linux/memcontrol.h |  10 +++
 include/linux/slab.h       |  13 ++--
 mm/memcontrol.c            |  97 ++++++++++++++++--------
 mm/slab.c                  |  25 ++----
 mm/slab.h                  | 120 +++++++++++++++++++++--------
 mm/slab_common.c           | 151 ++++++++++++++++++++-----------------
 mm/slub.c                  |  36 ++-------
 7 files changed, 267 insertions(+), 185 deletions(-)

-- 
2.20.1



* [PATCH v3 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
  2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
@ 2019-05-08 20:24 ` Roman Gushchin
  2019-05-11  0:32   ` Shakeel Butt
  2019-05-08 20:24 ` [PATCH v3 2/7] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-08 20:24 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	cgroups, Roman Gushchin

Initialize kmem_cache->memcg_params.memcg pointer in
memcg_link_cache() rather than in init_memcg_params().

This will simplify the refcounting once the kmem_cache starts holding
a reference to the memory cgroup.

For non-root kmem_caches memcg_link_cache() is always called
before the kmem_cache becomes visible to a user, so it's safe.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.c        |  2 +-
 mm/slab.h        |  5 +++--
 mm/slab_common.c | 14 +++++++-------
 mm/slub.c        |  2 +-
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 2915d912e89a..f6eff59e018e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1268,7 +1268,7 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache);
+	memcg_link_cache(kmem_cache, NULL);
 	slab_state = PARTIAL;
 
 	/*
diff --git a/mm/slab.h b/mm/slab.h
index 43ac818b8592..6a562ca72bca 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -289,7 +289,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);
+extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
 				void (*deact_fn)(struct kmem_cache *));
 
@@ -344,7 +344,8 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s)
+static inline void memcg_link_cache(struct kmem_cache *s,
+				    struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..6e00bdf8618d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -140,13 +140,12 @@ void slab_init_memcg_params(struct kmem_cache *s)
 }
 
 static int init_memcg_params(struct kmem_cache *s,
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+			     struct kmem_cache *root_cache)
 {
 	struct memcg_cache_array *arr;
 
 	if (root_cache) {
 		s->memcg_params.root_cache = root_cache;
-		s->memcg_params.memcg = memcg;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
 		return 0;
@@ -221,11 +220,12 @@ int memcg_update_all_caches(int num_memcgs)
 	return ret;
 }
 
-void memcg_link_cache(struct kmem_cache *s)
+void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 {
 	if (is_root_cache(s)) {
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
+		s->memcg_params.memcg = memcg;
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
 		list_add(&s->memcg_params.kmem_caches_node,
@@ -244,7 +244,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+				    struct kmem_cache *root_cache)
 {
 	return 0;
 }
@@ -384,7 +384,7 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, memcg, root_cache);
+	err = init_memcg_params(s, root_cache);
 	if (err)
 		goto out_free_cache;
 
@@ -394,7 +394,7 @@ static struct kmem_cache *create_cache(const char *name,
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, memcg);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -997,7 +997,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, NULL);
 	s->refcount = 1;
 	return s;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 5b2e364102e1..16f7e4f5a141 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4219,7 +4219,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, NULL);
 	return s;
 }
 
-- 
2.20.1



* [PATCH v3 2/7] mm: generalize postponed non-root kmem_cache deactivation
  2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
  2019-05-08 20:24 ` [PATCH v3 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
@ 2019-05-08 20:24 ` Roman Gushchin
  2019-05-11  0:33   ` Shakeel Butt
  2019-05-08 20:24 ` [PATCH v3 3/7] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-08 20:24 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	cgroups, Roman Gushchin

Currently SLUB uses a work scheduled after an RCU grace period
to deactivate a non-root kmem_cache. This mechanism can be reused
for kmem_caches reparenting, but requires some generalization.

Let's decouple all infrastructure (rcu callback, work callback)
from the SLUB-specific code, so it can be used with SLAB as well.

Also, let's rename some functions to make the code look simpler:
all SLAB/SLUB-specific functions now start with "__", and the "deact_"
prefix is dropped from the corresponding struct fields.

Here is the new calling scheme:
kmemcg_cache_deactivate()
  __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
  kmemcg_schedule_work_after_rcu()             rcu
    kmemcg_after_rcu_workfn()                  work
      kmemcg_cache_deactivate_after_rcu()
        __kmemcg_cache_deactivate_after_rcu()  SLAB/SLUB-specific

instead of:
__kmemcg_cache_deactivate()                    SLAB/SLUB-specific
  slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
    kmemcg_deactivate_rcufn                    SLUB-only, rcu
      kmemcg_deactivate_workfn                 SLUB-only, work
        kmemcg_cache_deact_after_rcu()         SLUB-only

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slab.h |  6 ++---
 mm/slab.c            |  4 +++
 mm/slab.h            |  3 ++-
 mm/slab_common.c     | 62 ++++++++++++++++++++------------------------
 mm/slub.c            |  8 +-----
 5 files changed, 38 insertions(+), 45 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..47923c173f30 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -642,10 +642,10 @@ struct memcg_cache_params {
 			struct list_head children_node;
 			struct list_head kmem_caches_node;
 
-			void (*deact_fn)(struct kmem_cache *);
+			void (*work_fn)(struct kmem_cache *);
 			union {
-				struct rcu_head deact_rcu_head;
-				struct work_struct deact_work;
+				struct rcu_head rcu_head;
+				struct work_struct work;
 			};
 		};
 	};
diff --git a/mm/slab.c b/mm/slab.c
index f6eff59e018e..83000e46b870 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2281,6 +2281,10 @@ void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
 {
 	__kmem_cache_shrink(cachep);
 }
+
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
+{
+}
 #endif
 
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index 6a562ca72bca..4a261c97c138 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -172,6 +172,7 @@ int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
 void __kmemcg_cache_deactivate(struct kmem_cache *s);
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
@@ -291,7 +292,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
-				void (*deact_fn)(struct kmem_cache *));
+				void (*work_fn)(struct kmem_cache *));
 
 #else /* CONFIG_MEMCG_KMEM */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 6e00bdf8618d..4e5b4292a763 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -691,17 +691,18 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
-static void kmemcg_deactivate_workfn(struct work_struct *work)
+static void kmemcg_after_rcu_workfn(struct work_struct *work)
 {
 	struct kmem_cache *s = container_of(work, struct kmem_cache,
-					    memcg_params.deact_work);
+					    memcg_params.work);
 
 	get_online_cpus();
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
 
-	s->memcg_params.deact_fn(s);
+	s->memcg_params.work_fn(s);
+	s->memcg_params.work_fn = NULL;
 
 	mutex_unlock(&slab_mutex);
 
@@ -712,37 +713,28 @@ static void kmemcg_deactivate_workfn(struct work_struct *work)
 	css_put(&s->memcg_params.memcg->css);
 }
 
-static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+/*
+ * We need to grab blocking locks.  Bounce to ->work.  The
+ * work item shares the space with the RCU head and can't be
+ * initialized eariler.
+*/
+static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
 {
 	struct kmem_cache *s = container_of(head, struct kmem_cache,
-					    memcg_params.deact_rcu_head);
+					    memcg_params.rcu_head);
 
-	/*
-	 * We need to grab blocking locks.  Bounce to ->deact_work.  The
-	 * work item shares the space with the RCU head and can't be
-	 * initialized eariler.
-	 */
-	INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.deact_work);
+	INIT_WORK(&s->memcg_params.work, kmemcg_after_rcu_workfn);
+	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
 }
 
-/**
- * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
- *					   sched RCU grace period
- * @s: target kmem_cache
- * @deact_fn: deactivation function to call
- *
- * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
- * held after a sched RCU grace period.  The slab is guaranteed to stay
- * alive until @deact_fn is finished.  This is to be used from
- * __kmemcg_cache_deactivate().
- */
-void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
-					   void (*deact_fn)(struct kmem_cache *))
+static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
 {
-	if (WARN_ON_ONCE(is_root_cache(s)) ||
-	    WARN_ON_ONCE(s->memcg_params.deact_fn))
-		return;
+	__kmemcg_cache_deactivate_after_rcu(s);
+}
+
+static void kmemcg_cache_deactivate(struct kmem_cache *s)
+{
+	__kmemcg_cache_deactivate(s);
 
 	if (s->memcg_params.root_cache->memcg_params.dying)
 		return;
@@ -750,8 +742,9 @@ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
 	/* pin memcg so that @s doesn't get destroyed in the middle */
 	css_get(&s->memcg_params.memcg->css);
 
-	s->memcg_params.deact_fn = deact_fn;
-	call_rcu(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
+	WARN_ON_ONCE(s->memcg_params.work_fn);
+	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
+	call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
 }
 
 void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
@@ -773,7 +766,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		if (!c)
 			continue;
 
-		__kmemcg_cache_deactivate(c);
+		kmemcg_cache_deactivate(c);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
@@ -866,11 +859,12 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
 	mutex_unlock(&slab_mutex);
 
 	/*
-	 * SLUB deactivates the kmem_caches through call_rcu. Make
+	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
 	 * sure all registered rcu callbacks have been invoked.
 	 */
-	if (IS_ENABLED(CONFIG_SLUB))
-		rcu_barrier();
+#ifndef CONFIG_SLOB
+	rcu_barrier();
+#endif
 
 	/*
 	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
diff --git a/mm/slub.c b/mm/slub.c
index 16f7e4f5a141..43c34d54ad86 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4028,7 +4028,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG
-static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
 {
 	/*
 	 * Called with all the locks held after a sched RCU grace period.
@@ -4054,12 +4054,6 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
 	 */
 	slub_set_cpu_partial(s, 0);
 	s->min_partial = 0;
-
-	/*
-	 * s->cpu_partial is checked locklessly (see put_cpu_partial), so
-	 * we have to make sure the change is visible before shrinking.
-	 */
-	slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
 }
 #endif	/* CONFIG_MEMCG */
 
-- 
2.20.1



* [PATCH v3 3/7] mm: introduce __memcg_kmem_uncharge_memcg()
  2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
  2019-05-08 20:24 ` [PATCH v3 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
  2019-05-08 20:24 ` [PATCH v3 2/7] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
@ 2019-05-08 20:24 ` Roman Gushchin
  2019-05-11  0:33   ` Shakeel Butt
  2019-05-08 20:24 ` [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting Roman Gushchin
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-08 20:24 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	cgroups, Roman Gushchin

Let's separate the page counter modification code out of
__memcg_kmem_uncharge(), similarly to how __memcg_kmem_charge() and
__memcg_kmem_charge_memcg() are split.

This will allow reusing this code later via a new
memcg_kmem_uncharge_memcg() wrapper, which calls
__memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled()
check passes.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 10 ++++++++++
 mm/memcontrol.c            | 25 +++++++++++++++++--------
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 36bdfe8e5965..deb209510902 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1298,6 +1298,8 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge(struct page *page, int order);
 int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 			      struct mem_cgroup *memcg);
+void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+				 unsigned int nr_pages);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
@@ -1339,6 +1341,14 @@ static inline int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp,
 		return __memcg_kmem_charge_memcg(page, gfp, order, memcg);
 	return 0;
 }
+
+static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
+					     struct mem_cgroup *memcg)
+{
+	if (memcg_kmem_enabled())
+		__memcg_kmem_uncharge_memcg(memcg, 1 << order);
+}
+
 /*
  * helper for accessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48a8f1c35176..b2c39f187cbb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2750,6 +2750,22 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 	css_put(&memcg->css);
 	return ret;
 }
+
+/**
+ * __memcg_kmem_uncharge_memcg: uncharge a kmem page
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+				 unsigned int nr_pages)
+{
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		page_counter_uncharge(&memcg->kmem, nr_pages);
+
+	page_counter_uncharge(&memcg->memory, nr_pages);
+	if (do_memsw_account())
+		page_counter_uncharge(&memcg->memsw, nr_pages);
+}
 /**
  * __memcg_kmem_uncharge: uncharge a kmem page
  * @page: page to uncharge
@@ -2764,14 +2780,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 		return;
 
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		page_counter_uncharge(&memcg->kmem, nr_pages);
-
-	page_counter_uncharge(&memcg->memory, nr_pages);
-	if (do_memsw_account())
-		page_counter_uncharge(&memcg->memsw, nr_pages);
-
+	__memcg_kmem_uncharge_memcg(memcg, nr_pages);
 	page->mem_cgroup = NULL;
 
 	/* slab pages do not have PageKmemcg flag set */
-- 
2.20.1



* [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting
  2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (2 preceding siblings ...)
  2019-05-08 20:24 ` [PATCH v3 3/7] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
@ 2019-05-08 20:24 ` Roman Gushchin
  2019-05-11  0:33   ` Shakeel Butt
  2019-05-13 18:01   ` Christopher Lameter
  2019-05-08 20:24 ` [PATCH v3 5/7] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-05-08 20:24 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	cgroups, Roman Gushchin

Currently the page accounting code is duplicated in the SLAB and SLUB
internals. Let's move it into new (un)charge_slab_page helpers in
mm/slab.h. These helpers will be responsible for statistics
(global and memcg-aware) and for memcg charging, replacing the direct
memcg_(un)charge_slab() calls.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.c | 19 +++----------------
 mm/slab.h | 25 +++++++++++++++++++++++++
 mm/slub.c | 14 ++------------
 3 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 83000e46b870..32e6af9ed9af 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1389,7 +1389,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 								int nodeid)
 {
 	struct page *page;
-	int nr_pages;
 
 	flags |= cachep->allocflags;
 
@@ -1399,17 +1398,11 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (memcg_charge_slab(page, flags, cachep->gfporder, cachep)) {
+	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
 		__free_pages(page, cachep->gfporder);
 		return NULL;
 	}
 
-	nr_pages = (1 << cachep->gfporder);
-	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
-		mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
-	else
-		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
-
 	__SetPageSlab(page);
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (sk_memalloc_socks() && page_is_pfmemalloc(page))
@@ -1424,12 +1417,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 {
 	int order = cachep->gfporder;
-	unsigned long nr_freed = (1 << order);
-
-	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
-		mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
-	else
-		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
 
 	BUG_ON(!PageSlab(page));
 	__ClearPageSlabPfmemalloc(page);
@@ -1438,8 +1425,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 	page->mapping = NULL;
 
 	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += nr_freed;
-	memcg_uncharge_slab(page, order, cachep);
+		current->reclaim_state->reclaimed_slab += 1 << order;
+	uncharge_slab_page(page, order, cachep);
 	__free_pages(page, order);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index 4a261c97c138..c9a31120fa1d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -205,6 +205,12 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
 void __kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
 int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
 
+static inline int cache_vmstat_idx(struct kmem_cache *s)
+{
+	return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 
 /* List of all root caches. */
@@ -352,6 +358,25 @@ static inline void memcg_link_cache(struct kmem_cache *s,
 
 #endif /* CONFIG_MEMCG_KMEM */
 
+static __always_inline int charge_slab_page(struct page *page,
+					    gfp_t gfp, int order,
+					    struct kmem_cache *s)
+{
+	int ret = memcg_charge_slab(page, gfp, order, s);
+
+	if (!ret)
+		mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order);
+
+	return ret;
+}
+
+static __always_inline void uncharge_slab_page(struct page *page, int order,
+					       struct kmem_cache *s)
+{
+	mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order));
+	memcg_uncharge_slab(page, order, s);
+}
+
 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
 {
 	struct kmem_cache *cachep;
diff --git a/mm/slub.c b/mm/slub.c
index 43c34d54ad86..9ec25a588bdd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1494,7 +1494,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	else
 		page = __alloc_pages_node(node, flags, order);
 
-	if (page && memcg_charge_slab(page, flags, order, s)) {
+	if (page && charge_slab_page(page, flags, order, s)) {
 		__free_pages(page, order);
 		page = NULL;
 	}
@@ -1687,11 +1687,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if (!page)
 		return NULL;
 
-	mod_lruvec_page_state(page,
-		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-		1 << oo_order(oo));
-
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 
 	return page;
@@ -1725,18 +1720,13 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 			check_object(s, page, p, SLUB_RED_INACTIVE);
 	}
 
-	mod_lruvec_page_state(page,
-		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-		-pages);
-
 	__ClearPageSlabPfmemalloc(page);
 	__ClearPageSlab(page);
 
 	page->mapping = NULL;
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += pages;
-	memcg_uncharge_slab(page, order, s);
+	uncharge_slab_page(page, order, s);
 	__free_pages(page, order);
 }
 
-- 
2.20.1



* [PATCH v3 5/7] mm: rework non-root kmem_cache lifecycle management
  2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (3 preceding siblings ...)
  2019-05-08 20:24 ` [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting Roman Gushchin
@ 2019-05-08 20:24 ` Roman Gushchin
  2019-05-11  0:33   ` Shakeel Butt
  2019-05-08 20:24 ` [PATCH v3 6/7] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-08 20:24 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	cgroups, Roman Gushchin

This commit makes several important changes in the lifecycle
of a non-root kmem_cache, which also affect the lifecycle
of a memory cgroup.

Currently each charged slab page has a page->mem_cgroup pointer
to the memory cgroup and holds a reference to it.
Kmem_caches are held by the memcg and are released together with it.
This means that no kmem_cache is released while at least one
reference to the memcg exists, which is not optimal.

So the current scheme can be illustrated as:
page->mem_cgroup->kmem_cache.

To implement slab memory reparenting we need to invert the scheme
into: page->kmem_cache->mem_cgroup.

Let's make every slab page hold a reference to the kmem_cache (we
already have a stable pointer), and make kmem_caches hold a single
reference to the memory cgroup.

To make this possible we need to introduce a new percpu refcounter
for non-root kmem_caches. The counter is initialized in percpu
mode and is switched to atomic mode after deactivation, so we never
shut down an active cache. The counter is bumped for every charged page
and also for every running allocation, so the kmem_cache can't
be released until all allocations are complete.
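
Here is a condensed sketch of the refcounting lifecycle (a kernel-context
illustration of this patch, not the exact hunks below; error handling
is omitted):

	/* memcg kmem_cache creation: the counter starts in percpu mode */
	err = percpu_ref_init(&s->memcg_params.refcnt,
			      kmemcg_queue_cache_shutdown, 0, GFP_KERNEL);

	/* charging/uncharging a slab page: one reference per page */
	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);

	/* allocation path: a short-lived reference per running allocation */
	if (percpu_ref_tryget(&s->memcg_params.refcnt))
		/* allocate from this cache */;
	percpu_ref_put(&s->memcg_params.refcnt);

	/* deactivation: switch to atomic mode; once the counter hits zero,
	 * kmemcg_queue_cache_shutdown() schedules the cache destruction */
	percpu_ref_kill(&s->memcg_params.refcnt);

	/* destruction */
	percpu_ref_exit(&s->memcg_params.refcnt);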

To shut down inactive empty kmem_caches, let's reuse the
infrastructure of the RCU-delayed work queue, used previously for
the deactivation. After the generalization, it's perfectly suited
for our needs.

Since a kmem_cache can now be released at any moment after
deactivation, let's call sysfs_slab_remove() only from the shutdown
path. This makes the deactivation path simpler.

Because we no longer set the page->mem_cgroup pointer, we need to change
the way memcg-level stats work for slab pages. We can't use the
mod_lruvec_page_state() helpers anymore, so switch over to
mod_lruvec_state().

* I used the following simple approach to test the performance
(stolen from another patchset by T. Harding):

    time find / -name fname-no-exist
    echo 2 > /proc/sys/vm/drop_caches
    repeat several times

Results (I've chosen the best results of several runs):

        orig       patched

real	0m0.700s   0m0.722s
user	0m0.114s   0m0.120s
sys	0m0.317s   0m0.324s

real	0m0.729s   0m0.746s
user	0m0.110s   0m0.139s
sys	0m0.320s   0m0.317s

real	0m0.745s   0m0.719s
user	0m0.108s   0m0.124s
sys	0m0.320s   0m0.323s

So it looks like the difference is not noticeable in this test.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slab.h |  3 +-
 mm/memcontrol.c      | 53 +++++++++++++++++++++----------
 mm/slab.h            | 74 +++++++++++++++++++++++---------------------
 mm/slab_common.c     | 63 +++++++++++++++++++------------------
 mm/slub.c            | 12 +------
 5 files changed, 112 insertions(+), 93 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 47923c173f30..1b54e5f83342 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -16,6 +16,7 @@
 #include <linux/overflow.h>
 #include <linux/types.h>
 #include <linux/workqueue.h>
+#include <linux/percpu-refcount.h>
 
 
 /*
@@ -152,7 +153,6 @@ int kmem_cache_shrink(struct kmem_cache *);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
-void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -641,6 +641,7 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head children_node;
 			struct list_head kmem_caches_node;
+			struct percpu_ref refcnt;
 
 			void (*work_fn)(struct kmem_cache *);
 			union {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2c39f187cbb..9b27988c8969 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2610,12 +2610,13 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 {
 	struct memcg_kmem_cache_create_work *cw;
 
+	if (!css_tryget_online(&memcg->css))
+		return;
+
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
 	if (!cw)
 		return;
 
-	css_get(&memcg->css);
-
 	cw->memcg = memcg;
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
@@ -2651,20 +2652,35 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
 	int kmemcg_id;
+	struct memcg_cache_array *arr;
 
 	VM_BUG_ON(!is_root_cache(cachep));
 
 	if (memcg_kmem_bypass())
 		return cachep;
 
-	memcg = get_mem_cgroup_from_current();
+	rcu_read_lock();
+
+	if (unlikely(current->active_memcg))
+		memcg = current->active_memcg;
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	if (!memcg || memcg == root_mem_cgroup)
+		goto out_unlock;
+
 	kmemcg_id = READ_ONCE(memcg->kmemcg_id);
 	if (kmemcg_id < 0)
-		goto out;
+		goto out_unlock;
 
-	memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
-	if (likely(memcg_cachep))
-		return memcg_cachep;
+	arr = rcu_dereference(cachep->memcg_params.memcg_caches);
+
+	/*
+	 * Make sure we will access the up-to-date value. The code updating
+	 * memcg_caches issues a write barrier to match this (see
+	 * memcg_create_kmem_cache()).
+	 */
+	memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
 
 	/*
 	 * If we are in a safe context (can wait, and not in interrupt
@@ -2677,10 +2693,16 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	 * memcg_create_kmem_cache, this means no further allocation
 	 * could happen with the slab_mutex held. So it's better to
 	 * defer everything.
+	 *
+	 * If the memcg is dying or memcg_cache is about to be released,
+	 * don't bother creating new kmem_caches.
 	 */
-	memcg_schedule_kmem_cache_create(memcg, cachep);
-out:
-	css_put(&memcg->css);
+	if (unlikely(!memcg_cachep))
+		memcg_schedule_kmem_cache_create(memcg, cachep);
+	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
+		cachep = memcg_cachep;
+out_unlock:
+	rcu_read_unlock();
 	return cachep;
 }
 
@@ -2691,7 +2713,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 void memcg_kmem_put_cache(struct kmem_cache *cachep)
 {
 	if (!is_root_cache(cachep))
-		css_put(&cachep->memcg_params.memcg->css);
+		percpu_ref_put(&cachep->memcg_params.refcnt);
 }
 
 /**
@@ -2719,9 +2741,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 		cancel_charge(memcg, nr_pages);
 		return -ENOMEM;
 	}
-
-	page->mem_cgroup = memcg;
-
 	return 0;
 }
 
@@ -2744,8 +2763,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
 		ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
-		if (!ret)
+		if (!ret) {
+			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
+		}
 	}
 	css_put(&memcg->css);
 	return ret;
@@ -3238,7 +3259,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
 		memcg_offline_kmem(memcg);
 
 	if (memcg->kmem_state == KMEM_ALLOCATED) {
-		memcg_destroy_kmem_caches(memcg);
+		WARN_ON(!list_empty(&memcg->kmem_caches));
 		static_branch_dec(&memcg_kmem_enabled_key);
 		WARN_ON(page_counter_read(&memcg->kmem));
 	}
diff --git a/mm/slab.h b/mm/slab.h
index c9a31120fa1d..2acc68a7e0a0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -173,6 +173,7 @@ void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
 void __kmemcg_cache_deactivate(struct kmem_cache *s);
 void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
+void kmemcg_cache_shutdown(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
@@ -248,31 +249,6 @@ static inline const char *cache_name(struct kmem_cache *s)
 	return s->name;
 }
 
-/*
- * Note, we protect with RCU only the memcg_caches array, not per-memcg caches.
- * That said the caller must assure the memcg's cache won't go away by either
- * taking a css reference to the owner cgroup, or holding the slab_mutex.
- */
-static inline struct kmem_cache *
-cache_from_memcg_idx(struct kmem_cache *s, int idx)
-{
-	struct kmem_cache *cachep;
-	struct memcg_cache_array *arr;
-
-	rcu_read_lock();
-	arr = rcu_dereference(s->memcg_params.memcg_caches);
-
-	/*
-	 * Make sure we will access the up-to-date value. The code updating
-	 * memcg_caches issues a write barrier to match this (see
-	 * memcg_create_kmem_cache()).
-	 */
-	cachep = READ_ONCE(arr->entries[idx]);
-	rcu_read_unlock();
-
-	return cachep;
-}
-
 static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 {
 	if (is_root_cache(s))
@@ -284,15 +260,37 @@ static __always_inline int memcg_charge_slab(struct page *page,
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
 {
-	if (is_root_cache(s))
-		return 0;
-	return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+	int ret;
+
+	memcg = s->memcg_params.memcg;
+	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
+	if (ret)
+		return ret;
+
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
+
+	/* transfer try_charge() page references to kmem_cache */
+	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+	css_put_many(&memcg->css, 1 << order);
+
+	return 0;
 }
 
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
-	memcg_kmem_uncharge(page, order);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	memcg = s->memcg_params.memcg;
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
+	memcg_kmem_uncharge_memcg(page, order, memcg);
+
+	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
@@ -362,18 +360,24 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
-	int ret = memcg_charge_slab(page, gfp, order, s);
-
-	if (!ret)
-		mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order);
+	if (is_root_cache(s)) {
+		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+				    1 << order);
+		return 0;
+	}
 
-	return ret;
+	return memcg_charge_slab(page, gfp, order, s);
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order));
+	if (is_root_cache(s)) {
+		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+				    -(1 << order));
+		return;
+	}
+
 	memcg_uncharge_slab(page, order, s);
 }
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 4e5b4292a763..995920222127 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -45,6 +45,8 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
 static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
 		    slab_caches_to_rcu_destroy_workfn);
 
+static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref);
+
 /*
  * Set of flags that will prevent slab merging
  */
@@ -145,6 +147,12 @@ static int init_memcg_params(struct kmem_cache *s,
 	struct memcg_cache_array *arr;
 
 	if (root_cache) {
+		int ret = percpu_ref_init(&s->memcg_params.refcnt,
+					  kmemcg_queue_cache_shutdown,
+					  0, GFP_KERNEL);
+		if (ret)
+			return ret;
+
 		s->memcg_params.root_cache = root_cache;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
@@ -170,6 +178,8 @@ static void destroy_memcg_params(struct kmem_cache *s)
 {
 	if (is_root_cache(s))
 		kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
+	else
+		percpu_ref_exit(&s->memcg_params.refcnt);
 }
 
 static void free_memcg_params(struct rcu_head *rcu)
@@ -225,6 +235,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 	if (is_root_cache(s)) {
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
+		css_get(&memcg->css);
 		s->memcg_params.memcg = memcg;
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
@@ -240,6 +251,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 	} else {
 		list_del(&s->memcg_params.children_node);
 		list_del(&s->memcg_params.kmem_caches_node);
+		css_put(&s->memcg_params.memcg->css);
 	}
 }
 #else
@@ -708,16 +720,13 @@ static void kmemcg_after_rcu_workfn(struct work_struct *work)
 
 	put_online_mems();
 	put_online_cpus();
-
-	/* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
-	css_put(&s->memcg_params.memcg->css);
 }
 
 /*
  * We need to grab blocking locks.  Bounce to ->work.  The
  * work item shares the space with the RCU head and can't be
- * initialized eariler.
-*/
+ * initialized earlier.
+ */
 static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
 {
 	struct kmem_cache *s = container_of(head, struct kmem_cache,
@@ -727,9 +736,28 @@ static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
 	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
 }
 
+static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
+{
+	WARN_ON(shutdown_cache(s));
+}
+
+static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref)
+{
+	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
+					    memcg_params.refcnt);
+
+	if (s->memcg_params.root_cache->memcg_params.dying)
+		return;
+
+	WARN_ON(s->memcg_params.work_fn);
+	s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
+	call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
+}
+
 static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
 {
 	__kmemcg_cache_deactivate_after_rcu(s);
+	percpu_ref_kill(&s->memcg_params.refcnt);
 }
 
 static void kmemcg_cache_deactivate(struct kmem_cache *s)
@@ -739,9 +767,6 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
 	if (s->memcg_params.root_cache->memcg_params.dying)
 		return;
 
-	/* pin memcg so that @s doesn't get destroyed in the middle */
-	css_get(&s->memcg_params.memcg->css);
-
 	WARN_ON_ONCE(s->memcg_params.work_fn);
 	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
 	call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
@@ -775,28 +800,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	put_online_cpus();
 }
 
-void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
-{
-	struct kmem_cache *s, *s2;
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
-				 memcg_params.kmem_caches_node) {
-		/*
-		 * The cgroup is about to be freed and therefore has no charges
-		 * left. Hence, all its caches must be empty by now.
-		 */
-		BUG_ON(shutdown_cache(s));
-	}
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
 static int shutdown_memcg_caches(struct kmem_cache *s)
 {
 	struct memcg_cache_array *arr;
diff --git a/mm/slub.c b/mm/slub.c
index 9ec25a588bdd..e7ce810ebd02 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4022,18 +4022,8 @@ void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
 {
 	/*
 	 * Called with all the locks held after a sched RCU grace period.
-	 * Even if @s becomes empty after shrinking, we can't know that @s
-	 * doesn't have allocations already in-flight and thus can't
-	 * destroy @s until the associated memcg is released.
-	 *
-	 * However, let's remove the sysfs files for empty caches here.
-	 * Each cache has a lot of interface files which aren't
-	 * particularly useful for empty draining caches; otherwise, we can
-	 * easily end up with millions of unnecessary sysfs files on
-	 * systems which have a lot of memory and transient cgroups.
 	 */
-	if (!__kmem_cache_shrink(s))
-		sysfs_slab_remove(s);
+	__kmem_cache_shrink(s);
 }
 
 void __kmemcg_cache_deactivate(struct kmem_cache *s)
-- 
2.20.1



* [PATCH v3 6/7] mm: reparent slab memory on cgroup removal
  2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (4 preceding siblings ...)
  2019-05-08 20:24 ` [PATCH v3 5/7] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
@ 2019-05-08 20:24 ` Roman Gushchin
  2019-05-11  0:34   ` Shakeel Butt
  2019-05-08 20:24 ` [PATCH v3 7/7] mm: fix /proc/kpagecgroup interface for slab pages Roman Gushchin
  2019-05-11  0:32 ` [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Shakeel Butt
  7 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-08 20:24 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	cgroups, Roman Gushchin

Let's reparent memcg slab memory on memcg offlining. This allows us
to release the memory cgroup without waiting for the last outstanding
kernel object (e.g. a dentry used by another application).

So instead of reparenting all accounted slab pages, let's reparent
a relatively small number of kmem_caches. Reparenting is performed
as part of the deactivation process.

Since the parent cgroup is already charged, all we need to do
is to splice the list of kmem_caches onto the parent's kmem_caches list,
swap the memcg pointer, drop the css reference for each kmem_cache,
and adjust the parent's css refcounter accordingly. Quite simple.

Please note that kmem_cache->memcg_params.memcg isn't a stable
pointer anymore. It's safe to read it under rcu_read_lock() or
with slab_mutex held.
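
Condensed from the hunks below (not new code), the two safe access
patterns look roughly like this:

	/* hot paths: under RCU */
	rcu_read_lock();
	memcg = rcu_dereference(s->memcg_params.memcg);
	/* ... use memcg ... */
	rcu_read_unlock();

	/* slow paths: under slab_mutex */
	memcg = rcu_dereference_protected(s->memcg_params.memcg,
					  lockdep_is_held(&slab_mutex));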

We can race with the slab allocation and deallocation paths. It's not
a big problem: the parent's charge and the global slab stats are always
correct, and we no longer care about the child's usage and stats.
The child cgroup is already offline, so we don't use or show it
anywhere.

The local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE)
aren't used anywhere except count_shadow_nodes(). But even there it
won't break anything: after reparenting, "nodes" will be 0 on the child
level (because we're already reparenting shrinker lists), and on the
parent level the page stats have always been 0, so this patch won't
change anything.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slab.h |  4 ++--
 mm/memcontrol.c      | 14 ++++++++------
 mm/slab.h            | 14 +++++++++-----
 mm/slab_common.c     | 23 ++++++++++++++++++++---
 4 files changed, 39 insertions(+), 16 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1b54e5f83342..109cab2ad9b4 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -152,7 +152,7 @@ void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
-void memcg_deactivate_kmem_caches(struct mem_cgroup *);
+void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -638,7 +638,7 @@ struct memcg_cache_params {
 			bool dying;
 		};
 		struct {
-			struct mem_cgroup *memcg;
+			struct mem_cgroup __rcu *memcg;
 			struct list_head children_node;
 			struct list_head kmem_caches_node;
 			struct percpu_ref refcnt;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b27988c8969..6e4d9ed16069 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3220,15 +3220,15 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmem_state = KMEM_ALLOCATED;
 
-	memcg_deactivate_kmem_caches(memcg);
-
-	kmemcg_id = memcg->kmemcg_id;
-	BUG_ON(kmemcg_id < 0);
-
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
 
+	memcg_deactivate_kmem_caches(memcg, parent);
+
+	kmemcg_id = memcg->kmemcg_id;
+	BUG_ON(kmemcg_id < 0);
+
 	/*
 	 * Change kmemcg_id of this cgroup and all its descendants to the
 	 * parent's id, and then move all entries from this cgroup's list_lrus
@@ -3261,7 +3261,6 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
 	if (memcg->kmem_state == KMEM_ALLOCATED) {
 		WARN_ON(!list_empty(&memcg->kmem_caches));
 		static_branch_dec(&memcg_kmem_enabled_key);
-		WARN_ON(page_counter_read(&memcg->kmem));
 	}
 }
 #else
@@ -4673,6 +4672,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
+#ifdef CONFIG_MEMCG_KMEM
+		INIT_LIST_HEAD(&memcg->kmem_caches);
+#endif
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 2acc68a7e0a0..acdc1810639d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -264,10 +264,11 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	struct lruvec *lruvec;
 	int ret;
 
-	memcg = s->memcg_params.memcg;
+	rcu_read_lock();
+	memcg = rcu_dereference(s->memcg_params.memcg);
 	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
 	if (ret)
-		return ret;
+		goto out;
 
 	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
@@ -275,8 +276,9 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	/* transfer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 	css_put_many(&memcg->css, 1 << order);
-
-	return 0;
+out:
+	rcu_read_unlock();
+	return ret;
 }
 
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
@@ -285,10 +287,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
-	memcg = s->memcg_params.memcg;
+	rcu_read_lock();
+	memcg = rcu_dereference(s->memcg_params.memcg);
 	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
 	memcg_kmem_uncharge_memcg(page, order, memcg);
+	rcu_read_unlock();
 
 	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 995920222127..36673a43ed31 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -236,7 +236,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
 		css_get(&memcg->css);
-		s->memcg_params.memcg = memcg;
+		rcu_assign_pointer(s->memcg_params.memcg, memcg);
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
 		list_add(&s->memcg_params.kmem_caches_node,
@@ -251,7 +251,8 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 	} else {
 		list_del(&s->memcg_params.children_node);
 		list_del(&s->memcg_params.kmem_caches_node);
-		css_put(&s->memcg_params.memcg->css);
+		mem_cgroup_put(rcu_dereference_protected(s->memcg_params.memcg,
+			lockdep_is_held(&slab_mutex)));
 	}
 }
 #else
@@ -772,11 +773,13 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
 	call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
 }
 
-void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
+void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
 {
 	int idx;
 	struct memcg_cache_array *arr;
 	struct kmem_cache *s, *c;
+	unsigned int nr_reparented;
 
 	idx = memcg_cache_id(memcg);
 
@@ -794,6 +797,20 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		kmemcg_cache_deactivate(c);
 		arr->entries[idx] = NULL;
 	}
+	if (memcg != parent) {
+		nr_reparented = 0;
+		list_for_each_entry(s, &memcg->kmem_caches,
+				    memcg_params.kmem_caches_node) {
+			rcu_assign_pointer(s->memcg_params.memcg, parent);
+			css_put(&memcg->css);
+			nr_reparented++;
+		}
+		if (nr_reparented) {
+			list_splice_init(&memcg->kmem_caches,
+					 &parent->kmem_caches);
+			css_get_many(&parent->css, nr_reparented);
+		}
+	}
 	mutex_unlock(&slab_mutex);
 
 	put_online_mems();
-- 
2.20.1



* [PATCH v3 7/7] mm: fix /proc/kpagecgroup interface for slab pages
  2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (5 preceding siblings ...)
  2019-05-08 20:24 ` [PATCH v3 6/7] mm: reparent slab memory on cgroup removal Roman Gushchin
@ 2019-05-08 20:24 ` Roman Gushchin
  2019-05-11  0:34   ` Shakeel Butt
  2019-05-11  0:32 ` [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Shakeel Butt
  7 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-08 20:24 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	cgroups, Roman Gushchin

Switching to an indirect scheme of getting the mem_cgroup pointer for
!root slab pages broke the /proc/kpagecgroup interface for them.

Let's fix it by teaching page_cgroup_ino() how to get the memcg
pointer for slab pages.

Reported-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/memcontrol.c  |  5 ++++-
 mm/slab.h        | 21 +++++++++++++++++++++
 mm/slab_common.c |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e4d9ed16069..8114838759f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -494,7 +494,10 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	memcg = READ_ONCE(page->mem_cgroup);
+	if (PageSlab(page))
+		memcg = memcg_from_slab_page(page);
+	else
+		memcg = READ_ONCE(page->mem_cgroup);
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
 	if (memcg)
diff --git a/mm/slab.h b/mm/slab.h
index acdc1810639d..cb684fbe2cc2 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -256,6 +256,22 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+{
+	struct kmem_cache *s;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (PageTail(page))
+		page = compound_head(page);
+
+	s = READ_ONCE(page->slab_cache);
+	if (s && !is_root_cache(s))
+		return rcu_dereference(s->memcg_params.memcg);
+
+	return NULL;
+}
+
 static __always_inline int memcg_charge_slab(struct page *page,
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
@@ -338,6 +354,11 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s;
 }
 
+static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+{
+	return NULL;
+}
+
 static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
 				    struct kmem_cache *s)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 36673a43ed31..0cfdad0a0aac 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -253,6 +253,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 		list_del(&s->memcg_params.kmem_caches_node);
 		mem_cgroup_put(rcu_dereference_protected(s->memcg_params.memcg,
 			lockdep_is_held(&slab_mutex)));
+		rcu_assign_pointer(s->memcg_params.memcg, NULL);
 	}
 }
 #else
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/7] mm: reparent slab memory on cgroup removal
  2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (6 preceding siblings ...)
  2019-05-08 20:24 ` [PATCH v3 7/7] mm: fix /proc/kpagecgroup interface for slab pages Roman Gushchin
@ 2019-05-11  0:32 ` Shakeel Butt
  2019-05-13 20:21   ` Roman Gushchin
  7 siblings, 1 reply; 20+ messages in thread
From: Shakeel Butt @ 2019-05-11  0:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Wed, May 8, 2019 at 1:30 PM
To: Andrew Morton, Shakeel Butt
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
Gushchin

> # Why do we need this?
>
> We've noticed that the number of dying cgroups is steadily growing on most
> of our hosts in production. The following investigation revealed an issue
> in userspace memory reclaim code [1], accounting of kernel stacks [2],
> and also the mainreason: slab objects.
>
> The underlying problem is quite simple: any page charged
> to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> all charged pages are gone. If a slab object is actively used by other cgroups,
> it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
>
> Slab objects, and first of all vfs cache, is shared between cgroups, which are
> using the same underlying fs, and what's even more important, it's shared
> between multiple generations of the same workload. So if something is running
> periodically every time in a new cgroup (like how systemd works), we do
> accumulate multiple dying cgroups.
>
> Strictly speaking pagecache isn't different here, but there is a key difference:
> we disable protection and apply some extra pressure on LRUs of dying cgroups,

How do you apply extra pressure on dying cgroups? cgroup-v2 does not
have memory.force_empty.


> and these LRUs contain all charged pages.
> My experiments show that with the disabled kernel memory accounting the number
> of dying cgroups stabilizes at a relatively small number (~100, depends on
> memory pressure and cgroup creation rate), and with kernel memory accounting
> it grows pretty steadily up to several thousands.
>
> Memory cgroups are quite complex and big objects (mostly due to percpu stats),
> so it leads to noticeable memory losses. Memory occupied by dying cgroups
> is measured in hundreds of megabytes. I've even seen a host with more than 100Gb
> of memory wasted for dying cgroups. It leads to a degradation of performance
> with the uptime, and generally limits the usage of cgroups.
>
> My previous attempt [3] to fix the problem by applying extra pressure on slab
> shrinker lists caused a regressions with xfs and ext4, and has been reverted [4].
> The following attempts to find the right balance [5, 6] were not successful.
>
> So instead of trying to find a maybe non-existing balance, let's do reparent
> the accounted slabs to the parent cgroup on cgroup removal.
>
>
> # Implementation approach
>
> There is however a significant problem with reparenting of slab memory:
> there is no list of charged pages. Some of them are in shrinker lists,
> but not all. Introducing of a new list is really not an option.
>
> But fortunately there is a way forward: every slab page has a stable pointer
> to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> instead of slab pages.
>
> It's actually simpler and cheaper, but requires some underlying changes:
> 1) Make kmem_caches to hold a single reference to the memory cgroup,
>    instead of a separate reference per every slab page.
> 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
>    page->kmem_cache->memcg indirection instead. It's used only on
>    slab page release, so it shouldn't be a big issue.
> 3) Introduce a refcounter for non-root slab caches. It's required to
>    be able to destroy kmem_caches when they become empty and release
>    the associated memory cgroup.
>
> There is a bonus: currently we do release empty kmem_caches on cgroup
> removal, however all other are waiting for the releasing of the memory cgroup.
> These refactorings allow kmem_caches to be released as soon as they
> become inactive and free.
>
> Some additional implementation details are provided in corresponding
> commit messages.
>
>
> # Results
>
> Below is the average number of dying cgroups on two groups of our production
> hosts. They do run some sort of web frontend workload, the memory pressure
> is moderate. As we can see, with the kernel memory reparenting the number
> stabilizes in 50s range; however with the original version it grows almost
> linearly and doesn't show any signs of plateauing. The difference in slab
> and percpu usage between patched and unpatched versions also grows linearly.
> In 6 days it reached 200Mb.
>
> day           0    1    2    3    4    5    6
> original     39  338  580  827 1098 1349 1574
> patched      23   44   45   47   50   46   55
> mem diff(Mb) 53   73   99  137  148  182  209
>
>
> # History
>
> v3:
>   1) reworked memcg kmem_cache search on allocation path
>   2) fixed /proc/kpagecgroup interface
>
> v2:
>   1) switched to percpu kmem_cache refcounter
>   2) a reference to kmem_cache is held during the allocation
>   3) slabs stats are fixed for !MEMCG case (and the refactoring
>      is separated into a standalone patch)
>   4) kmem_cache reparenting is performed from deactivatation context
>
> v1:
>   https://lkml.org/lkml/2019/4/17/1095
>
>
> # Links
>
> [1]: commit 68600f623d69 ("mm: don't miss the last page because of
> round-off error")
> [2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
> [3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively
> small number of objects")
> [4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs
> with a relatively small number of objects")
> [5]: https://lkml.org/lkml/2019/1/28/1865
> [6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2
>
>
> Roman Gushchin (7):
>   mm: postpone kmem_cache memcg pointer initialization to
>     memcg_link_cache()
>   mm: generalize postponed non-root kmem_cache deactivation
>   mm: introduce __memcg_kmem_uncharge_memcg()
>   mm: unify SLAB and SLUB page accounting
>   mm: rework non-root kmem_cache lifecycle management
>   mm: reparent slab memory on cgroup removal
>   mm: fix /proc/kpagecgroup interface for slab pages
>
>  include/linux/memcontrol.h |  10 +++
>  include/linux/slab.h       |  13 ++--
>  mm/memcontrol.c            |  97 ++++++++++++++++--------
>  mm/slab.c                  |  25 ++----
>  mm/slab.h                  | 120 +++++++++++++++++++++--------
>  mm/slab_common.c           | 151 ++++++++++++++++++++-----------------
>  mm/slub.c                  |  36 ++-------
>  7 files changed, 267 insertions(+), 185 deletions(-)
>
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
  2019-05-08 20:24 ` [PATCH v3 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
@ 2019-05-11  0:32   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-05-11  0:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Wed, May 8, 2019 at 1:30 PM
To: Andrew Morton, Shakeel Butt
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
Gushchin

> Initialize kmem_cache->memcg_params.memcg pointer in
> memcg_link_cache() rather than in init_memcg_params().
>
> Once kmem_cache will hold a reference to the memory cgroup,
> it will simplify the refcounting.
>
> For non-root kmem_caches memcg_link_cache() is always called
> before the kmem_cache becomes visible to a user, so it's safe.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


> ---
>  mm/slab.c        |  2 +-
>  mm/slab.h        |  5 +++--
>  mm/slab_common.c | 14 +++++++-------
>  mm/slub.c        |  2 +-
>  4 files changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 2915d912e89a..f6eff59e018e 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1268,7 +1268,7 @@ void __init kmem_cache_init(void)
>                                   nr_node_ids * sizeof(struct kmem_cache_node *),
>                                   SLAB_HWCACHE_ALIGN, 0, 0);
>         list_add(&kmem_cache->list, &slab_caches);
> -       memcg_link_cache(kmem_cache);
> +       memcg_link_cache(kmem_cache, NULL);
>         slab_state = PARTIAL;
>
>         /*
> diff --git a/mm/slab.h b/mm/slab.h
> index 43ac818b8592..6a562ca72bca 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -289,7 +289,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>  }
>
>  extern void slab_init_memcg_params(struct kmem_cache *);
> -extern void memcg_link_cache(struct kmem_cache *s);
> +extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
>  extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
>                                 void (*deact_fn)(struct kmem_cache *));
>
> @@ -344,7 +344,8 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
>  {
>  }
>
> -static inline void memcg_link_cache(struct kmem_cache *s)
> +static inline void memcg_link_cache(struct kmem_cache *s,
> +                                   struct mem_cgroup *memcg)
>  {
>  }
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 58251ba63e4a..6e00bdf8618d 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -140,13 +140,12 @@ void slab_init_memcg_params(struct kmem_cache *s)
>  }
>
>  static int init_memcg_params(struct kmem_cache *s,
> -               struct mem_cgroup *memcg, struct kmem_cache *root_cache)
> +                            struct kmem_cache *root_cache)
>  {
>         struct memcg_cache_array *arr;
>
>         if (root_cache) {
>                 s->memcg_params.root_cache = root_cache;
> -               s->memcg_params.memcg = memcg;
>                 INIT_LIST_HEAD(&s->memcg_params.children_node);
>                 INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
>                 return 0;
> @@ -221,11 +220,12 @@ int memcg_update_all_caches(int num_memcgs)
>         return ret;
>  }
>
> -void memcg_link_cache(struct kmem_cache *s)
> +void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
>  {
>         if (is_root_cache(s)) {
>                 list_add(&s->root_caches_node, &slab_root_caches);
>         } else {
> +               s->memcg_params.memcg = memcg;
>                 list_add(&s->memcg_params.children_node,
>                          &s->memcg_params.root_cache->memcg_params.children);
>                 list_add(&s->memcg_params.kmem_caches_node,
> @@ -244,7 +244,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>  }
>  #else
>  static inline int init_memcg_params(struct kmem_cache *s,
> -               struct mem_cgroup *memcg, struct kmem_cache *root_cache)
> +                                   struct kmem_cache *root_cache)
>  {
>         return 0;
>  }
> @@ -384,7 +384,7 @@ static struct kmem_cache *create_cache(const char *name,
>         s->useroffset = useroffset;
>         s->usersize = usersize;
>
> -       err = init_memcg_params(s, memcg, root_cache);
> +       err = init_memcg_params(s, root_cache);
>         if (err)
>                 goto out_free_cache;
>
> @@ -394,7 +394,7 @@ static struct kmem_cache *create_cache(const char *name,
>
>         s->refcount = 1;
>         list_add(&s->list, &slab_caches);
> -       memcg_link_cache(s);
> +       memcg_link_cache(s, memcg);
>  out:
>         if (err)
>                 return ERR_PTR(err);
> @@ -997,7 +997,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
>
>         create_boot_cache(s, name, size, flags, useroffset, usersize);
>         list_add(&s->list, &slab_caches);
> -       memcg_link_cache(s);
> +       memcg_link_cache(s, NULL);
>         s->refcount = 1;
>         return s;
>  }
> diff --git a/mm/slub.c b/mm/slub.c
> index 5b2e364102e1..16f7e4f5a141 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4219,7 +4219,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
>         }
>         slab_init_memcg_params(s);
>         list_add(&s->list, &slab_caches);
> -       memcg_link_cache(s);
> +       memcg_link_cache(s, NULL);
>         return s;
>  }
>
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/7] mm: generalize postponed non-root kmem_cache deactivation
  2019-05-08 20:24 ` [PATCH v3 2/7] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
@ 2019-05-11  0:33   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-05-11  0:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Wed, May 8, 2019 at 1:30 PM
To: Andrew Morton, Shakeel Butt
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
Gushchin

> Currently SLUB uses a work scheduled after an RCU grace period
> to deactivate a non-root kmem_cache. This mechanism can be reused
> for kmem_caches reparenting, but requires some generalization.
>
> Let's decouple all infrastructure (rcu callback, work callback)
> from the SLUB-specific code, so it can be used with SLAB as well.
>
> Also, let's rename some functions to make the code look simpler.
> All SLAB/SLUB-specific functions start with "__". Remove "deact_"
> prefix from the corresponding struct fields.
>
> Here is the graph of a new calling scheme:
> kmemcg_cache_deactivate()
>   __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
>   kmemcg_schedule_work_after_rcu()             rcu
>     kmemcg_after_rcu_workfn()                  work
>       kmemcg_cache_deactivate_after_rcu()
>         __kmemcg_cache_deactivate_after_rcu()  SLAB/SLUB-specific
>
> instead of:
> __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
>   slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
>     kmemcg_deactivate_rcufn                    SLUB-only, rcu
>       kmemcg_deactivate_workfn                 SLUB-only, work
>         kmemcg_cache_deact_after_rcu()         SLUB-only
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  include/linux/slab.h |  6 ++---
>  mm/slab.c            |  4 +++
>  mm/slab.h            |  3 ++-
>  mm/slab_common.c     | 62 ++++++++++++++++++++------------------------
>  mm/slub.c            |  8 +-----
>  5 files changed, 38 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 9449b19c5f10..47923c173f30 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -642,10 +642,10 @@ struct memcg_cache_params {
>                         struct list_head children_node;
>                         struct list_head kmem_caches_node;
>
> -                       void (*deact_fn)(struct kmem_cache *);
> +                       void (*work_fn)(struct kmem_cache *);
>                         union {
> -                               struct rcu_head deact_rcu_head;
> -                               struct work_struct deact_work;
> +                               struct rcu_head rcu_head;
> +                               struct work_struct work;
>                         };
>                 };
>         };
> diff --git a/mm/slab.c b/mm/slab.c
> index f6eff59e018e..83000e46b870 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2281,6 +2281,10 @@ void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
>  {
>         __kmem_cache_shrink(cachep);
>  }
> +
> +void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
> +{
> +}
>  #endif
>
>  int __kmem_cache_shutdown(struct kmem_cache *cachep)
> diff --git a/mm/slab.h b/mm/slab.h
> index 6a562ca72bca..4a261c97c138 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -172,6 +172,7 @@ int __kmem_cache_shutdown(struct kmem_cache *);
>  void __kmem_cache_release(struct kmem_cache *);
>  int __kmem_cache_shrink(struct kmem_cache *);
>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
> +void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
>  void slab_kmem_cache_release(struct kmem_cache *);
>
>  struct seq_file;
> @@ -291,7 +292,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>  extern void slab_init_memcg_params(struct kmem_cache *);
>  extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
>  extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
> -                               void (*deact_fn)(struct kmem_cache *));
> +                               void (*work_fn)(struct kmem_cache *));
>
>  #else /* CONFIG_MEMCG_KMEM */
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 6e00bdf8618d..4e5b4292a763 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -691,17 +691,18 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>         put_online_cpus();
>  }
>
> -static void kmemcg_deactivate_workfn(struct work_struct *work)
> +static void kmemcg_after_rcu_workfn(struct work_struct *work)
>  {
>         struct kmem_cache *s = container_of(work, struct kmem_cache,
> -                                           memcg_params.deact_work);
> +                                           memcg_params.work);
>
>         get_online_cpus();
>         get_online_mems();
>
>         mutex_lock(&slab_mutex);
>
> -       s->memcg_params.deact_fn(s);
> +       s->memcg_params.work_fn(s);
> +       s->memcg_params.work_fn = NULL;
>
>         mutex_unlock(&slab_mutex);
>
> @@ -712,37 +713,28 @@ static void kmemcg_deactivate_workfn(struct work_struct *work)
>         css_put(&s->memcg_params.memcg->css);
>  }
>
> -static void kmemcg_deactivate_rcufn(struct rcu_head *head)
> +/*
> + * We need to grab blocking locks.  Bounce to ->work.  The
> + * work item shares the space with the RCU head and can't be
> + * initialized eariler.
> +*/
> +static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>  {
>         struct kmem_cache *s = container_of(head, struct kmem_cache,
> -                                           memcg_params.deact_rcu_head);
> +                                           memcg_params.rcu_head);
>
> -       /*
> -        * We need to grab blocking locks.  Bounce to ->deact_work.  The
> -        * work item shares the space with the RCU head and can't be
> -        * initialized eariler.
> -        */
> -       INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
> -       queue_work(memcg_kmem_cache_wq, &s->memcg_params.deact_work);
> +       INIT_WORK(&s->memcg_params.work, kmemcg_after_rcu_workfn);
> +       queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
>  }
>
> -/**
> - * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
> - *                                        sched RCU grace period
> - * @s: target kmem_cache
> - * @deact_fn: deactivation function to call
> - *
> - * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
> - * held after a sched RCU grace period.  The slab is guaranteed to stay
> - * alive until @deact_fn is finished.  This is to be used from
> - * __kmemcg_cache_deactivate().
> - */
> -void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
> -                                          void (*deact_fn)(struct kmem_cache *))
> +static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
> -       if (WARN_ON_ONCE(is_root_cache(s)) ||
> -           WARN_ON_ONCE(s->memcg_params.deact_fn))
> -               return;
> +       __kmemcg_cache_deactivate_after_rcu(s);
> +}
> +
> +static void kmemcg_cache_deactivate(struct kmem_cache *s)
> +{
> +       __kmemcg_cache_deactivate(s);
>
>         if (s->memcg_params.root_cache->memcg_params.dying)
>                 return;
> @@ -750,8 +742,9 @@ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
>         /* pin memcg so that @s doesn't get destroyed in the middle */
>         css_get(&s->memcg_params.memcg->css);
>
> -       s->memcg_params.deact_fn = deact_fn;
> -       call_rcu(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
> +       WARN_ON_ONCE(s->memcg_params.work_fn);
> +       s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
> +       call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
>  }
>
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
> @@ -773,7 +766,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
>                 if (!c)
>                         continue;
>
> -               __kmemcg_cache_deactivate(c);
> +               kmemcg_cache_deactivate(c);
>                 arr->entries[idx] = NULL;
>         }
>         mutex_unlock(&slab_mutex);
> @@ -866,11 +859,12 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
>         mutex_unlock(&slab_mutex);
>
>         /*
> -        * SLUB deactivates the kmem_caches through call_rcu. Make
> +        * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
>          * sure all registered rcu callbacks have been invoked.
>          */
> -       if (IS_ENABLED(CONFIG_SLUB))
> -               rcu_barrier();
> +#ifndef CONFIG_SLOB
> +       rcu_barrier();
> +#endif
>
>         /*
>          * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
> diff --git a/mm/slub.c b/mm/slub.c
> index 16f7e4f5a141..43c34d54ad86 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4028,7 +4028,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
>  }
>
>  #ifdef CONFIG_MEMCG
> -static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
> +void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>         /*
>          * Called with all the locks held after a sched RCU grace period.
> @@ -4054,12 +4054,6 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
>          */
>         slub_set_cpu_partial(s, 0);
>         s->min_partial = 0;
> -
> -       /*
> -        * s->cpu_partial is checked locklessly (see put_cpu_partial), so
> -        * we have to make sure the change is visible before shrinking.
> -        */
> -       slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
>  }
>  #endif /* CONFIG_MEMCG */
>
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 3/7] mm: introduce __memcg_kmem_uncharge_memcg()
  2019-05-08 20:24 ` [PATCH v3 3/7] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
@ 2019-05-11  0:33   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-05-11  0:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Wed, May 8, 2019 at 1:30 PM
To: Andrew Morton, Shakeel Butt
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
Gushchin

> Let's separate the page counter modification code out of
> __memcg_kmem_uncharge() in a way similar to what
> __memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.
>
> This will allow to reuse this code later using a new
> memcg_kmem_uncharge_memcg() wrapper, which calls
> __memcg_kmem_unchare_memcg() if memcg_kmem_enabled()

__memcg_kmem_uncharge_memcg()

> check is passed.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


> ---
>  include/linux/memcontrol.h | 10 ++++++++++
>  mm/memcontrol.c            | 25 +++++++++++++++++--------
>  2 files changed, 27 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 36bdfe8e5965..deb209510902 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1298,6 +1298,8 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
>  void __memcg_kmem_uncharge(struct page *page, int order);
>  int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>                               struct mem_cgroup *memcg);
> +void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
> +                                unsigned int nr_pages);
>
>  extern struct static_key_false memcg_kmem_enabled_key;
>  extern struct workqueue_struct *memcg_kmem_cache_wq;
> @@ -1339,6 +1341,14 @@ static inline int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp,
>                 return __memcg_kmem_charge_memcg(page, gfp, order, memcg);
>         return 0;
>  }
> +
> +static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
> +                                            struct mem_cgroup *memcg)
> +{
> +       if (memcg_kmem_enabled())
> +               __memcg_kmem_uncharge_memcg(memcg, 1 << order);
> +}
> +
>  /*
>   * helper for accessing a memcg's index. It will be used as an index in the
>   * child cache array in kmem_cache, and also to derive its name. This function
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 48a8f1c35176..b2c39f187cbb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2750,6 +2750,22 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
>         css_put(&memcg->css);
>         return ret;
>  }
> +
> +/**
> + * __memcg_kmem_uncharge_memcg: uncharge a kmem page
> + * @memcg: memcg to uncharge
> + * @nr_pages: number of pages to uncharge
> + */
> +void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
> +                                unsigned int nr_pages)
> +{
> +       if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +               page_counter_uncharge(&memcg->kmem, nr_pages);
> +
> +       page_counter_uncharge(&memcg->memory, nr_pages);
> +       if (do_memsw_account())
> +               page_counter_uncharge(&memcg->memsw, nr_pages);
> +}
>  /**
>   * __memcg_kmem_uncharge: uncharge a kmem page
>   * @page: page to uncharge
> @@ -2764,14 +2780,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
>                 return;
>
>         VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
> -
> -       if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> -               page_counter_uncharge(&memcg->kmem, nr_pages);
> -
> -       page_counter_uncharge(&memcg->memory, nr_pages);
> -       if (do_memsw_account())
> -               page_counter_uncharge(&memcg->memsw, nr_pages);
> -
> +       __memcg_kmem_uncharge_memcg(memcg, nr_pages);
>         page->mem_cgroup = NULL;
>
>         /* slab pages do not have PageKmemcg flag set */
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting
  2019-05-08 20:24 ` [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting Roman Gushchin
@ 2019-05-11  0:33   ` Shakeel Butt
  2019-05-13 18:01   ` Christopher Lameter
  1 sibling, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-05-11  0:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Wed, May 8, 2019 at 1:40 PM
To: Andrew Morton, Shakeel Butt
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
Gushchin

> Currently the page accounting code is duplicated in SLAB and SLUB
> internals. Let's move it into new (un)charge_slab_page helpers
> in the slab_common.c file. These helpers will be responsible
> for statistics (global and memcg-aware) and memcg charging.
> So they are replacing direct memcg_(un)charge_slab() calls.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>


> ---
>  mm/slab.c | 19 +++----------------
>  mm/slab.h | 25 +++++++++++++++++++++++++
>  mm/slub.c | 14 ++------------
>  3 files changed, 30 insertions(+), 28 deletions(-)
>
> diff --git a/mm/slab.c b/mm/slab.c
> index 83000e46b870..32e6af9ed9af 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1389,7 +1389,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>                                                                 int nodeid)
>  {
>         struct page *page;
> -       int nr_pages;
>
>         flags |= cachep->allocflags;
>
> @@ -1399,17 +1398,11 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>                 return NULL;
>         }
>
> -       if (memcg_charge_slab(page, flags, cachep->gfporder, cachep)) {
> +       if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
>                 __free_pages(page, cachep->gfporder);
>                 return NULL;
>         }
>
> -       nr_pages = (1 << cachep->gfporder);
> -       if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
> -               mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
> -       else
> -               mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
> -
>         __SetPageSlab(page);
>         /* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
>         if (sk_memalloc_socks() && page_is_pfmemalloc(page))
> @@ -1424,12 +1417,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
>  static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
>  {
>         int order = cachep->gfporder;
> -       unsigned long nr_freed = (1 << order);
> -
> -       if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
> -               mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
> -       else
> -               mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
>
>         BUG_ON(!PageSlab(page));
>         __ClearPageSlabPfmemalloc(page);
> @@ -1438,8 +1425,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
>         page->mapping = NULL;
>
>         if (current->reclaim_state)
> -               current->reclaim_state->reclaimed_slab += nr_freed;
> -       memcg_uncharge_slab(page, order, cachep);
> +               current->reclaim_state->reclaimed_slab += 1 << order;
> +       uncharge_slab_page(page, order, cachep);
>         __free_pages(page, order);
>  }
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 4a261c97c138..c9a31120fa1d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -205,6 +205,12 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
>  void __kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
>  int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
>
> +static inline int cache_vmstat_idx(struct kmem_cache *s)
> +{
> +       return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> +               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> +}
> +
>  #ifdef CONFIG_MEMCG_KMEM
>
>  /* List of all root caches. */
> @@ -352,6 +358,25 @@ static inline void memcg_link_cache(struct kmem_cache *s,
>
>  #endif /* CONFIG_MEMCG_KMEM */
>
> +static __always_inline int charge_slab_page(struct page *page,
> +                                           gfp_t gfp, int order,
> +                                           struct kmem_cache *s)
> +{
> +       int ret = memcg_charge_slab(page, gfp, order, s);
> +
> +       if (!ret)
> +               mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order);
> +
> +       return ret;
> +}
> +
> +static __always_inline void uncharge_slab_page(struct page *page, int order,
> +                                              struct kmem_cache *s)
> +{
> +       mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order));
> +       memcg_uncharge_slab(page, order, s);
> +}
> +
>  static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
>  {
>         struct kmem_cache *cachep;
> diff --git a/mm/slub.c b/mm/slub.c
> index 43c34d54ad86..9ec25a588bdd 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1494,7 +1494,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
>         else
>                 page = __alloc_pages_node(node, flags, order);
>
> -       if (page && memcg_charge_slab(page, flags, order, s)) {
> +       if (page && charge_slab_page(page, flags, order, s)) {
>                 __free_pages(page, order);
>                 page = NULL;
>         }
> @@ -1687,11 +1687,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>         if (!page)
>                 return NULL;
>
> -       mod_lruvec_page_state(page,
> -               (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> -               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
> -               1 << oo_order(oo));
> -
>         inc_slabs_node(s, page_to_nid(page), page->objects);
>
>         return page;
> @@ -1725,18 +1720,13 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
>                         check_object(s, page, p, SLUB_RED_INACTIVE);
>         }
>
> -       mod_lruvec_page_state(page,
> -               (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> -               NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
> -               -pages);
> -
>         __ClearPageSlabPfmemalloc(page);
>         __ClearPageSlab(page);
>
>         page->mapping = NULL;
>         if (current->reclaim_state)
>                 current->reclaim_state->reclaimed_slab += pages;
> -       memcg_uncharge_slab(page, order, s);
> +       uncharge_slab_page(page, order, s);
>         __free_pages(page, order);
>  }
>
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 5/7] mm: rework non-root kmem_cache lifecycle management
  2019-05-08 20:24 ` [PATCH v3 5/7] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
@ 2019-05-11  0:33   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-05-11  0:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Wed, May 8, 2019 at 1:41 PM
To: Andrew Morton, Shakeel Butt
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
Gushchin

> This commit makes several important changes in the lifecycle
> of a non-root kmem_cache, which also affect the lifecycle
> of a memory cgroup.
>
> Currently each charged slab page has a page->mem_cgroup pointer
> to the memory cgroup and holds a reference to it.
> Kmem_caches are held by the memcg and are released with it.
> It means that none of kmem_caches are released unless at least one
> reference to the memcg exists, which is not optimal.
>
> So the current scheme can be illustrated as:
> page->mem_cgroup->kmem_cache.
>
> To implement the slab memory reparenting we need to invert the scheme
> into: page->kmem_cache->mem_cgroup.
>
> Let's make every page to hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches to hold a single
> reference to the memory cgroup.
>
> To make this possible we need to introduce a new percpu refcounter
> for non-root kmem_caches. The counter is initialized to the percpu
> mode, and is switched to atomic mode after deactivation, so we never
> shutdown an active cache. The counter is bumped for every charged page
> and also for every running allocation. So the kmem_cache can't
> be released unless all allocations complete.
>
> To shutdown non-active empty kmem_caches, let's reuse the
> infrastructure of the RCU-delayed work queue, used previously for
> the deactivation. After the generalization, it's perfectly suited
> for our needs.
>
> Since now we can release a kmem_cache at any moment after the
> deactivation, let's call sysfs_slab_remove() only from the shutdown
> path. It makes deactivation path simpler.
>
> Because we don't set the page->mem_cgroup pointer, we need to change
> the way how memcg-level stats is working for slab pages. We can't use
> mod_lruvec_page_state() helpers anymore, so switch over to
> mod_lruvec_state().
>
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
>
>     time find / -name fname-no-exist
>     echo 2 > /proc/sys/vm/drop_caches
>     repeat several times
>
> Results (I've chosen best results in several runs):
>
>         orig       patched
>
> real    0m0.700s   0m0.722s
> user    0m0.114s   0m0.120s
> sys     0m0.317s   0m0.324s
>
> real    0m0.729s   0m0.746s
> user    0m0.110s   0m0.139s
> sys     0m0.320s   0m0.317s
>
> real    0m0.745s   0m0.719s
> user    0m0.108s   0m0.124s
> sys     0m0.320s   0m0.323s
>

You need to re-run the experiment. The numbers are the same as in the
previous version, but the patch has changed a lot.

> So it looks like the difference is not noticeable in this test.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/slab.h |  3 +-
>  mm/memcontrol.c      | 53 +++++++++++++++++++++----------
>  mm/slab.h            | 74 +++++++++++++++++++++++---------------------
>  mm/slab_common.c     | 63 +++++++++++++++++++------------------
>  mm/slub.c            | 12 +------
>  5 files changed, 112 insertions(+), 93 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 47923c173f30..1b54e5f83342 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -16,6 +16,7 @@
>  #include <linux/overflow.h>
>  #include <linux/types.h>
>  #include <linux/workqueue.h>
> +#include <linux/percpu-refcount.h>
>
>
>  /*
> @@ -152,7 +153,6 @@ int kmem_cache_shrink(struct kmem_cache *);
>
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> -void memcg_destroy_kmem_caches(struct mem_cgroup *);
>
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -641,6 +641,7 @@ struct memcg_cache_params {
>                         struct mem_cgroup *memcg;
>                         struct list_head children_node;
>                         struct list_head kmem_caches_node;
> +                       struct percpu_ref refcnt;
>
>                         void (*work_fn)(struct kmem_cache *);
>                         union {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b2c39f187cbb..9b27988c8969 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2610,12 +2610,13 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
>  {
>         struct memcg_kmem_cache_create_work *cw;
>
> +       if (!css_tryget_online(&memcg->css))
> +               return;
> +
>         cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
>         if (!cw)
>                 return;
>
> -       css_get(&memcg->css);
> -
>         cw->memcg = memcg;
>         cw->cachep = cachep;
>         INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
> @@ -2651,20 +2652,35 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>         struct mem_cgroup *memcg;
>         struct kmem_cache *memcg_cachep;
>         int kmemcg_id;
> +       struct memcg_cache_array *arr;
>
>         VM_BUG_ON(!is_root_cache(cachep));
>
>         if (memcg_kmem_bypass())
>                 return cachep;
>
> -       memcg = get_mem_cgroup_from_current();
> +       rcu_read_lock();
> +
> +       if (unlikely(current->active_memcg))
> +               memcg = current->active_memcg;
> +       else
> +               memcg = mem_cgroup_from_task(current);
> +
> +       if (!memcg || memcg == root_mem_cgroup)
> +               goto out_unlock;
> +
>         kmemcg_id = READ_ONCE(memcg->kmemcg_id);
>         if (kmemcg_id < 0)
> -               goto out;
> +               goto out_unlock;
>
> -       memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
> -       if (likely(memcg_cachep))
> -               return memcg_cachep;
> +       arr = rcu_dereference(cachep->memcg_params.memcg_caches);
> +
> +       /*
> +        * Make sure we will access the up-to-date value. The code updating
> +        * memcg_caches issues a write barrier to match this (see
> +        * memcg_create_kmem_cache()).
> +        */
> +       memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
>
>         /*
>          * If we are in a safe context (can wait, and not in interrupt
> @@ -2677,10 +2693,16 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>          * memcg_create_kmem_cache, this means no further allocation
>          * could happen with the slab_mutex held. So it's better to
>          * defer everything.
> +        *
> +        * If the memcg is dying or memcg_cache is about to be released,
> +        * don't bother creating new kmem_caches.
>          */
> -       memcg_schedule_kmem_cache_create(memcg, cachep);
> -out:
> -       css_put(&memcg->css);
> +       if (unlikely(!memcg_cachep))
> +               memcg_schedule_kmem_cache_create(memcg, cachep);
> +       else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))

Why not percpu_ref_tryget_live? Because arr->entries[kmemcg_id] will
be NULL even before percpu_ref_kill(&s->memcg_params.refcnt). I think
a comment would be good.
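
A rough sketch of the difference between the two helpers, assuming the
generic percpu-refcount API (illustrative only, not taken from the patch):

#include <linux/percpu-refcount.h>

/*
 * percpu_ref_tryget() only fails once the count has actually dropped
 * to zero; it still succeeds after percpu_ref_kill() as long as
 * outstanding references keep the count positive.
 */
static bool get_any_ref(struct percpu_ref *ref)
{
	return percpu_ref_tryget(ref);
}

/*
 * percpu_ref_tryget_live() additionally fails as soon as
 * percpu_ref_kill() has marked the ref as dying, even while the
 * count is still positive.
 */
static bool get_live_ref(struct percpu_ref *ref)
{
	return percpu_ref_tryget_live(ref);
}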

> +               cachep = memcg_cachep;
> +out_unlock:
> +       rcu_read_lock();
>         return cachep;
>  }
>
> @@ -2691,7 +2713,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
>  void memcg_kmem_put_cache(struct kmem_cache *cachep)
>  {
>         if (!is_root_cache(cachep))
> -               css_put(&cachep->memcg_params.memcg->css);
> +               percpu_ref_put(&cachep->memcg_params.refcnt);
>  }
>
>  /**
> @@ -2719,9 +2741,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>                 cancel_charge(memcg, nr_pages);
>                 return -ENOMEM;
>         }
> -
> -       page->mem_cgroup = memcg;
> -
>         return 0;
>  }
>
> @@ -2744,8 +2763,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
>         memcg = get_mem_cgroup_from_current();
>         if (!mem_cgroup_is_root(memcg)) {
>                 ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
> -               if (!ret)
> +               if (!ret) {
> +                       page->mem_cgroup = memcg;
>                         __SetPageKmemcg(page);
> +               }
>         }
>         css_put(&memcg->css);
>         return ret;
> @@ -3238,7 +3259,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
>                 memcg_offline_kmem(memcg);
>
>         if (memcg->kmem_state == KMEM_ALLOCATED) {
> -               memcg_destroy_kmem_caches(memcg);
> +               WARN_ON(!list_empty(&memcg->kmem_caches));
>                 static_branch_dec(&memcg_kmem_enabled_key);
>                 WARN_ON(page_counter_read(&memcg->kmem));
>         }
> diff --git a/mm/slab.h b/mm/slab.h
> index c9a31120fa1d..2acc68a7e0a0 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -173,6 +173,7 @@ void __kmem_cache_release(struct kmem_cache *);
>  int __kmem_cache_shrink(struct kmem_cache *);
>  void __kmemcg_cache_deactivate(struct kmem_cache *s);
>  void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
> +void kmemcg_cache_shutdown(struct kmem_cache *s);
>  void slab_kmem_cache_release(struct kmem_cache *);
>
>  struct seq_file;
> @@ -248,31 +249,6 @@ static inline const char *cache_name(struct kmem_cache *s)
>         return s->name;
>  }
>
> -/*
> - * Note, we protect with RCU only the memcg_caches array, not per-memcg caches.
> - * That said the caller must assure the memcg's cache won't go away by either
> - * taking a css reference to the owner cgroup, or holding the slab_mutex.
> - */
> -static inline struct kmem_cache *
> -cache_from_memcg_idx(struct kmem_cache *s, int idx)
> -{
> -       struct kmem_cache *cachep;
> -       struct memcg_cache_array *arr;
> -
> -       rcu_read_lock();
> -       arr = rcu_dereference(s->memcg_params.memcg_caches);
> -
> -       /*
> -        * Make sure we will access the up-to-date value. The code updating
> -        * memcg_caches issues a write barrier to match this (see
> -        * memcg_create_kmem_cache()).
> -        */
> -       cachep = READ_ONCE(arr->entries[idx]);
> -       rcu_read_unlock();
> -
> -       return cachep;
> -}
> -
>  static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
>  {
>         if (is_root_cache(s))

It would be good to add a comment that memcg_charge_slab() can only be
called with memcg (non-root) kmem_caches.

> @@ -284,15 +260,37 @@ static __always_inline int memcg_charge_slab(struct page *page,
>                                              gfp_t gfp, int order,
>                                              struct kmem_cache *s)
>  {
> -       if (is_root_cache(s))
> -               return 0;
> -       return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
> +       struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
> +       int ret;
> +
> +       memcg = s->memcg_params.memcg;
> +       ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
> +       if (ret)
> +               return ret;
> +
> +       lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +       mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
> +
> +       /* transer try_charge() page references to kmem_cache */
> +       percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
> +       css_put_many(&memcg->css, 1 << order);
> +
> +       return 0;
>  }
>
>  static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>                                                 struct kmem_cache *s)
>  {
> -       memcg_kmem_uncharge(page, order);
> +       struct mem_cgroup *memcg;
> +       struct lruvec *lruvec;
> +
> +       memcg = s->memcg_params.memcg;
> +       lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
> +       mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
> +       memcg_kmem_uncharge_memcg(page, order, memcg);
> +
> +       percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
>  }
>
>  extern void slab_init_memcg_params(struct kmem_cache *);
> @@ -362,18 +360,24 @@ static __always_inline int charge_slab_page(struct page *page,
>                                             gfp_t gfp, int order,
>                                             struct kmem_cache *s)
>  {
> -       int ret = memcg_charge_slab(page, gfp, order, s);
> -
> -       if (!ret)
> -               mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order);
> +       if (is_root_cache(s)) {
> +               mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> +                                   1 << order);
> +               return 0;
> +       }
>
> -       return ret;
> +       return memcg_charge_slab(page, gfp, order, s);
>  }
>
>  static __always_inline void uncharge_slab_page(struct page *page, int order,
>                                                struct kmem_cache *s)
>  {
> -       mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order));
> +       if (is_root_cache(s)) {
> +               mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
> +                                   -(1 << order));
> +               return;
> +       }
> +
>         memcg_uncharge_slab(page, order, s);
>  }
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 4e5b4292a763..995920222127 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -45,6 +45,8 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
>  static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
>                     slab_caches_to_rcu_destroy_workfn);
>
> +static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref);
> +
>  /*
>   * Set of flags that will prevent slab merging
>   */
> @@ -145,6 +147,12 @@ static int init_memcg_params(struct kmem_cache *s,
>         struct memcg_cache_array *arr;
>
>         if (root_cache) {
> +               int ret = percpu_ref_init(&s->memcg_params.refcnt,
> +                                         kmemcg_queue_cache_shutdown,
> +                                         0, GFP_KERNEL);
> +               if (ret)
> +                       return ret;
> +
>                 s->memcg_params.root_cache = root_cache;
>                 INIT_LIST_HEAD(&s->memcg_params.children_node);
>                 INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
> @@ -170,6 +178,8 @@ static void destroy_memcg_params(struct kmem_cache *s)
>  {
>         if (is_root_cache(s))
>                 kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
> +       else
> +               percpu_ref_exit(&s->memcg_params.refcnt);
>  }
>
>  static void free_memcg_params(struct rcu_head *rcu)
> @@ -225,6 +235,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
>         if (is_root_cache(s)) {
>                 list_add(&s->root_caches_node, &slab_root_caches);
>         } else {
> +               css_get(&memcg->css);
>                 s->memcg_params.memcg = memcg;
>                 list_add(&s->memcg_params.children_node,
>                          &s->memcg_params.root_cache->memcg_params.children);
> @@ -240,6 +251,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>         } else {
>                 list_del(&s->memcg_params.children_node);
>                 list_del(&s->memcg_params.kmem_caches_node);
> +               css_put(&s->memcg_params.memcg->css);
>         }
>  }
>  #else
> @@ -708,16 +720,13 @@ static void kmemcg_after_rcu_workfn(struct work_struct *work)
>
>         put_online_mems();
>         put_online_cpus();
> -
> -       /* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
> -       css_put(&s->memcg_params.memcg->css);
>  }
>
>  /*
>   * We need to grab blocking locks.  Bounce to ->work.  The
>   * work item shares the space with the RCU head and can't be
> - * initialized eariler.
> -*/
> + * initialized earlier.
> + */
>  static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>  {
>         struct kmem_cache *s = container_of(head, struct kmem_cache,
> @@ -727,9 +736,28 @@ static void kmemcg_schedule_work_after_rcu(struct rcu_head *head)
>         queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
>  }
>
> +static void kmemcg_cache_shutdown_after_rcu(struct kmem_cache *s)
> +{
> +       WARN_ON(shutdown_cache(s));
> +}
> +
> +static void kmemcg_queue_cache_shutdown(struct percpu_ref *percpu_ref)
> +{
> +       struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
> +                                           memcg_params.refcnt);
> +

I think this function should be called with slab_mutex held to
synchronize against flush_memcg_workqueue().


> +       if (s->memcg_params.root_cache->memcg_params.dying)
> +               return;
> +
> +       WARN_ON(s->memcg_params.work_fn);
> +       s->memcg_params.work_fn = kmemcg_cache_shutdown_after_rcu;
> +       call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> +}
> +
>  static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>         __kmemcg_cache_deactivate_after_rcu(s);
> +       percpu_ref_kill(&s->memcg_params.refcnt);
>  }
>
>  static void kmemcg_cache_deactivate(struct kmem_cache *s)
> @@ -739,9 +767,6 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
>         if (s->memcg_params.root_cache->memcg_params.dying)
>                 return;
>
> -       /* pin memcg so that @s doesn't get destroyed in the middle */
> -       css_get(&s->memcg_params.memcg->css);
> -
>         WARN_ON_ONCE(s->memcg_params.work_fn);
>         s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
>         call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
> @@ -775,28 +800,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
>         put_online_cpus();
>  }
>
> -void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
> -{
> -       struct kmem_cache *s, *s2;
> -
> -       get_online_cpus();
> -       get_online_mems();
> -
> -       mutex_lock(&slab_mutex);
> -       list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
> -                                memcg_params.kmem_caches_node) {
> -               /*
> -                * The cgroup is about to be freed and therefore has no charges
> -                * left. Hence, all its caches must be empty by now.
> -                */
> -               BUG_ON(shutdown_cache(s));
> -       }
> -       mutex_unlock(&slab_mutex);
> -
> -       put_online_mems();
> -       put_online_cpus();
> -}
> -
>  static int shutdown_memcg_caches(struct kmem_cache *s)
>  {
>         struct memcg_cache_array *arr;
> diff --git a/mm/slub.c b/mm/slub.c
> index 9ec25a588bdd..e7ce810ebd02 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4022,18 +4022,8 @@ void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
>  {
>         /*
>          * Called with all the locks held after a sched RCU grace period.
> -        * Even if @s becomes empty after shrinking, we can't know that @s
> -        * doesn't have allocations already in-flight and thus can't
> -        * destroy @s until the associated memcg is released.
> -        *
> -        * However, let's remove the sysfs files for empty caches here.
> -        * Each cache has a lot of interface files which aren't
> -        * particularly useful for empty draining caches; otherwise, we can
> -        * easily end up with millions of unnecessary sysfs files on
> -        * systems which have a lot of memory and transient cgroups.
>          */
> -       if (!__kmem_cache_shrink(s))
> -               sysfs_slab_remove(s);
> +       __kmem_cache_shrink(s);
>  }
>
>  void __kmemcg_cache_deactivate(struct kmem_cache *s)
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 6/7] mm: reparent slab memory on cgroup removal
  2019-05-08 20:24 ` [PATCH v3 6/7] mm: reparent slab memory on cgroup removal Roman Gushchin
@ 2019-05-11  0:34   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-05-11  0:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Wed, May 8, 2019 at 1:41 PM
To: Andrew Morton, Shakeel Butt
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
Gushchin

> Let's reparent memcg slab memory on memcg offlining. This allows us
> to release the memory cgroup without waiting for the last outstanding
> kernel object (e.g. dentry used by another application).
>
> So instead of reparenting all accounted slab pages, let's reparent
> a relatively small number of kmem_caches. Reparenting is performed
> as part of the deactivation process.
>
> Since the parent cgroup is already charged, all we need to do is to
> splice the list of kmem_caches onto the parent's kmem_caches list,
> swap the memcg pointer, drop the css refcount for each kmem_cache
> and adjust the parent's css refcount accordingly. Quite simple.
>
> Please note that kmem_cache->memcg_params.memcg isn't a stable
> pointer anymore. It's safe to read it under rcu_read_lock() or
> with slab_mutex held.
>
> We can race with the slab allocation and deallocation paths. It's not
> a big problem: parent's charge and slab global stats are always
> correct, and we don't care anymore about the child usage and global
> stats. The child cgroup is already offline, so we don't use or show it
> anywhere.
>
> Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE)
> aren't used anywhere except count_shadow_nodes(). But even there it
> won't break anything: after reparenting "nodes" will be 0 on child
> level (because we're already reparenting shrinker lists), and on
> parent level page stats always were 0, and this patch won't change
> anything.
>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/slab.h |  4 ++--
>  mm/memcontrol.c      | 14 ++++++++------
>  mm/slab.h            | 14 +++++++++-----
>  mm/slab_common.c     | 23 ++++++++++++++++++++---
>  4 files changed, 39 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 1b54e5f83342..109cab2ad9b4 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -152,7 +152,7 @@ void kmem_cache_destroy(struct kmem_cache *);
>  int kmem_cache_shrink(struct kmem_cache *);
>
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
> -void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> +void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
>
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -638,7 +638,7 @@ struct memcg_cache_params {
>                         bool dying;
>                 };
>                 struct {
> -                       struct mem_cgroup *memcg;
> +                       struct mem_cgroup __rcu *memcg;
>                         struct list_head children_node;
>                         struct list_head kmem_caches_node;
>                         struct percpu_ref refcnt;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9b27988c8969..6e4d9ed16069 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3220,15 +3220,15 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
>          */
>         memcg->kmem_state = KMEM_ALLOCATED;
>
> -       memcg_deactivate_kmem_caches(memcg);
> -
> -       kmemcg_id = memcg->kmemcg_id;
> -       BUG_ON(kmemcg_id < 0);
> -
>         parent = parent_mem_cgroup(memcg);
>         if (!parent)
>                 parent = root_mem_cgroup;
>
> +       memcg_deactivate_kmem_caches(memcg, parent);
> +
> +       kmemcg_id = memcg->kmemcg_id;
> +       BUG_ON(kmemcg_id < 0);
> +
>         /*
>          * Change kmemcg_id of this cgroup and all its descendants to the
>          * parent's id, and then move all entries from this cgroup's list_lrus
> @@ -3261,7 +3261,6 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
>         if (memcg->kmem_state == KMEM_ALLOCATED) {
>                 WARN_ON(!list_empty(&memcg->kmem_caches));
>                 static_branch_dec(&memcg_kmem_enabled_key);
> -               WARN_ON(page_counter_read(&memcg->kmem));
>         }
>  }
>  #else
> @@ -4673,6 +4672,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>
>         /* The following stuff does not apply to the root */
>         if (!parent) {
> +#ifdef CONFIG_MEMCG_KMEM
> +               INIT_LIST_HEAD(&memcg->kmem_caches);
> +#endif
>                 root_mem_cgroup = memcg;
>                 return &memcg->css;
>         }
> diff --git a/mm/slab.h b/mm/slab.h
> index 2acc68a7e0a0..acdc1810639d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -264,10 +264,11 @@ static __always_inline int memcg_charge_slab(struct page *page,
>         struct lruvec *lruvec;
>         int ret;
>
> -       memcg = s->memcg_params.memcg;
> +       rcu_read_lock();
> +       memcg = rcu_dereference(s->memcg_params.memcg);
>         ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);

You cannot call memcg_kmem_charge_memcg() within rcu_read_lock(). You
need to css_tryget_online() first, though I don't know what to do on
failure: return ENOMEM or retry with the parent?
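
Just to illustrate the direction (this is only a sketch, not something
from the patchset): keeping the names from the diff above, the charge
path could pin an online memcg before charging, falling back to an
ancestor if the cache's memcg is already offline. The !memcg case and
the error handling here are my assumptions, not the author's.

static __always_inline int memcg_charge_slab(struct page *page,
					     gfp_t gfp, int order,
					     struct kmem_cache *s)
{
	struct mem_cgroup *memcg;
	struct lruvec *lruvec;
	int ret;

	rcu_read_lock();
	memcg = rcu_dereference(s->memcg_params.memcg);
	/* pin an online memcg; walk up if this one is already offline */
	while (memcg && !css_tryget_online(&memcg->css))
		memcg = parent_mem_cgroup(memcg);
	rcu_read_unlock();

	if (unlikely(!memcg))
		return 0;	/* assumption: treat as an unaccounted allocation */

	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
	if (ret)
		goto out;

	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);

	/* transfer try_charge() page references to the kmem_cache */
	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
	css_put_many(&memcg->css, 1 << order);
out:
	css_put(&memcg->css);	/* drop the reference from css_tryget_online() */
	return ret;
}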

>         if (ret)
> -               return ret;
> +               goto out;
>
>         lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>         mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
> @@ -275,8 +276,9 @@ static __always_inline int memcg_charge_slab(struct page *page,
>         /* transer try_charge() page references to kmem_cache */
>         percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
>         css_put_many(&memcg->css, 1 << order);
> -
> -       return 0;
> +out:
> +       rcu_read_unlock();
> +       return ret;
>  }
>
>  static __always_inline void memcg_uncharge_slab(struct page *page, int order,
> @@ -285,10 +287,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>         struct mem_cgroup *memcg;
>         struct lruvec *lruvec;
>
> -       memcg = s->memcg_params.memcg;
> +       rcu_read_lock();
> +       memcg = rcu_dereference(s->memcg_params.memcg);
>         lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
>         mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
>         memcg_kmem_uncharge_memcg(page, order, memcg);
> +       rcu_read_unlock();
>
>         percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
>  }
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 995920222127..36673a43ed31 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -236,7 +236,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
>                 list_add(&s->root_caches_node, &slab_root_caches);
>         } else {
>                 css_get(&memcg->css);
> -               s->memcg_params.memcg = memcg;
> +               rcu_assign_pointer(s->memcg_params.memcg, memcg);
>                 list_add(&s->memcg_params.children_node,
>                          &s->memcg_params.root_cache->memcg_params.children);
>                 list_add(&s->memcg_params.kmem_caches_node,
> @@ -251,7 +251,8 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>         } else {
>                 list_del(&s->memcg_params.children_node);
>                 list_del(&s->memcg_params.kmem_caches_node);
> -               css_put(&s->memcg_params.memcg->css);
> +               mem_cgroup_put(rcu_dereference_protected(s->memcg_params.memcg,
> +                       lockdep_is_held(&slab_mutex)));
>         }
>  }
>  #else
> @@ -772,11 +773,13 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
>         call_rcu(&s->memcg_params.rcu_head, kmemcg_schedule_work_after_rcu);
>  }
>
> -void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
> +void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg,
> +                                 struct mem_cgroup *parent)
>  {
>         int idx;
>         struct memcg_cache_array *arr;
>         struct kmem_cache *s, *c;
> +       unsigned int nr_reparented;
>
>         idx = memcg_cache_id(memcg);
>
> @@ -794,6 +797,20 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
>                 kmemcg_cache_deactivate(c);
>                 arr->entries[idx] = NULL;
>         }
> +       if (memcg != parent) {

When will the above condition be false? Do we need it?


> +               nr_reparented = 0;
> +               list_for_each_entry(s, &memcg->kmem_caches,
> +                                   memcg_params.kmem_caches_node) {
> +                       rcu_assign_pointer(s->memcg_params.memcg, parent);
> +                       css_put(&memcg->css);
> +                       nr_reparented++;
> +               }
> +               if (nr_reparented) {
> +                       list_splice_init(&memcg->kmem_caches,
> +                                        &parent->kmem_caches);
> +                       css_get_many(&parent->css, nr_reparented);
> +               }
> +       }
>         mutex_unlock(&slab_mutex);
>
>         put_online_mems();
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 7/7] mm: fix /proc/kpagecgroup interface for slab pages
  2019-05-08 20:24 ` [PATCH v3 7/7] mm: fix /proc/kpagecgroup interface for slab pages Roman Gushchin
@ 2019-05-11  0:34   ` Shakeel Butt
  0 siblings, 0 replies; 20+ messages in thread
From: Shakeel Butt @ 2019-05-11  0:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Wed, May 8, 2019 at 1:40 PM
To: Andrew Morton, Shakeel Butt
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
Gushchin

> Switching to an indirect scheme of getting mem_cgroup pointer for
> !root slab pages broke /proc/kpagecgroup interface for them.
>
> Let's fix it by teaching page_cgroup_ino() how to get the memcg
> pointer for slab pages.
>
> Reported-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/memcontrol.c  |  5 ++++-
>  mm/slab.h        | 21 +++++++++++++++++++++
>  mm/slab_common.c |  1 +
>  3 files changed, 26 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6e4d9ed16069..8114838759f6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -494,7 +494,10 @@ ino_t page_cgroup_ino(struct page *page)
>         unsigned long ino = 0;
>
>         rcu_read_lock();
> -       memcg = READ_ONCE(page->mem_cgroup);
> +       if (PageSlab(page))
> +               memcg = memcg_from_slab_page(page);
> +       else
> +               memcg = READ_ONCE(page->mem_cgroup);
>         while (memcg && !(memcg->css.flags & CSS_ONLINE))
>                 memcg = parent_mem_cgroup(memcg);
>         if (memcg)
> diff --git a/mm/slab.h b/mm/slab.h
> index acdc1810639d..cb684fbe2cc2 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -256,6 +256,22 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
>         return s->memcg_params.root_cache;
>  }
>

Can you please document the preconditions of this function? It seems
like the page must be PageSlab(), so why do we need to check PageTail()
and do compound_head()?
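
Purely as an illustration of the kind of contract documentation being
asked for here (the exact preconditions are for the author to state),
something along these lines above the helper:

/**
 * memcg_from_slab_page - get the memcg a slab page is charged to
 * @page: any page belonging to a slab; tail pages are resolved via
 *        compound_head()
 *
 * Must be called under rcu_read_lock(). The returned pointer is only
 * stable for the duration of the RCU read-side critical section.
 * Returns NULL for pages of root (non-memcg) caches.
 */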


> +static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
> +{
> +       struct kmem_cache *s;
> +
> +       WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       if (PageTail(page))
> +               page = compound_head(page);
> +
> +       s = READ_ONCE(page->slab_cache);
> +       if (s && !is_root_cache(s))
> +               return rcu_dereference(s->memcg_params.memcg);
> +
> +       return NULL;
> +}
> +
>  static __always_inline int memcg_charge_slab(struct page *page,
>                                              gfp_t gfp, int order,
>                                              struct kmem_cache *s)
> @@ -338,6 +354,11 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
>         return s;
>  }
>
> +static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
> +{
> +       return NULL;
> +}
> +
>  static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
>                                     struct kmem_cache *s)
>  {
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 36673a43ed31..0cfdad0a0aac 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -253,6 +253,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>                 list_del(&s->memcg_params.kmem_caches_node);
>                 mem_cgroup_put(rcu_dereference_protected(s->memcg_params.memcg,
>                         lockdep_is_held(&slab_mutex)));
> +               rcu_assign_pointer(s->memcg_params.memcg, NULL);
>         }
>  }
>  #else
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting
  2019-05-08 20:24 ` [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting Roman Gushchin
  2019-05-11  0:33   ` Shakeel Butt
@ 2019-05-13 18:01   ` Christopher Lameter
  1 sibling, 0 replies; 20+ messages in thread
From: Christopher Lameter @ 2019-05-13 18:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Shakeel Butt, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Michal Hocko, Rik van Riel, Vladimir Davydov,
	cgroups

On Wed, 8 May 2019, Roman Gushchin wrote:

> Currently the page accounting code is duplicated in SLAB and SLUB
> internals. Let's move it into new (un)charge_slab_page helpers
> in the slab_common.c file. These helpers will be responsible
> for statistics (global and memcg-aware) and memcg charging.
> So they are replacing direct memcg_(un)charge_slab() calls.

Looks good.

Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/7] mm: reparent slab memory on cgroup removal
  2019-05-11  0:32 ` [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Shakeel Butt
@ 2019-05-13 20:21   ` Roman Gushchin
  2019-05-14 19:22     ` Shakeel Butt
  0 siblings, 1 reply; 20+ messages in thread
From: Roman Gushchin @ 2019-05-13 20:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

On Fri, May 10, 2019 at 05:32:15PM -0700, Shakeel Butt wrote:
> From: Roman Gushchin <guro@fb.com>
> Date: Wed, May 8, 2019 at 1:30 PM
> To: Andrew Morton, Shakeel Butt
> Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
> <kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
> Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
> Gushchin
> 
> > # Why do we need this?
> >
> > We've noticed that the number of dying cgroups is steadily growing on most
> > of our hosts in production. The following investigation revealed an issue
> > in userspace memory reclaim code [1], accounting of kernel stacks [2],
> > and also the mainreason: slab objects.
> >
> > The underlying problem is quite simple: any page charged
> > to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> > all charged pages are gone. If a slab object is actively used by other cgroups,
> > it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
> >
> > Slab objects, and first of all vfs cache, is shared between cgroups, which are
> > using the same underlying fs, and what's even more important, it's shared
> > between multiple generations of the same workload. So if something is running
> > periodically every time in a new cgroup (like how systemd works), we do
> > accumulate multiple dying cgroups.
> >
> > Strictly speaking pagecache isn't different here, but there is a key difference:
> > we disable protection and apply some extra pressure on LRUs of dying cgroups,
> 
> How do you apply extra pressure on dying cgroups? cgroup-v2 does not
> have memory.force_empty.

I mean the following part of get_scan_count():
	/*
	 * If the cgroup's already been deleted, make sure to
	 * scrape out the remaining cache.
	 */
	if (!scan && !mem_cgroup_online(memcg))
		scan = min(lruvec_size, SWAP_CLUSTER_MAX);

It seems to work well, so that pagecache alone doesn't pin too many
dying cgroups. The price we're paying is some excessive IO here,
which could be avoided if we were able to recharge the pagecache.


Btw, thank you very much for looking into the patchset. I'll address
all comments and send v4 soon.

Thanks!

Roman


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/7] mm: reparent slab memory on cgroup removal
  2019-05-13 20:21   ` Roman Gushchin
@ 2019-05-14 19:22     ` Shakeel Butt
  2019-05-14 20:04       ` Roman Gushchin
  0 siblings, 1 reply; 20+ messages in thread
From: Shakeel Butt @ 2019-05-14 19:22 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

From: Roman Gushchin <guro@fb.com>
Date: Mon, May 13, 2019 at 1:22 PM
To: Shakeel Butt
Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
Cgroups

> On Fri, May 10, 2019 at 05:32:15PM -0700, Shakeel Butt wrote:
> > From: Roman Gushchin <guro@fb.com>
> > Date: Wed, May 8, 2019 at 1:30 PM
> > To: Andrew Morton, Shakeel Butt
> > Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
> > <kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
> > Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
> > Gushchin
> >
> > > # Why do we need this?
> > >
> > > We've noticed that the number of dying cgroups is steadily growing on most
> > > of our hosts in production. The following investigation revealed an issue
> > > in userspace memory reclaim code [1], accounting of kernel stacks [2],
> > > and also the mainreason: slab objects.
> > >
> > > The underlying problem is quite simple: any page charged
> > > to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> > > all charged pages are gone. If a slab object is actively used by other cgroups,
> > > it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
> > >
> > > Slab objects, and first of all vfs cache, is shared between cgroups, which are
> > > using the same underlying fs, and what's even more important, it's shared
> > > between multiple generations of the same workload. So if something is running
> > > periodically every time in a new cgroup (like how systemd works), we do
> > > accumulate multiple dying cgroups.
> > >
> > > Strictly speaking pagecache isn't different here, but there is a key difference:
> > > we disable protection and apply some extra pressure on LRUs of dying cgroups,
> >
> > How do you apply extra pressure on dying cgroups? cgroup-v2 does not
> > have memory.force_empty.
>
> I mean the following part of get_scan_count():
>         /*
>          * If the cgroup's already been deleted, make sure to
>          * scrape out the remaining cache.
>          */
>         if (!scan && !mem_cgroup_online(memcg))
>                 scan = min(lruvec_size, SWAP_CLUSTER_MAX);
>
> It seems to work well, so that pagecache alone doesn't pin too many
> dying cgroups. The price we're paying is some excessive IO here,

Thanks for the explanation. However, for this to work something still
needs to trigger memory pressure; until then we will keep the zombies
around. BTW, get_scan_count() is getting really creepy. It needs a
refactor soon.

> which could be avoided if we were able to recharge the pagecache.
>

Are you looking into this? Do you envision a mount option which will
mark the filesystem as shared and trigger recharging on the offlining
of the origin memcg?

> Btw, thank you very much for looking into the patchset. I'll address
> all comments and send v4 soon.
>

You are most welcome.

thanks,
Shakeel


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 0/7] mm: reparent slab memory on cgroup removal
  2019-05-14 19:22     ` Shakeel Butt
@ 2019-05-14 20:04       ` Roman Gushchin
  0 siblings, 0 replies; 20+ messages in thread
From: Roman Gushchin @ 2019-05-14 20:04 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
	Cgroups

On Tue, May 14, 2019 at 12:22:08PM -0700, Shakeel Butt wrote:
> From: Roman Gushchin <guro@fb.com>
> Date: Mon, May 13, 2019 at 1:22 PM
> To: Shakeel Butt
> Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
> Michal Hocko, Rik van Riel, Christoph Lameter, Vladimir Davydov,
> Cgroups
> 
> > On Fri, May 10, 2019 at 05:32:15PM -0700, Shakeel Butt wrote:
> > > From: Roman Gushchin <guro@fb.com>
> > > Date: Wed, May 8, 2019 at 1:30 PM
> > > To: Andrew Morton, Shakeel Butt
> > > Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
> > > <kernel-team@fb.com>, Johannes Weiner, Michal Hocko, Rik van Riel,
> > > Christoph Lameter, Vladimir Davydov, <cgroups@vger.kernel.org>, Roman
> > > Gushchin
> > >
> > > > # Why do we need this?
> > > >
> > > > We've noticed that the number of dying cgroups is steadily growing on most
> > > > of our hosts in production. The following investigation revealed an issue
> > > > in userspace memory reclaim code [1], accounting of kernel stacks [2],
> > > > and also the mainreason: slab objects.
> > > >
> > > > The underlying problem is quite simple: any page charged
> > > > to a cgroup holds a reference to it, so the cgroup can't be reclaimed unless
> > > > all charged pages are gone. If a slab object is actively used by other cgroups,
> > > > it won't be reclaimed, and will prevent the origin cgroup from being reclaimed.
> > > >
> > > > Slab objects, and first of all vfs cache, is shared between cgroups, which are
> > > > using the same underlying fs, and what's even more important, it's shared
> > > > between multiple generations of the same workload. So if something is running
> > > > periodically every time in a new cgroup (like how systemd works), we do
> > > > accumulate multiple dying cgroups.
> > > >
> > > > Strictly speaking pagecache isn't different here, but there is a key difference:
> > > > we disable protection and apply some extra pressure on LRUs of dying cgroups,
> > >
> > > How do you apply extra pressure on dying cgroups? cgroup-v2 does not
> > > have memory.force_empty.
> >
> > I mean the following part of get_scan_count():
> >         /*
> >          * If the cgroup's already been deleted, make sure to
> >          * scrape out the remaining cache.
> >          */
> >         if (!scan && !mem_cgroup_online(memcg))
> >                 scan = min(lruvec_size, SWAP_CLUSTER_MAX);
> >
> > It seems to work well, so that pagecache alone doesn't pin too many
> > dying cgroups. The price we're paying is some excessive IO here,
> 
> Thanks for the explanation. However, for this to work something still
> needs to trigger memory pressure; until then we will keep the zombies
> around. BTW, get_scan_count() is getting really creepy. It needs a
> refactor soon.

Sure, but that's true for all sorts of memory.
Re get_scan_count(): for sure, yeah, it's way too hairy now.

> 
> > which could be avoided if we were able to recharge the pagecache.
> >
> 
> Are you looking into this? Do you envision a mount option which will
> mark the filesystem as shared and trigger recharging on the offlining
> of the origin memcg?

Not really working on it now, but thinking about what to do here long-term.
One of the ideas I have (just an idea for now) is to move the memcg pointer
from individual pages to the inode level. It could bring more opportunities
in terms of recharging and reparenting, but I'm not sure how complex it is
and what the possible downsides are.
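
To make that a bit more concrete, here is a purely hypothetical sketch
of what "memcg pointer at the inode level" could roughly look like. It
is not kernel code and not part of any patch; the field and the helper
are made up for illustration only:

/* hypothetical: track the charge target per address_space instead of
 * per page, so recharging/reparenting touches one object per inode
 */
struct address_space {
	/* ... existing fields ... */
	struct mem_cgroup __rcu	*memcg;
};

/* hypothetical helper: resolve a pagecache page's memcg via its inode */
static inline struct mem_cgroup *mapping_memcg(struct address_space *mapping)
{
	return rcu_dereference(mapping->memcg);
}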

Do you have any plans or ideas here?

> 
> > Btw, thank you very much for looking into the patchset. I'll address
> > all comments and send v4 soon.
> >
> 
> You are most welcome.

Thanks!


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-05-14 20:04 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-08 20:24 [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Roman Gushchin
2019-05-08 20:24 ` [PATCH v3 1/7] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
2019-05-11  0:32   ` Shakeel Butt
2019-05-08 20:24 ` [PATCH v3 2/7] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
2019-05-11  0:33   ` Shakeel Butt
2019-05-08 20:24 ` [PATCH v3 3/7] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
2019-05-11  0:33   ` Shakeel Butt
2019-05-08 20:24 ` [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting Roman Gushchin
2019-05-11  0:33   ` Shakeel Butt
2019-05-13 18:01   ` Christopher Lameter
2019-05-08 20:24 ` [PATCH v3 5/7] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
2019-05-11  0:33   ` Shakeel Butt
2019-05-08 20:24 ` [PATCH v3 6/7] mm: reparent slab memory on cgroup removal Roman Gushchin
2019-05-11  0:34   ` Shakeel Butt
2019-05-08 20:24 ` [PATCH v3 7/7] mm: fix /proc/kpagecgroup interface for slab pages Roman Gushchin
2019-05-11  0:34   ` Shakeel Butt
2019-05-11  0:32 ` [PATCH v3 0/7] mm: reparent slab memory on cgroup removal Shakeel Butt
2019-05-13 20:21   ` Roman Gushchin
2019-05-14 19:22     ` Shakeel Butt
2019-05-14 20:04       ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).