linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 00/10] mm: reparent slab memory on cgroup removal
@ 2019-06-05  2:44 Roman Gushchin
  2019-06-05  2:44 ` [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer Roman Gushchin
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

# Why do we need this?

We've noticed that the number of dying cgroups is steadily growing on most
of our hosts in production. The subsequent investigation revealed an issue
in the userspace memory reclaim code [1], an issue in the accounting of
kernel stacks [2], and also the main reason: slab objects.

The underlying problem is quite simple: any page charged to a cgroup holds
a reference to it, so the cgroup can't be released until all charged pages
are gone. If a slab object is actively used by other cgroups, it won't be
reclaimed and will prevent the original cgroup from being released.

Slab objects, and first of all the vfs cache, are shared between cgroups
that use the same underlying filesystem and, even more importantly, between
multiple generations of the same workload. So if something runs
periodically, each time in a new cgroup (as systemd does), dying cgroups
accumulate.

Strictly speaking the pagecache isn't different here, but there is a key
distinction: we disable protection and apply extra pressure on the LRUs of
dying cgroups, and these LRUs contain all charged pages.
My experiments show that with kernel memory accounting disabled the number
of dying cgroups stabilizes at a relatively small value (~100, depending on
memory pressure and the cgroup creation rate), while with kernel memory
accounting enabled it grows pretty steadily up to several thousand.

Memory cgroups are quite complex and big objects (mostly due to percpu
stats), so this leads to noticeable memory losses. The memory occupied by
dying cgroups is measured in hundreds of megabytes; I've even seen a host
with more than 100 GB of memory wasted on dying cgroups. Performance
degrades with uptime, and this generally limits the usage of cgroups.

My previous attempt [3] to fix the problem by applying extra pressure on
slab shrinker lists caused regressions with xfs and ext4 and has been
reverted [4]. The following attempts to find the right balance [5, 6] were
not successful.

So instead of trying to find a balance that may not exist, let's reparent
the accounted slabs to the parent cgroup on cgroup removal.


# Implementation approach

There is, however, a significant problem with reparenting slab memory:
there is no list of charged pages. Some of them are on shrinker lists, but
not all. Introducing a new list is really not an option.

But fortunately there is a way forward: every slab page has a stable pointer
to the corresponding kmem_cache. So the idea is to reparent kmem_caches
instead of slab pages.

It's actually simpler and cheaper, but requires some underlying changes:
1) Make kmem_caches hold a single reference to the memory cgroup,
   instead of a separate reference for every slab page.
2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
   page->kmem_cache->memcg indirection instead. It's used only on
   slab page release, so it shouldn't be a big issue.
3) Introduce a refcounter for non-root slab caches. It's required to
   be able to destroy kmem_caches when they become empty and release
   the associated memory cgroup.

There is a bonus: currently we do release empty kmem_caches on cgroup
removal; however, all others wait for the release of the memory cgroup
itself. With these refactorings, kmem_caches can be released as soon as
they become inactive and free.

Some additional implementation details are provided in corresponding
commit messages.
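
For illustration, here is a rough sketch of the resulting pointer and
reference structure (not code lifted verbatim from the series; the lookup
below mirrors the memcg_from_slab_page() helper introduced in patch 09):

  /*
   * before: every charged slab page pins the memcg directly:
   *             page->mem_cgroup                (one css ref per page)
   *
   * after:  the memcg is reached and pinned via the kmem_cache:
   *             page->slab_cache->memcg_params.memcg
   *         with a single css ref per cache, while the kmem_cache itself
   *         is kept alive by a percpu refcounter bumped for every charged
   *         page and every running allocation.
   */
  static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
  {
          struct kmem_cache *s = READ_ONCE(page->slab_cache);

          if (s && !is_root_cache(s))
                  return s->memcg_params.memcg;

          return NULL;
  }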


# Results

Below is the average number of dying cgroups on two groups of our production
hosts. They run a web frontend workload under moderate memory pressure.
As we can see, with kernel memory reparenting the number stabilizes in
the 60s range, while with the original kernel it grows almost linearly and
doesn't show any signs of plateauing. The difference in slab and percpu
usage between the patched and unpatched versions also grows linearly;
in 7 days it exceeded 200 MB.

day           0    1    2    3    4    5    6    7
original     56  362  628  752 1070 1250 1490 1560
patched      23   46   51   55   60   57   67   69
mem diff(MB) 22   74  123  152  164  182  214  241


# History

v6:
  1) split biggest patches into parts to make the review easier
  2) changed synchronization around the dying flag
  3) sysfs entry removal on deactivation is back
  4) got rid of redundant rcu wait on kmem_cache release
  5) fixed getting memcg pointer in mem_cgroup_from_kmem()
  6) fixed missed smp_rmb()
  7) removed redundant CONFIG_SLOB
  8) some renames and cosmetic fixes

v5:
  1) fixed a compilation warning around missing kmemcg_queue_cache_shutdown()
  2) s/rcu_read_lock()/rcu_read_unlock() in memcg_kmem_get_cache()

v4:
  1) removed excessive memcg != parent check in memcg_deactivate_kmem_caches()
  2) fixed rcu_read_lock() usage in memcg_charge_slab()
  3) fixed synchronization around dying flag in kmemcg_queue_cache_shutdown()
  4) refreshed test results data
  5) reworked PageTail() checks in memcg_from_slab_page()
  6) added some comments in multiple places

v3:
  1) reworked memcg kmem_cache search on allocation path
  2) fixed /proc/kpagecgroup interface

v2:
  1) switched to percpu kmem_cache refcounter
  2) a reference to kmem_cache is held during the allocation
  3) slabs stats are fixed for !MEMCG case (and the refactoring
     is separated into a standalone patch)
  4) kmem_cache reparenting is performed from deactivation context

v1:
  https://lkml.org/lkml/2019/4/17/1095


# Links

[1]: commit 68600f623d69 ("mm: don't miss the last page because of
round-off error")
[2]: commit 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
[3]: commit 172b06c32b94 ("mm: slowly shrink slabs with a relatively
small number of objects")
[4]: commit a9a238e83fbb ("Revert "mm: slowly shrink slabs
with a relatively small number of objects"")
[5]: https://lkml.org/lkml/2019/1/28/1865
[6]: https://marc.info/?l=linux-mm&m=155064763626437&w=2


Roman Gushchin (10):
  mm: add missing smp read barrier on getting memcg kmem_cache pointer
  mm: postpone kmem_cache memcg pointer initialization to
    memcg_link_cache()
  mm: rename slab delayed deactivation functions and fields
  mm: generalize postponed non-root kmem_cache deactivation
  mm: introduce __memcg_kmem_uncharge_memcg()
  mm: unify SLAB and SLUB page accounting
  mm: synchronize access to kmem_cache dying flag using a spinlock
  mm: rework non-root kmem_cache lifecycle management
  mm: stop setting page->mem_cgroup pointer for slab pages
  mm: reparent slab memory on cgroup removal

 include/linux/memcontrol.h |  10 +++
 include/linux/slab.h       |  13 +--
 mm/list_lru.c              |  11 ++-
 mm/memcontrol.c            | 102 +++++++++++++++-------
 mm/slab.c                  |  25 ++----
 mm/slab.h                  | 140 +++++++++++++++++++++---------
 mm/slab_common.c           | 170 ++++++++++++++++++++++---------------
 mm/slub.c                  |  24 +-----
 8 files changed, 312 insertions(+), 183 deletions(-)

-- 
2.20.1



* [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-05  4:35   ` Shakeel Butt
                     ` (2 more replies)
  2019-06-05  2:44 ` [PATCH v6 02/10] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Johannes noticed that reading the memcg kmem_cache pointer in
cache_from_memcg_idx() is performed using the READ_ONCE() macro,
which doesn't imply an SMP barrier, although one is required
by the logic.

Add a proper smp_rmb() to be paired with the smp_wmb() in
memcg_create_kmem_cache().

The same applies to memcg_create_kmem_cache() itself,
which reads the same value without any barrier or READ_ONCE().
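
A condensed sketch of the intended pairing (init_memcg_kmem_cache() is
just a placeholder for the initialization actually performed in
memcg_create_kmem_cache()):

  /* writer: memcg_create_kmem_cache() */
  init_memcg_kmem_cache(s);       /* fully initialize the new cache */
  smp_wmb();                      /* publish it only after the init */
  arr->entries[idx] = s;

  /* reader: cache_from_memcg_idx() */
  smp_rmb();                      /* pairs with the smp_wmb() above */
  cachep = READ_ONCE(arr->entries[idx]);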

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.h        | 1 +
 mm/slab_common.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/slab.h b/mm/slab.h
index 739099af6cbb..1176b61bb8fc 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -260,6 +260,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
 	 * memcg_caches issues a write barrier to match this (see
 	 * memcg_create_kmem_cache()).
 	 */
+	smp_rmb();
 	cachep = READ_ONCE(arr->entries[idx]);
 	rcu_read_unlock();
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..8092bdfc05d5 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -652,7 +652,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	 * allocation (see memcg_kmem_get_cache()), several threads can try to
 	 * create the same cache, but only one of them may succeed.
 	 */
-	if (arr->entries[idx])
+	smp_rmb();
+	if (READ_ONCE(arr->entries[idx]))
 		goto out_unlock;
 
 	cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
-- 
2.20.1



* [PATCH v6 02/10] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
  2019-06-05  2:44 ` [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-05  2:44 ` [PATCH v6 03/10] mm: rename slab delayed deactivation functions and fields Roman Gushchin
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Initialize the kmem_cache->memcg_params.memcg pointer in
memcg_link_cache() rather than in init_memcg_params().

Once the kmem_cache holds a reference to the memory cgroup,
this will simplify the refcounting.

For non-root kmem_caches, memcg_link_cache() is always called
before the kmem_cache becomes visible to users, so it's safe.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/slab.c        |  2 +-
 mm/slab.h        |  5 +++--
 mm/slab_common.c | 14 +++++++-------
 mm/slub.c        |  2 +-
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 9e3eee5568b6..a4091f8b3655 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1239,7 +1239,7 @@ void __init kmem_cache_init(void)
 				  nr_node_ids * sizeof(struct kmem_cache_node *),
 				  SLAB_HWCACHE_ALIGN, 0, 0);
 	list_add(&kmem_cache->list, &slab_caches);
-	memcg_link_cache(kmem_cache);
+	memcg_link_cache(kmem_cache, NULL);
 	slab_state = PARTIAL;
 
 	/*
diff --git a/mm/slab.h b/mm/slab.h
index 1176b61bb8fc..c16e5af0fb59 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -290,7 +290,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
-extern void memcg_link_cache(struct kmem_cache *s);
+extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
 				void (*deact_fn)(struct kmem_cache *));
 
@@ -345,7 +345,8 @@ static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void memcg_link_cache(struct kmem_cache *s)
+static inline void memcg_link_cache(struct kmem_cache *s,
+				    struct mem_cgroup *memcg)
 {
 }
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8092bdfc05d5..77df6029de8e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -140,13 +140,12 @@ void slab_init_memcg_params(struct kmem_cache *s)
 }
 
 static int init_memcg_params(struct kmem_cache *s,
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+			     struct kmem_cache *root_cache)
 {
 	struct memcg_cache_array *arr;
 
 	if (root_cache) {
 		s->memcg_params.root_cache = root_cache;
-		s->memcg_params.memcg = memcg;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
 		return 0;
@@ -221,11 +220,12 @@ int memcg_update_all_caches(int num_memcgs)
 	return ret;
 }
 
-void memcg_link_cache(struct kmem_cache *s)
+void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 {
 	if (is_root_cache(s)) {
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
+		s->memcg_params.memcg = memcg;
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
 		list_add(&s->memcg_params.kmem_caches_node,
@@ -244,7 +244,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
-		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
+				    struct kmem_cache *root_cache)
 {
 	return 0;
 }
@@ -384,7 +384,7 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
-	err = init_memcg_params(s, memcg, root_cache);
+	err = init_memcg_params(s, root_cache);
 	if (err)
 		goto out_free_cache;
 
@@ -394,7 +394,7 @@ static struct kmem_cache *create_cache(const char *name,
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, memcg);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -998,7 +998,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name,
 
 	create_boot_cache(s, name, size, flags, useroffset, usersize);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, NULL);
 	s->refcount = 1;
 	return s;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 1802c87799ff..9cb2eef62a37 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4213,7 +4213,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
-	memcg_link_cache(s);
+	memcg_link_cache(s, NULL);
 	return s;
 }
 
-- 
2.20.1



* [PATCH v6 03/10] mm: rename slab delayed deactivation functions and fields
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
  2019-06-05  2:44 ` [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer Roman Gushchin
  2019-06-05  2:44 ` [PATCH v6 02/10] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-09 12:13   ` Vladimir Davydov
  2019-06-05  2:44 ` [PATCH v6 04/10] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

The delayed work/rcu deactivation infrastructure of non-root
kmem_caches can also be used for the asynchronous release of these
objects. Let's get rid of the word "deactivation" in the corresponding
names to make the code look better after the generalization.

It's easier to do the renaming first, so that the generalized
code looks consistent from the start.

Let's rename struct memcg_cache_params fields:
  deact_fn -> work_fn
  deact_rcu_head -> rcu_head
  deact_work -> work

And RCU/delayed work callbacks in slab common code:
  kmemcg_deactivate_rcufn -> kmemcg_rcufn
  kmemcg_deactivate_workfn -> kmemcg_workfn

This patch contains no functional changes, only renamings.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slab.h |  6 +++---
 mm/slab.h            |  2 +-
 mm/slab_common.c     | 30 +++++++++++++++---------------
 3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..47923c173f30 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -642,10 +642,10 @@ struct memcg_cache_params {
 			struct list_head children_node;
 			struct list_head kmem_caches_node;
 
-			void (*deact_fn)(struct kmem_cache *);
+			void (*work_fn)(struct kmem_cache *);
 			union {
-				struct rcu_head deact_rcu_head;
-				struct work_struct deact_work;
+				struct rcu_head rcu_head;
+				struct work_struct work;
 			};
 		};
 	};
diff --git a/mm/slab.h b/mm/slab.h
index c16e5af0fb59..8ff90f42548a 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -292,7 +292,7 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
 extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
-				void (*deact_fn)(struct kmem_cache *));
+				void (*work_fn)(struct kmem_cache *));
 
 #else /* CONFIG_MEMCG_KMEM */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 77df6029de8e..d019ee66bdc4 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -692,17 +692,17 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
-static void kmemcg_deactivate_workfn(struct work_struct *work)
+static void kmemcg_workfn(struct work_struct *work)
 {
 	struct kmem_cache *s = container_of(work, struct kmem_cache,
-					    memcg_params.deact_work);
+					    memcg_params.work);
 
 	get_online_cpus();
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
 
-	s->memcg_params.deact_fn(s);
+	s->memcg_params.work_fn(s);
 
 	mutex_unlock(&slab_mutex);
 
@@ -713,36 +713,36 @@ static void kmemcg_deactivate_workfn(struct work_struct *work)
 	css_put(&s->memcg_params.memcg->css);
 }
 
-static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+static void kmemcg_rcufn(struct rcu_head *head)
 {
 	struct kmem_cache *s = container_of(head, struct kmem_cache,
-					    memcg_params.deact_rcu_head);
+					    memcg_params.rcu_head);
 
 	/*
-	 * We need to grab blocking locks.  Bounce to ->deact_work.  The
+	 * We need to grab blocking locks.  Bounce to ->work.  The
 	 * work item shares the space with the RCU head and can't be
 	 * initialized eariler.
 	 */
-	INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
-	queue_work(memcg_kmem_cache_wq, &s->memcg_params.deact_work);
+	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
+	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
 }
 
 /**
  * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
  *					   sched RCU grace period
  * @s: target kmem_cache
- * @deact_fn: deactivation function to call
+ * @work_fn: deactivation function to call
  *
- * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
+ * Schedule @work_fn to be invoked with online cpus, mems and slab_mutex
  * held after a sched RCU grace period.  The slab is guaranteed to stay
- * alive until @deact_fn is finished.  This is to be used from
+ * alive until @work_fn is finished.  This is to be used from
  * __kmemcg_cache_deactivate().
  */
 void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
-					   void (*deact_fn)(struct kmem_cache *))
+					   void (*work_fn)(struct kmem_cache *))
 {
 	if (WARN_ON_ONCE(is_root_cache(s)) ||
-	    WARN_ON_ONCE(s->memcg_params.deact_fn))
+	    WARN_ON_ONCE(s->memcg_params.work_fn))
 		return;
 
 	if (s->memcg_params.root_cache->memcg_params.dying)
@@ -751,8 +751,8 @@ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
 	/* pin memcg so that @s doesn't get destroyed in the middle */
 	css_get(&s->memcg_params.memcg->css);
 
-	s->memcg_params.deact_fn = deact_fn;
-	call_rcu(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
+	s->memcg_params.work_fn = work_fn;
+	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
 }
 
 void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
-- 
2.20.1



* [PATCH v6 04/10] mm: generalize postponed non-root kmem_cache deactivation
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (2 preceding siblings ...)
  2019-06-05  2:44 ` [PATCH v6 03/10] mm: rename slab delayed deactivation functions and fields Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-09 12:23   ` Vladimir Davydov
  2019-06-05  2:44 ` [PATCH v6 05/10] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Currently SLUB uses a work scheduled after an RCU grace period
to deactivate a non-root kmem_cache. This mechanism can be reused
for the release of kmem_caches, but requires generalization for the
SLAB case.

Introduce a kmemcg_cache_deactivate() function, which calls the
allocator-specific __kmemcg_cache_deactivate() and schedules
execution of __kmemcg_cache_deactivate_after_rcu() with all
necessary locks in a worker context after an RCU grace period.

Here is the new calling scheme:
  kmemcg_cache_deactivate()
    __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
    kmemcg_rcufn()                               rcu
      kmemcg_workfn()                            work
        __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific

instead of:
  __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
    slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
      kmemcg_rcufn()                             rcu
        kmemcg_workfn()                          work
          kmemcg_cache_deact_after_rcu()         SLUB-only

For consistency, all allocator-specific functions start with "__".

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab.c        |  4 ++++
 mm/slab.h        |  3 +--
 mm/slab_common.c | 27 ++++++++-------------------
 mm/slub.c        |  8 +-------
 4 files changed, 14 insertions(+), 28 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index a4091f8b3655..4b865393ebb4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2252,6 +2252,10 @@ void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
 {
 	__kmem_cache_shrink(cachep);
 }
+
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
+{
+}
 #endif
 
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index 8ff90f42548a..d35d85794247 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -172,6 +172,7 @@ int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
 int __kmem_cache_shrink(struct kmem_cache *);
 void __kmemcg_cache_deactivate(struct kmem_cache *s);
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
@@ -291,8 +292,6 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg);
-extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
-				void (*work_fn)(struct kmem_cache *));
 
 #else /* CONFIG_MEMCG_KMEM */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index d019ee66bdc4..09b26673b63f 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -709,7 +709,7 @@ static void kmemcg_workfn(struct work_struct *work)
 	put_online_mems();
 	put_online_cpus();
 
-	/* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
+	/* done, put the ref from kmemcg_cache_deactivate() */
 	css_put(&s->memcg_params.memcg->css);
 }
 
@@ -727,31 +727,21 @@ static void kmemcg_rcufn(struct rcu_head *head)
 	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
 }
 
-/**
- * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
- *					   sched RCU grace period
- * @s: target kmem_cache
- * @work_fn: deactivation function to call
- *
- * Schedule @work_fn to be invoked with online cpus, mems and slab_mutex
- * held after a sched RCU grace period.  The slab is guaranteed to stay
- * alive until @work_fn is finished.  This is to be used from
- * __kmemcg_cache_deactivate().
- */
-void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
-					   void (*work_fn)(struct kmem_cache *))
+static void kmemcg_cache_deactivate(struct kmem_cache *s)
 {
 	if (WARN_ON_ONCE(is_root_cache(s)) ||
 	    WARN_ON_ONCE(s->memcg_params.work_fn))
 		return;
 
+	__kmemcg_cache_deactivate(s);
+
 	if (s->memcg_params.root_cache->memcg_params.dying)
 		return;
 
 	/* pin memcg so that @s doesn't get destroyed in the middle */
 	css_get(&s->memcg_params.memcg->css);
 
-	s->memcg_params.work_fn = work_fn;
+	s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
 	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
 }
 
@@ -774,7 +764,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		if (!c)
 			continue;
 
-		__kmemcg_cache_deactivate(c);
+		kmemcg_cache_deactivate(c);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
@@ -867,11 +857,10 @@ static void flush_memcg_workqueue(struct kmem_cache *s)
 	mutex_unlock(&slab_mutex);
 
 	/*
-	 * SLUB deactivates the kmem_caches through call_rcu. Make
+	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
 	 * sure all registered rcu callbacks have been invoked.
 	 */
-	if (IS_ENABLED(CONFIG_SLUB))
-		rcu_barrier();
+	rcu_barrier();
 
 	/*
 	 * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB
diff --git a/mm/slub.c b/mm/slub.c
index 9cb2eef62a37..ae3b1e49ecec 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4022,7 +4022,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG
-static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
+void __kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
 {
 	/*
 	 * Called with all the locks held after a sched RCU grace period.
@@ -4048,12 +4048,6 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
 	 */
 	slub_set_cpu_partial(s, 0);
 	s->min_partial = 0;
-
-	/*
-	 * s->cpu_partial is checked locklessly (see put_cpu_partial), so
-	 * we have to make sure the change is visible before shrinking.
-	 */
-	slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
 }
 #endif	/* CONFIG_MEMCG */
 
-- 
2.20.1



* [PATCH v6 05/10] mm: introduce __memcg_kmem_uncharge_memcg()
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (3 preceding siblings ...)
  2019-06-05  2:44 ` [PATCH v6 04/10] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-09 12:29   ` Vladimir Davydov
  2019-06-05  2:44 ` [PATCH v6 06/10] mm: unify SLAB and SLUB page accounting Roman Gushchin
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Let's separate the page counter modification code out of
__memcg_kmem_uncharge() in a way similar to how
__memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.

This will allow reusing this code later via a new
memcg_kmem_uncharge_memcg() wrapper, which calls
__memcg_kmem_uncharge_memcg() if the memcg_kmem_enabled()
check passes.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/memcontrol.h | 10 ++++++++++
 mm/memcontrol.c            | 25 +++++++++++++++++--------
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3ca57bacfdd2..9abf31bbe53a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1304,6 +1304,8 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge(struct page *page, int order);
 int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 			      struct mem_cgroup *memcg);
+void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+				 unsigned int nr_pages);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 extern struct workqueue_struct *memcg_kmem_cache_wq;
@@ -1345,6 +1347,14 @@ static inline int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp,
 		return __memcg_kmem_charge_memcg(page, gfp, order, memcg);
 	return 0;
 }
+
+static inline void memcg_kmem_uncharge_memcg(struct page *page, int order,
+					     struct mem_cgroup *memcg)
+{
+	if (memcg_kmem_enabled())
+		__memcg_kmem_uncharge_memcg(memcg, 1 << order);
+}
+
 /*
  * helper for accessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdb66871cdec..3427396da612 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2731,6 +2731,22 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 	css_put(&memcg->css);
 	return ret;
 }
+
+/**
+ * __memcg_kmem_uncharge_memcg: uncharge a kmem page
+ * @memcg: memcg to uncharge
+ * @nr_pages: number of pages to uncharge
+ */
+void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg,
+				 unsigned int nr_pages)
+{
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		page_counter_uncharge(&memcg->kmem, nr_pages);
+
+	page_counter_uncharge(&memcg->memory, nr_pages);
+	if (do_memsw_account())
+		page_counter_uncharge(&memcg->memsw, nr_pages);
+}
 /**
  * __memcg_kmem_uncharge: uncharge a kmem page
  * @page: page to uncharge
@@ -2745,14 +2761,7 @@ void __memcg_kmem_uncharge(struct page *page, int order)
 		return;
 
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		page_counter_uncharge(&memcg->kmem, nr_pages);
-
-	page_counter_uncharge(&memcg->memory, nr_pages);
-	if (do_memsw_account())
-		page_counter_uncharge(&memcg->memsw, nr_pages);
-
+	__memcg_kmem_uncharge_memcg(memcg, nr_pages);
 	page->mem_cgroup = NULL;
 
 	/* slab pages do not have PageKmemcg flag set */
-- 
2.20.1



* [PATCH v6 06/10] mm: unify SLAB and SLUB page accounting
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (4 preceding siblings ...)
  2019-06-05  2:44 ` [PATCH v6 05/10] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-05  2:44 ` [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock Roman Gushchin
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Currently the page accounting code is duplicated in the SLAB and SLUB
internals. Let's move it into the new (un)charge_slab_page helpers in
the slab.h header. These helpers are responsible for statistics
(global and memcg-aware) and for memcg charging, and they replace the
direct memcg_(un)charge_slab() calls.

Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/slab.c | 19 +++----------------
 mm/slab.h | 25 +++++++++++++++++++++++++
 mm/slub.c | 14 ++------------
 3 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 4b865393ebb4..b417824a9b15 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1360,7 +1360,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 								int nodeid)
 {
 	struct page *page;
-	int nr_pages;
 
 	flags |= cachep->allocflags;
 
@@ -1370,17 +1369,11 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 		return NULL;
 	}
 
-	if (memcg_charge_slab(page, flags, cachep->gfporder, cachep)) {
+	if (charge_slab_page(page, flags, cachep->gfporder, cachep)) {
 		__free_pages(page, cachep->gfporder);
 		return NULL;
 	}
 
-	nr_pages = (1 << cachep->gfporder);
-	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
-		mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, nr_pages);
-	else
-		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, nr_pages);
-
 	__SetPageSlab(page);
 	/* Record if ALLOC_NO_WATERMARKS was set when allocating the slab */
 	if (sk_memalloc_socks() && page_is_pfmemalloc(page))
@@ -1395,12 +1388,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 {
 	int order = cachep->gfporder;
-	unsigned long nr_freed = (1 << order);
-
-	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
-		mod_lruvec_page_state(page, NR_SLAB_RECLAIMABLE, -nr_freed);
-	else
-		mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE, -nr_freed);
 
 	BUG_ON(!PageSlab(page));
 	__ClearPageSlabPfmemalloc(page);
@@ -1409,8 +1396,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 	page->mapping = NULL;
 
 	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += nr_freed;
-	memcg_uncharge_slab(page, order, cachep);
+		current->reclaim_state->reclaimed_slab += 1 << order;
+	uncharge_slab_page(page, order, cachep);
 	__free_pages(page, order);
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index d35d85794247..345fb59afb8f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -205,6 +205,12 @@ ssize_t slabinfo_write(struct file *file, const char __user *buffer,
 void __kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
 int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
 
+static inline int cache_vmstat_idx(struct kmem_cache *s)
+{
+	return (s->flags & SLAB_RECLAIM_ACCOUNT) ?
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 
 /* List of all root caches. */
@@ -362,6 +368,25 @@ static inline struct kmem_cache *virt_to_cache(const void *obj)
 	return page->slab_cache;
 }
 
+static __always_inline int charge_slab_page(struct page *page,
+					    gfp_t gfp, int order,
+					    struct kmem_cache *s)
+{
+	int ret = memcg_charge_slab(page, gfp, order, s);
+
+	if (!ret)
+		mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order);
+
+	return ret;
+}
+
+static __always_inline void uncharge_slab_page(struct page *page, int order,
+					       struct kmem_cache *s)
+{
+	mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order));
+	memcg_uncharge_slab(page, order, s);
+}
+
 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
 {
 	struct kmem_cache *cachep;
diff --git a/mm/slub.c b/mm/slub.c
index ae3b1e49ecec..6a5174b51cd6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1488,7 +1488,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	else
 		page = __alloc_pages_node(node, flags, order);
 
-	if (page && memcg_charge_slab(page, flags, order, s)) {
+	if (page && charge_slab_page(page, flags, order, s)) {
 		__free_pages(page, order);
 		page = NULL;
 	}
@@ -1681,11 +1681,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if (!page)
 		return NULL;
 
-	mod_lruvec_page_state(page,
-		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-		1 << oo_order(oo));
-
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 
 	return page;
@@ -1719,18 +1714,13 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 			check_object(s, page, p, SLUB_RED_INACTIVE);
 	}
 
-	mod_lruvec_page_state(page,
-		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-		-pages);
-
 	__ClearPageSlabPfmemalloc(page);
 	__ClearPageSlab(page);
 
 	page->mapping = NULL;
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += pages;
-	memcg_uncharge_slab(page, order, s);
+	uncharge_slab_page(page, order, s);
 	__free_pages(page, order);
 }
 
-- 
2.20.1



* [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (5 preceding siblings ...)
  2019-06-05  2:44 ` [PATCH v6 06/10] mm: unify SLAB and SLUB page accounting Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-05 16:56   ` Johannes Weiner
  2019-06-09 14:31   ` Vladimir Davydov
  2019-06-05  2:44 ` [PATCH v6 08/10] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Currently the memcg_params.dying flag and the corresponding
workqueue used for the asynchronous deactivation of kmem_caches
are synchronized using the slab_mutex.

That makes it impossible to check this flag from irq context, where
a sleeping lock such as the slab_mutex can't be taken; this will be
required in order to implement the asynchronous release of
kmem_caches.

So let's switch over to an irq-saving spinlock for this
synchronization.
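
For example, the release path added later in the series (patch 08) has
to perform roughly the following check, potentially from irq context
(a simplified sketch of that path, with the work_fn setup omitted):

  unsigned long flags;

  spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
  if (!s->memcg_params.root_cache->memcg_params.dying)
          queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
  spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);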

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/slab_common.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 09b26673b63f..2914a8f0aa85 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -130,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 #ifdef CONFIG_MEMCG_KMEM
 
 LIST_HEAD(slab_root_caches);
+static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
 
 void slab_init_memcg_params(struct kmem_cache *s)
 {
@@ -629,6 +630,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	struct memcg_cache_array *arr;
 	struct kmem_cache *s = NULL;
 	char *cache_name;
+	bool dying;
 	int idx;
 
 	get_online_cpus();
@@ -640,7 +642,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	 * The memory cgroup could have been offlined while the cache
 	 * creation work was pending.
 	 */
-	if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
+	if (memcg->kmem_state != KMEM_ONLINE)
+		goto out_unlock;
+
+	spin_lock_irq(&memcg_kmem_wq_lock);
+	dying = root_cache->memcg_params.dying;
+	spin_unlock_irq(&memcg_kmem_wq_lock);
+	if (dying)
 		goto out_unlock;
 
 	idx = memcg_cache_id(memcg);
@@ -735,14 +743,17 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
 
 	__kmemcg_cache_deactivate(s);
 
+	spin_lock_irq(&memcg_kmem_wq_lock);
 	if (s->memcg_params.root_cache->memcg_params.dying)
-		return;
+		goto unlock;
 
 	/* pin memcg so that @s doesn't get destroyed in the middle */
 	css_get(&s->memcg_params.memcg->css);
 
 	s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
 	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
+unlock:
+	spin_unlock_irq(&memcg_kmem_wq_lock);
 }
 
 void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
@@ -852,9 +863,9 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 
 static void flush_memcg_workqueue(struct kmem_cache *s)
 {
-	mutex_lock(&slab_mutex);
+	spin_lock_irq(&memcg_kmem_wq_lock);
 	s->memcg_params.dying = true;
-	mutex_unlock(&slab_mutex);
+	spin_unlock_irq(&memcg_kmem_wq_lock);
 
 	/*
 	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
-- 
2.20.1



* [PATCH v6 08/10] mm: rework non-root kmem_cache lifecycle management
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (6 preceding siblings ...)
  2019-06-05  2:44 ` [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-09 17:09   ` Vladimir Davydov
  2019-06-05  2:44 ` [PATCH v6 09/10] mm: stop setting page->mem_cgroup pointer for slab pages Roman Gushchin
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Currently each charged slab page holds a reference to the cgroup to
which it's charged. Kmem_caches are held by the memcg and are released
all together with the memory cgroup. It means that none of the
kmem_caches are released until the last reference to the memcg is gone,
which is very far from optimal.

Let's rework it in a way that allows releasing individual kmem_caches
as soon as the cgroup is offline, the kmem_cache is empty, and there
are no pending allocations.

To make it possible, let's introduce a new percpu refcounter for
non-root kmem caches. The counter is initialized in percpu mode
and is switched to atomic mode during kmem_cache deactivation. The
counter is bumped for every charged page and also for every running
allocation, so the kmem_cache can't be released until all outstanding
allocations complete.

To shut down inactive empty kmem_caches, let's reuse the workqueue
previously used for kmem_cache deactivation. Once the reference
counter reaches zero, an asynchronous kmem_cache release is scheduled.
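
A condensed view of the refcounter lifecycle implemented below
(a simplified sketch; locking and error handling are omitted):

  /* cache creation: the counter starts in percpu (fast) mode */
  percpu_ref_init(&s->memcg_params.refcnt, kmemcg_cache_shutdown,
                  0, GFP_KERNEL);

  /* charging/uncharging a slab page of a given order */
  percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
  percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);

  /* a running allocation pins the cache as well
   * (memcg_kmem_get_cache() / memcg_kmem_put_cache()) */
  if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
          cachep = memcg_cachep;
  ...
  percpu_ref_put(&cachep->memcg_params.refcnt);

  /* deactivation: switch to atomic mode and drop the initial reference;
   * kmemcg_cache_shutdown() is called once the counter hits zero and
   * schedules the actual kmem_cache release from a workqueue */
  percpu_ref_kill(&s->memcg_params.refcnt);

  /* after shutdown_cache(): free the percpu counter */
  percpu_ref_exit(&s->memcg_params.refcnt);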

* I used the following simple approach to test the performance
(stolen from another patchset by T. Harding):

    time find / -name fname-no-exist
    echo 2 > /proc/sys/vm/drop_caches
    repeat 10 times

Results:

        orig		patched

real	0m1.455s	real	0m1.355s
user	0m0.206s	user	0m0.219s
sys	0m0.855s	sys	0m0.807s

real	0m1.487s	real	0m1.699s
user	0m0.221s	user	0m0.256s
sys	0m0.806s	sys	0m0.948s

real	0m1.515s	real	0m1.505s
user	0m0.183s	user	0m0.215s
sys	0m0.876s	sys	0m0.858s

real	0m1.291s	real	0m1.380s
user	0m0.193s	user	0m0.198s
sys	0m0.843s	sys	0m0.786s

real	0m1.364s	real	0m1.374s
user	0m0.180s	user	0m0.182s
sys	0m0.868s	sys	0m0.806s

real	0m1.352s	real	0m1.312s
user	0m0.201s	user	0m0.212s
sys	0m0.820s	sys	0m0.761s

real	0m1.302s	real	0m1.349s
user	0m0.205s	user	0m0.203s
sys	0m0.803s	sys	0m0.792s

real	0m1.334s	real	0m1.301s
user	0m0.194s	user	0m0.201s
sys	0m0.806s	sys	0m0.779s

real	0m1.426s	real	0m1.434s
user	0m0.216s	user	0m0.181s
sys	0m0.824s	sys	0m0.864s

real	0m1.350s	real	0m1.295s
user	0m0.200s	user	0m0.190s
sys	0m0.842s	sys	0m0.811s

So it looks like the difference is not noticeable in this test.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slab.h |  3 +-
 mm/memcontrol.c      | 51 +++++++++++++++++++++-------
 mm/slab.h            | 45 +++++++------------------
 mm/slab_common.c     | 79 ++++++++++++++++++++++++++------------------
 4 files changed, 100 insertions(+), 78 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 47923c173f30..1b54e5f83342 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -16,6 +16,7 @@
 #include <linux/overflow.h>
 #include <linux/types.h>
 #include <linux/workqueue.h>
+#include <linux/percpu-refcount.h>
 
 
 /*
@@ -152,7 +153,6 @@ int kmem_cache_shrink(struct kmem_cache *);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
-void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -641,6 +641,7 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head children_node;
 			struct list_head kmem_caches_node;
+			struct percpu_ref refcnt;
 
 			void (*work_fn)(struct kmem_cache *);
 			union {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3427396da612..49084e2d81ff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2591,12 +2591,13 @@ static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 {
 	struct memcg_kmem_cache_create_work *cw;
 
+	if (!css_tryget_online(&memcg->css))
+		return;
+
 	cw = kmalloc(sizeof(*cw), GFP_NOWAIT | __GFP_NOWARN);
 	if (!cw)
 		return;
 
-	css_get(&memcg->css);
-
 	cw->memcg = memcg;
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
@@ -2631,6 +2632,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
+	struct memcg_cache_array *arr;
 	int kmemcg_id;
 
 	VM_BUG_ON(!is_root_cache(cachep));
@@ -2638,14 +2640,29 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	if (memcg_kmem_bypass())
 		return cachep;
 
-	memcg = get_mem_cgroup_from_current();
+	rcu_read_lock();
+
+	if (unlikely(current->active_memcg))
+		memcg = current->active_memcg;
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	if (!memcg || memcg == root_mem_cgroup)
+		goto out_unlock;
+
 	kmemcg_id = READ_ONCE(memcg->kmemcg_id);
 	if (kmemcg_id < 0)
-		goto out;
+		goto out_unlock;
+
+	arr = rcu_dereference(cachep->memcg_params.memcg_caches);
 
-	memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
-	if (likely(memcg_cachep))
-		return memcg_cachep;
+	/*
+	 * Make sure we will access the up-to-date value. The code updating
+	 * memcg_caches issues a write barrier to match this (see
+	 * memcg_create_kmem_cache()).
+	 */
+	smp_rmb();
+	memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
 
 	/*
 	 * If we are in a safe context (can wait, and not in interrupt
@@ -2658,10 +2675,20 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 	 * memcg_create_kmem_cache, this means no further allocation
 	 * could happen with the slab_mutex held. So it's better to
 	 * defer everything.
+	 *
+	 * If the memcg is dying or memcg_cache is about to be released,
+	 * don't bother creating new kmem_caches. Because memcg_cachep
+	 * is ZEROed as the fist step of kmem offlining, we don't need
+	 * percpu_ref_tryget_live() here. css_tryget_online() check in
+	 * memcg_schedule_kmem_cache_create() will prevent us from
+	 * creation of a new kmem_cache.
 	 */
-	memcg_schedule_kmem_cache_create(memcg, cachep);
-out:
-	css_put(&memcg->css);
+	if (unlikely(!memcg_cachep))
+		memcg_schedule_kmem_cache_create(memcg, cachep);
+	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
+		cachep = memcg_cachep;
+out_unlock:
+	rcu_read_unlock();
 	return cachep;
 }
 
@@ -2672,7 +2699,7 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
 void memcg_kmem_put_cache(struct kmem_cache *cachep)
 {
 	if (!is_root_cache(cachep))
-		css_put(&cachep->memcg_params.memcg->css);
+		percpu_ref_put(&cachep->memcg_params.refcnt);
 }
 
 /**
@@ -3219,7 +3246,7 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
 		memcg_offline_kmem(memcg);
 
 	if (memcg->kmem_state == KMEM_ALLOCATED) {
-		memcg_destroy_kmem_caches(memcg);
+		WARN_ON(!list_empty(&memcg->kmem_caches));
 		static_branch_dec(&memcg_kmem_enabled_key);
 		WARN_ON(page_counter_read(&memcg->kmem));
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 345fb59afb8f..5d2b8511e6fb 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -248,32 +248,6 @@ static inline const char *cache_name(struct kmem_cache *s)
 	return s->name;
 }
 
-/*
- * Note, we protect with RCU only the memcg_caches array, not per-memcg caches.
- * That said the caller must assure the memcg's cache won't go away by either
- * taking a css reference to the owner cgroup, or holding the slab_mutex.
- */
-static inline struct kmem_cache *
-cache_from_memcg_idx(struct kmem_cache *s, int idx)
-{
-	struct kmem_cache *cachep;
-	struct memcg_cache_array *arr;
-
-	rcu_read_lock();
-	arr = rcu_dereference(s->memcg_params.memcg_caches);
-
-	/*
-	 * Make sure we will access the up-to-date value. The code updating
-	 * memcg_caches issues a write barrier to match this (see
-	 * memcg_create_kmem_cache()).
-	 */
-	smp_rmb();
-	cachep = READ_ONCE(arr->entries[idx]);
-	rcu_read_unlock();
-
-	return cachep;
-}
-
 static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 {
 	if (is_root_cache(s))
@@ -285,14 +259,25 @@ static __always_inline int memcg_charge_slab(struct page *page,
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
 {
+	int ret;
+
 	if (is_root_cache(s))
 		return 0;
-	return memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+
+	ret = memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+	if (ret)
+		return ret;
+
+	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+
+	return 0;
 }
 
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
+	if (!is_root_cache(s))
+		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 	memcg_kmem_uncharge(page, order);
 }
 
@@ -324,12 +309,6 @@ static inline const char *cache_name(struct kmem_cache *s)
 	return s->name;
 }
 
-static inline struct kmem_cache *
-cache_from_memcg_idx(struct kmem_cache *s, int idx)
-{
-	return NULL;
-}
-
 static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 {
 	return s;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 2914a8f0aa85..8255283025e3 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -132,6 +132,8 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 LIST_HEAD(slab_root_caches);
 static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
 
+static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref);
+
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
@@ -146,6 +148,12 @@ static int init_memcg_params(struct kmem_cache *s,
 	struct memcg_cache_array *arr;
 
 	if (root_cache) {
+		int ret = percpu_ref_init(&s->memcg_params.refcnt,
+					  kmemcg_cache_shutdown,
+					  0, GFP_KERNEL);
+		if (ret)
+			return ret;
+
 		s->memcg_params.root_cache = root_cache;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
@@ -171,6 +179,8 @@ static void destroy_memcg_params(struct kmem_cache *s)
 {
 	if (is_root_cache(s))
 		kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
+	else
+		percpu_ref_exit(&s->memcg_params.refcnt);
 }
 
 static void free_memcg_params(struct rcu_head *rcu)
@@ -226,6 +236,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 	if (is_root_cache(s)) {
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
+		css_get(&memcg->css);
 		s->memcg_params.memcg = memcg;
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
@@ -241,6 +252,7 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 	} else {
 		list_del(&s->memcg_params.children_node);
 		list_del(&s->memcg_params.kmem_caches_node);
+		css_put(&s->memcg_params.memcg->css);
 	}
 }
 #else
@@ -686,7 +698,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	}
 
 	/*
-	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
+	 * Since readers won't lock (see memcg_kmem_get_cache()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
 	 * initialized.
 	 */
@@ -711,14 +723,12 @@ static void kmemcg_workfn(struct work_struct *work)
 	mutex_lock(&slab_mutex);
 
 	s->memcg_params.work_fn(s);
+	s->memcg_params.work_fn = NULL;
 
 	mutex_unlock(&slab_mutex);
 
 	put_online_mems();
 	put_online_cpus();
-
-	/* done, put the ref from kmemcg_cache_deactivate() */
-	css_put(&s->memcg_params.memcg->css);
 }
 
 static void kmemcg_rcufn(struct rcu_head *head)
@@ -735,10 +745,39 @@ static void kmemcg_rcufn(struct rcu_head *head)
 	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
 }
 
+static void kmemcg_cache_shutdown_fn(struct kmem_cache *s)
+{
+	WARN_ON(shutdown_cache(s));
+}
+
+static void kmemcg_cache_shutdown(struct percpu_ref *percpu_ref)
+{
+	struct kmem_cache *s = container_of(percpu_ref, struct kmem_cache,
+					    memcg_params.refcnt);
+	unsigned long flags;
+
+	spin_lock_irqsave(&memcg_kmem_wq_lock, flags);
+	if (s->memcg_params.root_cache->memcg_params.dying)
+		goto unlock;
+
+	WARN_ON(s->memcg_params.work_fn);
+	s->memcg_params.work_fn = kmemcg_cache_shutdown_fn;
+	INIT_WORK(&s->memcg_params.work, kmemcg_workfn);
+	queue_work(memcg_kmem_cache_wq, &s->memcg_params.work);
+
+unlock:
+	spin_unlock_irqrestore(&memcg_kmem_wq_lock, flags);
+}
+
+static void kmemcg_cache_deactivate_after_rcu(struct kmem_cache *s)
+{
+	__kmemcg_cache_deactivate_after_rcu(s);
+	percpu_ref_kill(&s->memcg_params.refcnt);
+}
+
 static void kmemcg_cache_deactivate(struct kmem_cache *s)
 {
-	if (WARN_ON_ONCE(is_root_cache(s)) ||
-	    WARN_ON_ONCE(s->memcg_params.work_fn))
+	if (WARN_ON_ONCE(is_root_cache(s)))
 		return;
 
 	__kmemcg_cache_deactivate(s);
@@ -747,10 +786,8 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
 	if (s->memcg_params.root_cache->memcg_params.dying)
 		goto unlock;
 
-	/* pin memcg so that @s doesn't get destroyed in the middle */
-	css_get(&s->memcg_params.memcg->css);
-
-	s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
+	WARN_ON_ONCE(s->memcg_params.work_fn);
+	s->memcg_params.work_fn = kmemcg_cache_deactivate_after_rcu;
 	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
 unlock:
 	spin_unlock_irq(&memcg_kmem_wq_lock);
@@ -784,28 +821,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	put_online_cpus();
 }
 
-void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
-{
-	struct kmem_cache *s, *s2;
-
-	get_online_cpus();
-	get_online_mems();
-
-	mutex_lock(&slab_mutex);
-	list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
-				 memcg_params.kmem_caches_node) {
-		/*
-		 * The cgroup is about to be freed and therefore has no charges
-		 * left. Hence, all its caches must be empty by now.
-		 */
-		BUG_ON(shutdown_cache(s));
-	}
-	mutex_unlock(&slab_mutex);
-
-	put_online_mems();
-	put_online_cpus();
-}
-
 static int shutdown_memcg_caches(struct kmem_cache *s)
 {
 	struct memcg_cache_array *arr;
-- 
2.20.1



* [PATCH v6 09/10] mm: stop setting page->mem_cgroup pointer for slab pages
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (7 preceding siblings ...)
  2019-06-05  2:44 ` [PATCH v6 08/10] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-09 17:09   ` Vladimir Davydov
  2019-06-05  2:44 ` [PATCH v6 10/10] mm: reparent slab memory on cgroup removal Roman Gushchin
  2019-06-05  4:14 ` [PATCH v6 00/10] " Andrew Morton
  10 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Every slab page charged to a non-root memory cgroup has a pointer
to the memory cgroup and holds a reference to it, which protects
a non-empty memory cgroup from being released. At the same time
the page has a pointer to the corresponding kmem_cache, and also
hold a reference to the kmem_cache. And kmem_cache by itself
holds a reference to the cgroup.

So there is clearly some redundancy, which allows to stop setting
the page->mem_cgroup pointer and rely on getting memcg pointer
indirectly via kmem_cache. Further it will allow to change this
pointer easier, without a need to go over all charged pages.

So let's stop setting the page->mem_cgroup pointer for slab pages,
and stop using the css refcounter directly to protect the memory
cgroup from going away. Instead, rely on the kmem_cache as an
intermediate object.

Make sure that vmstats and shrinker lists keep working as before,
as does the /proc/kpagecgroup interface.
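
For readers skimming the diff, this is a condensed sketch of the new
lookup path; it mirrors the memcg_from_slab_page() helper added to
mm/slab.h below and is not an additional change:

	/* sketch: page->slab_cache -> kmem_cache -> memcg */
	static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
	{
		struct kmem_cache *s = READ_ONCE(page->slab_cache);

		/* root caches are not accounted to any memcg */
		if (s && !is_root_cache(s))
			return s->memcg_params.memcg;

		return NULL;
	}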

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 mm/list_lru.c   |  3 +-
 mm/memcontrol.c | 12 ++++----
 mm/slab.h       | 74 ++++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 70 insertions(+), 19 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 927d85be32f6..0f1f6b06b7f3 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -12,6 +12,7 @@
 #include <linux/slab.h>
 #include <linux/mutex.h>
 #include <linux/memcontrol.h>
+#include "slab.h"
 
 #ifdef CONFIG_MEMCG_KMEM
 static LIST_HEAD(list_lrus);
@@ -63,7 +64,7 @@ static __always_inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
 	if (!memcg_kmem_enabled())
 		return NULL;
 	page = virt_to_head_page(ptr);
-	return page->mem_cgroup;
+	return memcg_from_slab_page(page);
 }
 
 static inline struct list_lru_one *
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 49084e2d81ff..c097b1fc74ec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -485,7 +485,10 @@ ino_t page_cgroup_ino(struct page *page)
 	unsigned long ino = 0;
 
 	rcu_read_lock();
-	memcg = READ_ONCE(page->mem_cgroup);
+	if (PageHead(page) && PageSlab(page))
+		memcg = memcg_from_slab_page(page);
+	else
+		memcg = READ_ONCE(page->mem_cgroup);
 	while (memcg && !(memcg->css.flags & CSS_ONLINE))
 		memcg = parent_mem_cgroup(memcg);
 	if (memcg)
@@ -2727,9 +2730,6 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 		cancel_charge(memcg, nr_pages);
 		return -ENOMEM;
 	}
-
-	page->mem_cgroup = memcg;
-
 	return 0;
 }
 
@@ -2752,8 +2752,10 @@ int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
 	memcg = get_mem_cgroup_from_current();
 	if (!mem_cgroup_is_root(memcg)) {
 		ret = __memcg_kmem_charge_memcg(page, gfp, order, memcg);
-		if (!ret)
+		if (!ret) {
+			page->mem_cgroup = memcg;
 			__SetPageKmemcg(page);
+		}
 	}
 	css_put(&memcg->css);
 	return ret;
diff --git a/mm/slab.h b/mm/slab.h
index 5d2b8511e6fb..7ead47cb9338 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -255,30 +255,67 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s->memcg_params.root_cache;
 }
 
+/*
+ * Expects a pointer to a slab page. Please note, that PageSlab() check
+ * isn't sufficient, as it returns true also for tail compound slab pages,
+ * which do not have slab_cache pointer set.
+ * So this function assumes that the page can pass PageHead() and PageSlab()
+ * checks.
+ */
+static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+{
+	struct kmem_cache *s;
+
+	s = READ_ONCE(page->slab_cache);
+	if (s && !is_root_cache(s))
+		return s->memcg_params.memcg;
+
+	return NULL;
+}
+
+/*
+ * Charge the slab page belonging to the non-root kmem_cache.
+ * Can be called for non-root kmem_caches only.
+ */
 static __always_inline int memcg_charge_slab(struct page *page,
 					     gfp_t gfp, int order,
 					     struct kmem_cache *s)
 {
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
 	int ret;
 
-	if (is_root_cache(s))
-		return 0;
-
-	ret = memcg_kmem_charge_memcg(page, gfp, order, s->memcg_params.memcg);
+	memcg = s->memcg_params.memcg;
+	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
 	if (ret)
 		return ret;
 
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
+
+	/* transfer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
+	css_put_many(&memcg->css, 1 << order);
 
 	return 0;
 }
 
+/*
+ * Uncharge a slab page belonging to a non-root kmem_cache.
+ * Can be called for non-root kmem_caches only.
+ */
 static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 						struct kmem_cache *s)
 {
-	if (!is_root_cache(s))
-		percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
-	memcg_kmem_uncharge(page, order);
+	struct mem_cgroup *memcg;
+	struct lruvec *lruvec;
+
+	memcg = s->memcg_params.memcg;
+	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
+	mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
+	memcg_kmem_uncharge_memcg(page, order, memcg);
+
+	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
@@ -314,6 +351,11 @@ static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 	return s;
 }
 
+static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
+{
+	return NULL;
+}
+
 static inline int memcg_charge_slab(struct page *page, gfp_t gfp, int order,
 				    struct kmem_cache *s)
 {
@@ -351,18 +393,24 @@ static __always_inline int charge_slab_page(struct page *page,
 					    gfp_t gfp, int order,
 					    struct kmem_cache *s)
 {
-	int ret = memcg_charge_slab(page, gfp, order, s);
-
-	if (!ret)
-		mod_lruvec_page_state(page, cache_vmstat_idx(s), 1 << order);
+	if (is_root_cache(s)) {
+		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+				    1 << order);
+		return 0;
+	}
 
-	return ret;
+	return memcg_charge_slab(page, gfp, order, s);
 }
 
 static __always_inline void uncharge_slab_page(struct page *page, int order,
 					       struct kmem_cache *s)
 {
-	mod_lruvec_page_state(page, cache_vmstat_idx(s), -(1 << order));
+	if (is_root_cache(s)) {
+		mod_node_page_state(page_pgdat(page), cache_vmstat_idx(s),
+				    -(1 << order));
+		return;
+	}
+
 	memcg_uncharge_slab(page, order, s);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v6 10/10] mm: reparent slab memory on cgroup removal
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (8 preceding siblings ...)
  2019-06-05  2:44 ` [PATCH v6 09/10] mm: stop setting page->mem_cgroup pointer for slab pages Roman Gushchin
@ 2019-06-05  2:44 ` Roman Gushchin
  2019-06-09 17:18   ` Vladimir Davydov
  2019-06-05  4:14 ` [PATCH v6 00/10] " Andrew Morton
  10 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05  2:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long, Roman Gushchin

Let's reparent memcg slab memory on memcg offlining. This allows us
to release the memory cgroup without waiting for the last outstanding
kernel object (e.g. a dentry used by another application).

So instead of reparenting all accounted slab pages, let's reparent
a relatively small number of kmem_caches. Reparenting is performed as
part of the deactivation process.

Since the parent cgroup is already charged, all we need to do is
splice the list of kmem_caches onto the parent's kmem_caches list,
swap the memcg pointer, drop the css refcounter for each kmem_cache,
and adjust the parent's css refcounter. Quite simple.
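
Condensed from the memcg_deactivate_kmem_caches() hunk below, that
step boils down to (slab_mutex held):

	nr_reparented = 0;
	list_for_each_entry(s, &memcg->kmem_caches,
			    memcg_params.kmem_caches_node) {
		rcu_assign_pointer(s->memcg_params.memcg, parent);
		css_put(&memcg->css);		/* drop the child's reference */
		nr_reparented++;
	}
	if (nr_reparented) {
		list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);
		css_get_many(&parent->css, nr_reparented); /* take them on the parent */
	}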

Please note that kmem_cache->memcg_params.memcg isn't a stable
pointer anymore. It's safe to read it under rcu_read_lock() or
with slab_mutex held.
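
For example, the charge path reads it like this (taken from the
memcg_charge_slab() hunk in this patch):

	rcu_read_lock();
	memcg = rcu_dereference(s->memcg_params.memcg);
	/* an offlined memcg may be going away: fall back to an online ancestor */
	while (memcg && !css_tryget_online(&memcg->css))
		memcg = parent_mem_cgroup(memcg);
	rcu_read_unlock();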

We can race with the slab allocation and deallocation paths. It's not
a big problem: the parent's charge and the global slab stats are always
correct, and we don't care anymore about the child's usage and stats.
The child cgroup is already offline, so we don't use or show it
anywhere.

Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE)
aren't used anywhere except count_shadow_nodes(). But even there it
won't break anything: after reparenting, "nodes" will be 0 at the child
level (because we're already reparenting shrinker lists), and at the
parent level the page stats have always been 0, so this patch won't
change anything.

Signed-off-by: Roman Gushchin <guro@fb.com>
---
 include/linux/slab.h |  4 ++--
 mm/list_lru.c        |  8 +++++++-
 mm/memcontrol.c      | 14 ++++++++------
 mm/slab.h            | 23 +++++++++++++++++------
 mm/slab_common.c     | 22 +++++++++++++++++++---
 5 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1b54e5f83342..109cab2ad9b4 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -152,7 +152,7 @@ void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
-void memcg_deactivate_kmem_caches(struct mem_cgroup *);
+void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -638,7 +638,7 @@ struct memcg_cache_params {
 			bool dying;
 		};
 		struct {
-			struct mem_cgroup *memcg;
+			struct mem_cgroup __rcu *memcg;
 			struct list_head children_node;
 			struct list_head kmem_caches_node;
 			struct percpu_ref refcnt;
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0f1f6b06b7f3..0b2319897e86 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -77,11 +77,15 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
 	if (!nlru->memcg_lrus)
 		goto out;
 
+	rcu_read_lock();
 	memcg = mem_cgroup_from_kmem(ptr);
-	if (!memcg)
+	if (!memcg) {
+		rcu_read_unlock();
 		goto out;
+	}
 
 	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+	rcu_read_unlock();
 out:
 	if (memcg_ptr)
 		*memcg_ptr = memcg;
@@ -131,12 +135,14 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
 
 	spin_lock(&nlru->lock);
 	if (list_empty(item)) {
+		rcu_read_lock();
 		l = list_lru_from_kmem(nlru, item, &memcg);
 		list_add_tail(item, &l->list);
 		/* Set shrinker bit if the first element was added */
 		if (!l->nr_items++)
 			memcg_set_shrinker_bit(memcg, nid,
 					       lru_shrinker_id(lru));
+		rcu_read_unlock();
 		nlru->nr_items++;
 		spin_unlock(&nlru->lock);
 		return true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c097b1fc74ec..0f64a2c06803 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3209,15 +3209,15 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmem_state = KMEM_ALLOCATED;
 
-	memcg_deactivate_kmem_caches(memcg);
-
-	kmemcg_id = memcg->kmemcg_id;
-	BUG_ON(kmemcg_id < 0);
-
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
 
+	memcg_deactivate_kmem_caches(memcg, parent);
+
+	kmemcg_id = memcg->kmemcg_id;
+	BUG_ON(kmemcg_id < 0);
+
 	/*
 	 * Change kmemcg_id of this cgroup and all its descendants to the
 	 * parent's id, and then move all entries from this cgroup's list_lrus
@@ -3250,7 +3250,6 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
 	if (memcg->kmem_state == KMEM_ALLOCATED) {
 		WARN_ON(!list_empty(&memcg->kmem_caches));
 		static_branch_dec(&memcg_kmem_enabled_key);
-		WARN_ON(page_counter_read(&memcg->kmem));
 	}
 }
 #else
@@ -4675,6 +4674,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 
 	/* The following stuff does not apply to the root */
 	if (!parent) {
+#ifdef CONFIG_MEMCG_KMEM
+		INIT_LIST_HEAD(&memcg->kmem_caches);
+#endif
 		root_mem_cgroup = memcg;
 		return &memcg->css;
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 7ead47cb9338..34bf92382ecd 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -268,7 +268,7 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
 
 	s = READ_ONCE(page->slab_cache);
 	if (s && !is_root_cache(s))
-		return s->memcg_params.memcg;
+		return rcu_dereference(s->memcg_params.memcg);
 
 	return NULL;
 }
@@ -285,10 +285,18 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	struct lruvec *lruvec;
 	int ret;
 
-	memcg = s->memcg_params.memcg;
+	rcu_read_lock();
+	memcg = rcu_dereference(s->memcg_params.memcg);
+	while (memcg && !css_tryget_online(&memcg->css))
+		memcg = parent_mem_cgroup(memcg);
+	rcu_read_unlock();
+
+	if (unlikely(!memcg))
+		return true;
+
 	ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
 	if (ret)
-		return ret;
+		goto out;
 
 	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), 1 << order);
@@ -296,8 +304,9 @@ static __always_inline int memcg_charge_slab(struct page *page,
 	/* transfer try_charge() page references to kmem_cache */
 	percpu_ref_get_many(&s->memcg_params.refcnt, 1 << order);
 	css_put_many(&memcg->css, 1 << order);
-
-	return 0;
+out:
+	css_put(&memcg->css);
+	return ret;
 }
 
 /*
@@ -310,10 +319,12 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
-	memcg = s->memcg_params.memcg;
+	rcu_read_lock();
+	memcg = rcu_dereference(s->memcg_params.memcg);
 	lruvec = mem_cgroup_lruvec(page_pgdat(page), memcg);
 	mod_lruvec_state(lruvec, cache_vmstat_idx(s), -(1 << order));
 	memcg_kmem_uncharge_memcg(page, order, memcg);
+	rcu_read_unlock();
 
 	percpu_ref_put_many(&s->memcg_params.refcnt, 1 << order);
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8255283025e3..00b380f5d467 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -237,7 +237,7 @@ void memcg_link_cache(struct kmem_cache *s, struct mem_cgroup *memcg)
 		list_add(&s->root_caches_node, &slab_root_caches);
 	} else {
 		css_get(&memcg->css);
-		s->memcg_params.memcg = memcg;
+		rcu_assign_pointer(s->memcg_params.memcg, memcg);
 		list_add(&s->memcg_params.children_node,
 			 &s->memcg_params.root_cache->memcg_params.children);
 		list_add(&s->memcg_params.kmem_caches_node,
@@ -252,7 +252,9 @@ static void memcg_unlink_cache(struct kmem_cache *s)
 	} else {
 		list_del(&s->memcg_params.children_node);
 		list_del(&s->memcg_params.kmem_caches_node);
-		css_put(&s->memcg_params.memcg->css);
+		mem_cgroup_put(rcu_dereference_protected(s->memcg_params.memcg,
+			lockdep_is_held(&slab_mutex)));
+		rcu_assign_pointer(s->memcg_params.memcg, NULL);
 	}
 }
 #else
@@ -793,11 +795,13 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
 	spin_unlock_irq(&memcg_kmem_wq_lock);
 }
 
-void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
+void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg,
+				  struct mem_cgroup *parent)
 {
 	int idx;
 	struct memcg_cache_array *arr;
 	struct kmem_cache *s, *c;
+	unsigned int nr_reparented;
 
 	idx = memcg_cache_id(memcg);
 
@@ -815,6 +819,18 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		kmemcg_cache_deactivate(c);
 		arr->entries[idx] = NULL;
 	}
+	nr_reparented = 0;
+	list_for_each_entry(s, &memcg->kmem_caches,
+			    memcg_params.kmem_caches_node) {
+		rcu_assign_pointer(s->memcg_params.memcg, parent);
+		css_put(&memcg->css);
+		nr_reparented++;
+	}
+	if (nr_reparented) {
+		list_splice_init(&memcg->kmem_caches,
+				 &parent->kmem_caches);
+		css_get_many(&parent->css, nr_reparented);
+	}
 	mutex_unlock(&slab_mutex);
 
 	put_online_mems();
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 00/10] mm: reparent slab memory on cgroup removal
  2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
                   ` (9 preceding siblings ...)
  2019-06-05  2:44 ` [PATCH v6 10/10] mm: reparent slab memory on cgroup removal Roman Gushchin
@ 2019-06-05  4:14 ` Andrew Morton
  2019-06-05 20:45   ` Roman Gushchin
  10 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2019-06-05  4:14 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, linux-kernel, kernel-team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long

On Tue, 4 Jun 2019 19:44:44 -0700 Roman Gushchin <guro@fb.com> wrote:

> So instead of trying to find a maybe non-existing balance, let's do reparent
> the accounted slabs to the parent cgroup on cgroup removal.

s/slabs/slab caches/.  Take more care with the terminology, please...

> There is a bonus: currently we do release empty kmem_caches on cgroup
> removal, however all other are waiting for the releasing of the memory cgroup.
> These refactorings allow kmem_caches to be released as soon as they
> become inactive and free.

Unclear.

s/All other/releasing of all non-empty slab caches depends upon the releasing/

I think?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer
  2019-06-05  2:44 ` [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer Roman Gushchin
@ 2019-06-05  4:35   ` Shakeel Butt
  2019-06-05 17:14     ` Roman Gushchin
  2019-06-05 16:42   ` Johannes Weiner
  2019-06-09 12:10   ` Vladimir Davydov
  2 siblings, 1 reply; 31+ messages in thread
From: Shakeel Butt @ 2019-06-05  4:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Vladimir Davydov, Waiman Long

On Tue, Jun 4, 2019 at 7:45 PM Roman Gushchin <guro@fb.com> wrote:
>
> Johannes noticed that reading the memcg kmem_cache pointer in
> cache_from_memcg_idx() is performed using READ_ONCE() macro,
> which doesn't implement a SMP barrier, which is required
> by the logic.
>
> Add a proper smp_rmb() to be paired with smp_wmb() in
> memcg_create_kmem_cache().
>
> The same applies to memcg_create_kmem_cache() itself,
> which reads the same value without barriers and READ_ONCE().
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

This seems independent of the series. Shouldn't this be Cc'ed to stable?

> ---
>  mm/slab.h        | 1 +
>  mm/slab_common.c | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 739099af6cbb..1176b61bb8fc 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -260,6 +260,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
>          * memcg_caches issues a write barrier to match this (see
>          * memcg_create_kmem_cache()).
>          */
> +       smp_rmb();
>         cachep = READ_ONCE(arr->entries[idx]);
>         rcu_read_unlock();
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 58251ba63e4a..8092bdfc05d5 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -652,7 +652,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>          * allocation (see memcg_kmem_get_cache()), several threads can try to
>          * create the same cache, but only one of them may succeed.
>          */
> -       if (arr->entries[idx])
> +       smp_rmb();
> +       if (READ_ONCE(arr->entries[idx]))
>                 goto out_unlock;
>
>         cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
> --
> 2.20.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer
  2019-06-05  2:44 ` [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer Roman Gushchin
  2019-06-05  4:35   ` Shakeel Butt
@ 2019-06-05 16:42   ` Johannes Weiner
  2019-06-09 12:10   ` Vladimir Davydov
  2 siblings, 0 replies; 31+ messages in thread
From: Johannes Weiner @ 2019-06-05 16:42 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Tue, Jun 04, 2019 at 07:44:45PM -0700, Roman Gushchin wrote:
> Johannes noticed that reading the memcg kmem_cache pointer in
> cache_from_memcg_idx() is performed using READ_ONCE() macro,
> which doesn't implement a SMP barrier, which is required
> by the logic.
> 
> Add a proper smp_rmb() to be paired with smp_wmb() in
> memcg_create_kmem_cache().
> 
> The same applies to memcg_create_kmem_cache() itself,
> which reads the same value without barriers and READ_ONCE().
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock
  2019-06-05  2:44 ` [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock Roman Gushchin
@ 2019-06-05 16:56   ` Johannes Weiner
  2019-06-05 22:02     ` Roman Gushchin
  2019-06-09 14:31   ` Vladimir Davydov
  1 sibling, 1 reply; 31+ messages in thread
From: Johannes Weiner @ 2019-06-05 16:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Tue, Jun 04, 2019 at 07:44:51PM -0700, Roman Gushchin wrote:
> Currently the memcg_params.dying flag and the corresponding
> workqueue used for the asynchronous deactivation of kmem_caches
> is synchronized using the slab_mutex.
> 
> It makes impossible to check this flag from the irq context,
> which will be required in order to implement asynchronous release
> of kmem_caches.
> 
> So let's switch over to the irq-save flavor of the spinlock-based
> synchronization.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/slab_common.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 09b26673b63f..2914a8f0aa85 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -130,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
>  #ifdef CONFIG_MEMCG_KMEM
>  
>  LIST_HEAD(slab_root_caches);
> +static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
>  
>  void slab_init_memcg_params(struct kmem_cache *s)
>  {
> @@ -629,6 +630,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>  	struct memcg_cache_array *arr;
>  	struct kmem_cache *s = NULL;
>  	char *cache_name;
> +	bool dying;
>  	int idx;
>  
>  	get_online_cpus();
> @@ -640,7 +642,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>  	 * The memory cgroup could have been offlined while the cache
>  	 * creation work was pending.
>  	 */
> -	if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
> +	if (memcg->kmem_state != KMEM_ONLINE)
> +		goto out_unlock;
> +
> +	spin_lock_irq(&memcg_kmem_wq_lock);
> +	dying = root_cache->memcg_params.dying;
> +	spin_unlock_irq(&memcg_kmem_wq_lock);
> +	if (dying)
>  		goto out_unlock;

What does this lock protect? The dying flag could get set right after
the unlock.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer
  2019-06-05  4:35   ` Shakeel Butt
@ 2019-06-05 17:14     ` Roman Gushchin
  2019-06-05 19:51       ` Shakeel Butt
  0 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05 17:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Vladimir Davydov, Waiman Long

On Tue, Jun 04, 2019 at 09:35:02PM -0700, Shakeel Butt wrote:
> On Tue, Jun 4, 2019 at 7:45 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > Johannes noticed that reading the memcg kmem_cache pointer in
> > cache_from_memcg_idx() is performed using READ_ONCE() macro,
> > which doesn't implement a SMP barrier, which is required
> > by the logic.
> >
> > Add a proper smp_rmb() to be paired with smp_wmb() in
> > memcg_create_kmem_cache().
> >
> > The same applies to memcg_create_kmem_cache() itself,
> > which reads the same value without barriers and READ_ONCE().
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> 
> This seems like independent to the series. Shouldn't this be Cc'ed stable?

It is independent, but let's keep it here to avoid merge conflicts.

It has been like this for a long time and nobody has complained, so I'm
not sure we really need a stable backport. Do you have a different opinion?

Thank you!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer
  2019-06-05 17:14     ` Roman Gushchin
@ 2019-06-05 19:51       ` Shakeel Butt
  0 siblings, 0 replies; 31+ messages in thread
From: Shakeel Butt @ 2019-06-05 19:51 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Linux MM, LKML, Kernel Team, Johannes Weiner,
	Vladimir Davydov, Waiman Long

On Wed, Jun 5, 2019 at 10:14 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Jun 04, 2019 at 09:35:02PM -0700, Shakeel Butt wrote:
> > On Tue, Jun 4, 2019 at 7:45 PM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > Johannes noticed that reading the memcg kmem_cache pointer in
> > > cache_from_memcg_idx() is performed using READ_ONCE() macro,
> > > which doesn't implement a SMP barrier, which is required
> > > by the logic.
> > >
> > > Add a proper smp_rmb() to be paired with smp_wmb() in
> > > memcg_create_kmem_cache().
> > >
> > > The same applies to memcg_create_kmem_cache() itself,
> > > which reads the same value without barriers and READ_ONCE().
> > >
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> >
> > Reviewed-by: Shakeel Butt <shakeelb@google.com>
> >
> > This seems like independent to the series. Shouldn't this be Cc'ed stable?
>
> It is independent, but let's keep it here to avoid merge conflicts.
>
> It has been so for a long time, and nobody complained, so I'm not sure
> if we really need a stable backport. Do you have a different opinion?
>

Nah, it's fine as it is.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 00/10] mm: reparent slab memory on cgroup removal
  2019-06-05  4:14 ` [PATCH v6 00/10] " Andrew Morton
@ 2019-06-05 20:45   ` Roman Gushchin
  0 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05 20:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Kernel Team, Johannes Weiner,
	Shakeel Butt, Vladimir Davydov, Waiman Long

On Tue, Jun 04, 2019 at 09:14:18PM -0700, Andrew Morton wrote:
> On Tue, 4 Jun 2019 19:44:44 -0700 Roman Gushchin <guro@fb.com> wrote:
> 
> > So instead of trying to find a maybe non-existing balance, let's do reparent
> > the accounted slabs to the parent cgroup on cgroup removal.
>
> s/slabs/slab caches/.  Take more care with the terminology, please...

Slabs are effectively reparented too (most importantly, their
references), but I agree, "slab caches" fits better here.

> 
> > There is a bonus: currently we do release empty kmem_caches on cgroup
> > removal, however all other are waiting for the releasing of the memory cgroup.
> > These refactorings allow kmem_caches to be released as soon as they
> > become inactive and free.
> 
> Unclear.
> 
> s/All other/releasing of all non-empty slab caches depends upon the releasing/
> 
> I think?
> 
How about this?

There is a bonus: currently we release all memcg kmem_caches together
with the memory cgroup itself. This patchset allows individual kmem_caches
to be released as soon as they become inactive and free.

--

Sorry, my bad: I was focused on the patches and didn't give enough attention
to the cover letter. I hope to get some feedback from Vladimir, and then
post the next version with these issues fixed.

Thank you for looking into it!

Roman

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock
  2019-06-05 16:56   ` Johannes Weiner
@ 2019-06-05 22:02     ` Roman Gushchin
  2019-06-06  0:48       ` Roman Gushchin
  0 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2019-06-05 22:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Wed, Jun 05, 2019 at 12:56:16PM -0400, Johannes Weiner wrote:
> On Tue, Jun 04, 2019 at 07:44:51PM -0700, Roman Gushchin wrote:
> > Currently the memcg_params.dying flag and the corresponding
> > workqueue used for the asynchronous deactivation of kmem_caches
> > is synchronized using the slab_mutex.
> > 
> > It makes impossible to check this flag from the irq context,
> > which will be required in order to implement asynchronous release
> > of kmem_caches.
> > 
> > So let's switch over to the irq-save flavor of the spinlock-based
> > synchronization.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  mm/slab_common.c | 19 +++++++++++++++----
> >  1 file changed, 15 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 09b26673b63f..2914a8f0aa85 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -130,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
> >  #ifdef CONFIG_MEMCG_KMEM
> >  
> >  LIST_HEAD(slab_root_caches);
> > +static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
> >  
> >  void slab_init_memcg_params(struct kmem_cache *s)
> >  {
> > @@ -629,6 +630,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
> >  	struct memcg_cache_array *arr;
> >  	struct kmem_cache *s = NULL;
> >  	char *cache_name;
> > +	bool dying;
> >  	int idx;
> >  
> >  	get_online_cpus();
> > @@ -640,7 +642,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
> >  	 * The memory cgroup could have been offlined while the cache
> >  	 * creation work was pending.
> >  	 */
> > -	if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
> > +	if (memcg->kmem_state != KMEM_ONLINE)
> > +		goto out_unlock;
> > +
> > +	spin_lock_irq(&memcg_kmem_wq_lock);
> > +	dying = root_cache->memcg_params.dying;
> > +	spin_unlock_irq(&memcg_kmem_wq_lock);
> > +	if (dying)
> >  		goto out_unlock;
> 
> What does this lock protect? The dying flag could get set right after
> the unlock.
>

Hi Johannes!

Here is my logic:

1) flush_memcg_workqueue() must guarantee that no new memcg kmem_caches
will be created and that no queued works will touch the root kmem_cache,
so that it can be released
2) so it sets the dying flag, waits for an rcu grace period and flushes
the workqueue (i.e. waits for all in-flight works)
3) the dying flag is checked in kmemcg_cache_shutdown() and
kmemcg_cache_deactivate(), so that if it's set, no new works/rcu
callbacks will be queued. The corresponding queue_work()/call_rcu()
calls are all done under the memcg_kmem_wq_lock lock.
4) memcg_schedule_kmem_cache_create() doesn't check the dying flag
(probably to avoid taking locks on a hot path), but
memcg_create_kmem_cache(), which is part of the scheduled work, does.
And it does so at the very beginning, so even if new kmem_caches
are scheduled to be created, the root kmem_cache won't be touched.

Previously the flag was checked under slab_mutex, but now we set it
under the memcg_kmem_wq_lock. So I'm not sure we can read it without
taking this lock.

If the flag gets set right after the unlock, it's fine. It means that
the work has already been scheduled, and flush_workqueue() in
flush_memcg_workqueue() will wait for it. The only problem would be if
we didn't see the flag after flush_workqueue() has been called, but I
don't see how that's possible.
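
To make the ordering easier to see, here is a condensed sketch of both
sides (bodies trimmed from the patches in this series; see the actual
hunks for the full versions):

	static void flush_memcg_workqueue(struct kmem_cache *s)
	{
		spin_lock_irq(&memcg_kmem_wq_lock);
		s->memcg_params.dying = true;	/* no new works/rcu callbacks after this */
		spin_unlock_irq(&memcg_kmem_wq_lock);

		rcu_barrier();				/* wait for pending kmemcg_rcufn() callbacks */
		flush_workqueue(memcg_kmem_cache_wq);	/* ...and for already queued works */
	}

	static void kmemcg_cache_deactivate(struct kmem_cache *s)
	{
		__kmemcg_cache_deactivate(s);

		spin_lock_irq(&memcg_kmem_wq_lock);
		if (!s->memcg_params.root_cache->memcg_params.dying) {
			/* set ->work_fn and schedule the rcu callback */
			call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
		}
		spin_unlock_irq(&memcg_kmem_wq_lock);
	}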

Does it make sense? I'm sure there are ways to make it more obvious.
Please let me know if you have any ideas.

Thank you!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock
  2019-06-05 22:02     ` Roman Gushchin
@ 2019-06-06  0:48       ` Roman Gushchin
  0 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-06  0:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, linux-kernel, Kernel Team, Shakeel Butt,
	Vladimir Davydov, Waiman Long

On Wed, Jun 05, 2019 at 03:02:03PM -0700, Roman Gushchin wrote:
> On Wed, Jun 05, 2019 at 12:56:16PM -0400, Johannes Weiner wrote:
> > On Tue, Jun 04, 2019 at 07:44:51PM -0700, Roman Gushchin wrote:
> > > Currently the memcg_params.dying flag and the corresponding
> > > workqueue used for the asynchronous deactivation of kmem_caches
> > > is synchronized using the slab_mutex.
> > > 
> > > It makes impossible to check this flag from the irq context,
> > > which will be required in order to implement asynchronous release
> > > of kmem_caches.
> > > 
> > > So let's switch over to the irq-save flavor of the spinlock-based
> > > synchronization.
> > > 
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > ---
> > >  mm/slab_common.c | 19 +++++++++++++++----
> > >  1 file changed, 15 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > > index 09b26673b63f..2914a8f0aa85 100644
> > > --- a/mm/slab_common.c
> > > +++ b/mm/slab_common.c
> > > @@ -130,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
> > >  #ifdef CONFIG_MEMCG_KMEM
> > >  
> > >  LIST_HEAD(slab_root_caches);
> > > +static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
> > >  
> > >  void slab_init_memcg_params(struct kmem_cache *s)
> > >  {
> > > @@ -629,6 +630,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
> > >  	struct memcg_cache_array *arr;
> > >  	struct kmem_cache *s = NULL;
> > >  	char *cache_name;
> > > +	bool dying;
> > >  	int idx;
> > >  
> > >  	get_online_cpus();
> > > @@ -640,7 +642,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
> > >  	 * The memory cgroup could have been offlined while the cache
> > >  	 * creation work was pending.
> > >  	 */
> > > -	if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
> > > +	if (memcg->kmem_state != KMEM_ONLINE)
> > > +		goto out_unlock;
> > > +
> > > +	spin_lock_irq(&memcg_kmem_wq_lock);
> > > +	dying = root_cache->memcg_params.dying;
> > > +	spin_unlock_irq(&memcg_kmem_wq_lock);
> > > +	if (dying)
> > >  		goto out_unlock;
> > 
> > What does this lock protect? The dying flag could get set right after
> > the unlock.
> >
> 
> Hi Johannes!
> 
> Here is my logic:
> 
> 1) flush_memcg_workqueue() must guarantee that no new memcg kmem_caches
> will be created, and there are no works queued, which will touch
> the root kmem_cache, so it can be released
> 2) so it sets the dying flag, waits for an rcu grace period and flushes
> the workqueue (that means for all in-flight works)
> 3) dying flag in checked in kmemcg_cache_shutdown() and
> kmemcg_cache_deactivate(), so that if it set, no new works/rcu tasks
> will be queued. corresponding queue_work()/call_rcu() are all under
> memcg_kmem_wq_lock lock.
> 4) memcg_schedule_kmem_cache_create() doesn't check the dying flag
> (probably to avoid taking locks on a hot path), but it does
> memcg_create_kmem_cache(), which is part of the scheduled work.
> And it does it at the very beginning, so even if new kmem_caches
> are scheduled to be created, the root kmem_cache won't be touched.
> 
> Previously the flag was checked under slab_mutex, but now we set it
> under memcg_kmem_wq_lock lock. So I'm not sure we can read it without
> taking this lock.
> 
> If the flag will be set after unlock, it's fine. It means that the
> work has already been scheduled, and flush_workqueue() in
> flush_memcg_workqueue() will wait for it. The only problem is if we
> don't see the flag after flush_workqueue() is called, but I don't
> see how it's possible.
> 
> Does it makes sense? I'm sure there are ways to make it more obvious.
> Please, let me know if you've any ideas.

Hm, after some thought, I've found that the problem is that we check
the dying flag of the root cache. But it's the same in the existing code.

So currently (without my patches):
1) we do set the dying flag under slab_mutex
2) waiting for the workqueue to flush
3) grabbing the slab_mutex and going to release the root kmem_cache

a concurrent memcg_kmem_cache_create_func() can be scheduled after 2),
grab the slab_mutex after 3) and check the kmem_cache->memcg_params.dying
flag of an already released kmem_cache.

The reason it's not a real problem is that the user is expected not to
use the kmem_cache for new allocations after calling
kmem_cache_destroy(). That means no new memcg kmem_cache creation will
be scheduled, and we can avoid checking the dying flag at all.

Does this make sense?

Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer
  2019-06-05  2:44 ` [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer Roman Gushchin
  2019-06-05  4:35   ` Shakeel Butt
  2019-06-05 16:42   ` Johannes Weiner
@ 2019-06-09 12:10   ` Vladimir Davydov
  2019-06-10 20:33     ` Johannes Weiner
  2 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2019-06-09 12:10 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Tue, Jun 04, 2019 at 07:44:45PM -0700, Roman Gushchin wrote:
> Johannes noticed that reading the memcg kmem_cache pointer in
> cache_from_memcg_idx() is performed using READ_ONCE() macro,
> which doesn't implement a SMP barrier, which is required
> by the logic.
> 
> Add a proper smp_rmb() to be paired with smp_wmb() in
> memcg_create_kmem_cache().
> 
> The same applies to memcg_create_kmem_cache() itself,
> which reads the same value without barriers and READ_ONCE().
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/slab.h        | 1 +
>  mm/slab_common.c | 3 ++-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 739099af6cbb..1176b61bb8fc 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -260,6 +260,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
>  	 * memcg_caches issues a write barrier to match this (see
>  	 * memcg_create_kmem_cache()).
>  	 */
> +	smp_rmb();
>  	cachep = READ_ONCE(arr->entries[idx]);

Hmm, we used to have lockless_dereference() here, but it was replaced
with READ_ONCE some time ago. The commit message claims that READ_ONCE
has an implicit read barrier in it.

commit 506458efaf153c1ea480591c5602a5a3ba5a3b76
Author: Will Deacon <will.deacon@arm.com>
Date:   Tue Oct 24 11:22:48 2017 +0100

    locking/barriers: Convert users of lockless_dereference() to READ_ONCE()

    READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
    can be used instead of lockless_dereference() without any change in
    semantics.

    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

commit 76ebbe78f7390aee075a7f3768af197ded1bdfbb
Author: Will Deacon <will.deacon@arm.com>
Date:   Tue Oct 24 11:22:47 2017 +0100

    locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()

    In preparation for the removal of lockless_dereference(), which is the
    same as READ_ONCE() on all architectures other than Alpha, add an
    implicit smp_read_barrier_depends() to READ_ONCE() so that it can be
    used to head dependency chains on all architectures.

    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/1508840570-22169-3-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 03/10] mm: rename slab delayed deactivation functions and fields
  2019-06-05  2:44 ` [PATCH v6 03/10] mm: rename slab delayed deactivation functions and fields Roman Gushchin
@ 2019-06-09 12:13   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2019-06-09 12:13 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Tue, Jun 04, 2019 at 07:44:47PM -0700, Roman Gushchin wrote:
> The delayed work/rcu deactivation infrastructure of non-root
> kmem_caches can be also used for asynchronous release of these
> objects. Let's get rid of the word "deactivation" in corresponding
> names to make the code look better after generalization.
> 
> It's easier to make the renaming first, so that the generalized
> code will look consistent from scratch.
> 
> Let's rename struct memcg_cache_params fields:
>   deact_fn -> work_fn
>   deact_rcu_head -> rcu_head
>   deact_work -> work
> 
> And RCU/delayed work callbacks in slab common code:
>   kmemcg_deactivate_rcufn -> kmemcg_rcufn
>   kmemcg_deactivate_workfn -> kmemcg_workfn
> 
> This patch contains no functional changes, only renamings.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 04/10] mm: generalize postponed non-root kmem_cache deactivation
  2019-06-05  2:44 ` [PATCH v6 04/10] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
@ 2019-06-09 12:23   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2019-06-09 12:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Tue, Jun 04, 2019 at 07:44:48PM -0700, Roman Gushchin wrote:
> Currently SLUB uses a work scheduled after an RCU grace period
> to deactivate a non-root kmem_cache. This mechanism can be reused
> for kmem_caches release, but requires generalization for SLAB
> case.
> 
> Introduce kmemcg_cache_deactivate() function, which calls
> allocator-specific __kmem_cache_deactivate() and schedules
> execution of __kmem_cache_deactivate_after_rcu() with all
> necessary locks in a worker context after an rcu grace period.
> 
> Here is the new calling scheme:
>   kmemcg_cache_deactivate()
>     __kmemcg_cache_deactivate()                  SLAB/SLUB-specific
>     kmemcg_rcufn()                               rcu
>       kmemcg_workfn()                            work
>         __kmemcg_cache_deactivate_after_rcu()    SLAB/SLUB-specific
> 
> instead of:
>   __kmemcg_cache_deactivate()                    SLAB/SLUB-specific
>     slab_deactivate_memcg_cache_rcu_sched()      SLUB-only
>       kmemcg_rcufn()                             rcu
>         kmemcg_workfn()                          work
>           kmemcg_cache_deact_after_rcu()         SLUB-only
> 
> For consistency, all allocator-specific functions start with "__".
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Much easier to review after extracting renaming, thanks.

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 05/10] mm: introduce __memcg_kmem_uncharge_memcg()
  2019-06-05  2:44 ` [PATCH v6 05/10] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
@ 2019-06-09 12:29   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2019-06-09 12:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Tue, Jun 04, 2019 at 07:44:49PM -0700, Roman Gushchin wrote:
> Let's separate the page counter modification code out of
> __memcg_kmem_uncharge() in a way similar to what
> __memcg_kmem_charge() and __memcg_kmem_charge_memcg() work.
> 
> This will allow to reuse this code later using a new
> memcg_kmem_uncharge_memcg() wrapper, which calls
> __memcg_kmem_uncharge_memcg() if memcg_kmem_enabled()
> check is passed.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock
  2019-06-05  2:44 ` [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock Roman Gushchin
  2019-06-05 16:56   ` Johannes Weiner
@ 2019-06-09 14:31   ` Vladimir Davydov
  2019-06-10 20:46     ` Roman Gushchin
  1 sibling, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2019-06-09 14:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Tue, Jun 04, 2019 at 07:44:51PM -0700, Roman Gushchin wrote:
> Currently the memcg_params.dying flag and the corresponding
> workqueue used for the asynchronous deactivation of kmem_caches
> is synchronized using the slab_mutex.
> 
> It makes impossible to check this flag from the irq context,
> which will be required in order to implement asynchronous release
> of kmem_caches.
> 
> So let's switch over to the irq-save flavor of the spinlock-based
> synchronization.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  mm/slab_common.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 09b26673b63f..2914a8f0aa85 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -130,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
>  #ifdef CONFIG_MEMCG_KMEM
>  
>  LIST_HEAD(slab_root_caches);
> +static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
>  
>  void slab_init_memcg_params(struct kmem_cache *s)
>  {
> @@ -629,6 +630,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>  	struct memcg_cache_array *arr;
>  	struct kmem_cache *s = NULL;
>  	char *cache_name;
> +	bool dying;
>  	int idx;
>  
>  	get_online_cpus();
> @@ -640,7 +642,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>  	 * The memory cgroup could have been offlined while the cache
>  	 * creation work was pending.
>  	 */
> -	if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
> +	if (memcg->kmem_state != KMEM_ONLINE)
> +		goto out_unlock;
> +
> +	spin_lock_irq(&memcg_kmem_wq_lock);
> +	dying = root_cache->memcg_params.dying;
> +	spin_unlock_irq(&memcg_kmem_wq_lock);
> +	if (dying)
>  		goto out_unlock;

I do understand why we need to sync setting the dying flag for a kmem
cache about to be destroyed in flush_memcg_workqueue vs checking the flag
in kmemcg_cache_deactivate: this is needed so that we don't schedule a new
deactivation work after we flush RCU/workqueue. However, I don't think
it's necessary to check the dying flag here, in memcg_create_kmem_cache:
we can't schedule a new cache creation work after kmem_cache_destroy has
started, because one mustn't allocate from a dead kmem cache; since we
flush the queue before getting to actual destruction, no cache creation
work can be pending. Yeah, it might happen that a cache creation work
starts execution while flush_memcg_workqueue is in progress, but I don't
see any point in optimizing this case - after all, cache destruction is
a very cold path. Since checking the flag in memcg_create_kmem_cache
raises questions, I suggest simply dropping this check.

Anyway, it would be nice to see a comment in the code explaining why
we check the dying flag under a spinlock in kmemcg_cache_deactivate.

>  
>  	idx = memcg_cache_id(memcg);
> @@ -735,14 +743,17 @@ static void kmemcg_cache_deactivate(struct kmem_cache *s)
>  
>  	__kmemcg_cache_deactivate(s);
>  
> +	spin_lock_irq(&memcg_kmem_wq_lock);
>  	if (s->memcg_params.root_cache->memcg_params.dying)
> -		return;
> +		goto unlock;
>  
>  	/* pin memcg so that @s doesn't get destroyed in the middle */
>  	css_get(&s->memcg_params.memcg->css);
>  
>  	s->memcg_params.work_fn = __kmemcg_cache_deactivate_after_rcu;
>  	call_rcu(&s->memcg_params.rcu_head, kmemcg_rcufn);
> +unlock:
> +	spin_unlock_irq(&memcg_kmem_wq_lock);
>  }
>  
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
> @@ -852,9 +863,9 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>  
>  static void flush_memcg_workqueue(struct kmem_cache *s)
>  {
> -	mutex_lock(&slab_mutex);
> +	spin_lock_irq(&memcg_kmem_wq_lock);
>  	s->memcg_params.dying = true;
> -	mutex_unlock(&slab_mutex);
> +	spin_unlock_irq(&memcg_kmem_wq_lock);
>  
>  	/*
>  	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 08/10] mm: rework non-root kmem_cache lifecycle management
  2019-06-05  2:44 ` [PATCH v6 08/10] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
@ 2019-06-09 17:09   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2019-06-09 17:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Tue, Jun 04, 2019 at 07:44:52PM -0700, Roman Gushchin wrote:
> Currently each charged slab page holds a reference to the cgroup to
> which it's charged. Kmem_caches are held by the memcg and are released
> all together with the memory cgroup. It means that none of kmem_caches
> are released unless at least one reference to the memcg exists, which
> is very far from optimal.
> 
> Let's rework it in a way that allows releasing individual kmem_caches
> as soon as the cgroup is offline, the kmem_cache is empty and there
> are no pending allocations.
> 
> To make it possible, let's introduce a new percpu refcounter for
> non-root kmem caches. The counter is initialized to the percpu mode,
> and is switched to the atomic mode during kmem_cache deactivation. The
> counter is bumped for every charged page and also for every running
> allocation. So the kmem_cache can't be released unless all allocations
> complete.
> 
> To shutdown non-active empty kmem_caches, let's reuse the work queue,
> previously used for the kmem_cache deactivation. Once the reference
> counter reaches 0, let's schedule an asynchronous kmem_cache release.
> 
> * I used the following simple approach to test the performance
> (stolen from another patchset by T. Harding):
> 
>     time find / -name fname-no-exist
>     echo 2 > /proc/sys/vm/drop_caches
>     repeat 10 times
> 
> Results:
> 
>         orig		patched
> 
> real	0m1.455s	real	0m1.355s
> user	0m0.206s	user	0m0.219s
> sys	0m0.855s	sys	0m0.807s
> 
> real	0m1.487s	real	0m1.699s
> user	0m0.221s	user	0m0.256s
> sys	0m0.806s	sys	0m0.948s
> 
> real	0m1.515s	real	0m1.505s
> user	0m0.183s	user	0m0.215s
> sys	0m0.876s	sys	0m0.858s
> 
> real	0m1.291s	real	0m1.380s
> user	0m0.193s	user	0m0.198s
> sys	0m0.843s	sys	0m0.786s
> 
> real	0m1.364s	real	0m1.374s
> user	0m0.180s	user	0m0.182s
> sys	0m0.868s	sys	0m0.806s
> 
> real	0m1.352s	real	0m1.312s
> user	0m0.201s	user	0m0.212s
> sys	0m0.820s	sys	0m0.761s
> 
> real	0m1.302s	real	0m1.349s
> user	0m0.205s	user	0m0.203s
> sys	0m0.803s	sys	0m0.792s
> 
> real	0m1.334s	real	0m1.301s
> user	0m0.194s	user	0m0.201s
> sys	0m0.806s	sys	0m0.779s
> 
> real	0m1.426s	real	0m1.434s
> user	0m0.216s	user	0m0.181s
> sys	0m0.824s	sys	0m0.864s
> 
> real	0m1.350s	real	0m1.295s
> user	0m0.200s	user	0m0.190s
> sys	0m0.842s	sys	0m0.811s
> 
> So it looks like the difference is not noticeable in this test.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 09/10] mm: stop setting page->mem_cgroup pointer for slab pages
  2019-06-05  2:44 ` [PATCH v6 09/10] mm: stop setting page->mem_cgroup pointer for slab pages Roman Gushchin
@ 2019-06-09 17:09   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2019-06-09 17:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Tue, Jun 04, 2019 at 07:44:53PM -0700, Roman Gushchin wrote:
> Every slab page charged to a non-root memory cgroup has a pointer
> to the memory cgroup and holds a reference to it, which protects
> a non-empty memory cgroup from being released. At the same time
> the page has a pointer to the corresponding kmem_cache, and also
> hold a reference to the kmem_cache. And kmem_cache by itself
> holds a reference to the cgroup.
> 
> So there is clearly some redundancy, which allows to stop setting
> the page->mem_cgroup pointer and rely on getting memcg pointer
> indirectly via kmem_cache. Further it will allow to change this
> pointer easier, without a need to go over all charged pages.
> 
> So let's stop setting page->mem_cgroup pointer for slab pages,
> and stop using the css refcounter directly for protecting
> the memory cgroup from going away. Instead rely on kmem_cache
> as an intermediate object.
> 
> Make sure that vmstats and shrinker lists are working as previously,
> as well as /proc/kpagecgroup interface.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 10/10] mm: reparent slab memory on cgroup removal
  2019-06-05  2:44 ` [PATCH v6 10/10] mm: reparent slab memory on cgroup removal Roman Gushchin
@ 2019-06-09 17:18   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2019-06-09 17:18 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Tue, Jun 04, 2019 at 07:44:54PM -0700, Roman Gushchin wrote:
> Let's reparent memcg slab memory on memcg offlining. This allows us
> to release the memory cgroup without waiting for the last outstanding
> kernel object (e.g. dentry used by another application).
> 
> So instead of reparenting all accounted slab pages, let's do reparent
> a relatively small amount of kmem_caches. Reparenting is performed as
> a part of the deactivation process.
> 
> Since the parent cgroup is already charged, everything we need to do
> is to splice the list of kmem_caches to the parent's kmem_caches list,
> swap the memcg pointer and drop the css refcounter for each kmem_cache
> and adjust the parent's css refcounter. Quite simple.
> 
> Please, note that kmem_cache->memcg_params.memcg isn't a stable
> pointer anymore. It's safe to read it under rcu_read_lock() or
> with slab_mutex held.
> 
> We can race with the slab allocation and deallocation paths. It's not
> a big problem: parent's charge and slab global stats are always
> correct, and we don't care anymore about the child usage and global
> stats. The child cgroup is already offline, so we don't use or show it
> anywhere.
> 
> Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE)
> aren't used anywhere except count_shadow_nodes(). But even there it
> won't break anything: after reparenting "nodes" will be 0 on child
> level (because we're already reparenting shrinker lists), and on
> parent level page stats always were 0, and this patch won't change
> anything.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> ---
>  include/linux/slab.h |  4 ++--
>  mm/list_lru.c        |  8 +++++++-
>  mm/memcontrol.c      | 14 ++++++++------
>  mm/slab.h            | 23 +++++++++++++++++------
>  mm/slab_common.c     | 22 +++++++++++++++++++---
>  5 files changed, 53 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 1b54e5f83342..109cab2ad9b4 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -152,7 +152,7 @@ void kmem_cache_destroy(struct kmem_cache *);
>  int kmem_cache_shrink(struct kmem_cache *);
>  
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
> -void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> +void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *);
>  
>  /*
>   * Please use this macro to create slab caches. Simply specify the
> @@ -638,7 +638,7 @@ struct memcg_cache_params {
>  			bool dying;
>  		};
>  		struct {
> -			struct mem_cgroup *memcg;
> +			struct mem_cgroup __rcu *memcg;
>  			struct list_head children_node;
>  			struct list_head kmem_caches_node;
>  			struct percpu_ref refcnt;
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 0f1f6b06b7f3..0b2319897e86 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -77,11 +77,15 @@ list_lru_from_kmem(struct list_lru_node *nlru, void *ptr,
>  	if (!nlru->memcg_lrus)
>  		goto out;
>  
> +	rcu_read_lock();
>  	memcg = mem_cgroup_from_kmem(ptr);
> -	if (!memcg)
> +	if (!memcg) {
> +		rcu_read_unlock();
>  		goto out;
> +	}
>  
>  	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
> +	rcu_read_unlock();
>  out:
>  	if (memcg_ptr)
>  		*memcg_ptr = memcg;
> @@ -131,12 +135,14 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>  
>  	spin_lock(&nlru->lock);
>  	if (list_empty(item)) {
> +		rcu_read_lock();
>  		l = list_lru_from_kmem(nlru, item, &memcg);
>  		list_add_tail(item, &l->list);
>  		/* Set shrinker bit if the first element was added */
>  		if (!l->nr_items++)
>  			memcg_set_shrinker_bit(memcg, nid,
>  					       lru_shrinker_id(lru));
> +		rcu_read_unlock();

AFAICS we don't need rcu_read_lock here, because holding nlru->lock
guarantees that the cgroup doesn't get freed. If that's correct,
I think we'd better remove the __rcu mark and use READ_ONCE() for
accessing memcg_params.memcg, thus making it the caller's
responsibility to ensure the cgroup's lifetime.
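
A minimal sketch of that alternative, assuming the __rcu annotation is
dropped (the helper name is hypothetical):

        /*
         * Callers are responsible for keeping the memcg alive, e.g. by
         * holding nlru->lock, the slab_mutex, or an RCU read-side section
         * spanning the whole use of the returned pointer.
         */
        static inline struct mem_cgroup *memcg_from_cache(struct kmem_cache *s)
        {
                return READ_ONCE(s->memcg_params.memcg);
        }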

>  		nlru->nr_items++;
>  		spin_unlock(&nlru->lock);
>  		return true;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c097b1fc74ec..0f64a2c06803 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3209,15 +3209,15 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
>  	 */
>  	memcg->kmem_state = KMEM_ALLOCATED;
>  
> -	memcg_deactivate_kmem_caches(memcg);
> -
> -	kmemcg_id = memcg->kmemcg_id;
> -	BUG_ON(kmemcg_id < 0);
> -
>  	parent = parent_mem_cgroup(memcg);
>  	if (!parent)
>  		parent = root_mem_cgroup;
>  
> +	memcg_deactivate_kmem_caches(memcg, parent);
> +
> +	kmemcg_id = memcg->kmemcg_id;
> +	BUG_ON(kmemcg_id < 0);
> +
>  	/*
>  	 * Change kmemcg_id of this cgroup and all its descendants to the
>  	 * parent's id, and then move all entries from this cgroup's list_lrus
> @@ -3250,7 +3250,6 @@ static void memcg_free_kmem(struct mem_cgroup *memcg)
>  	if (memcg->kmem_state == KMEM_ALLOCATED) {
>  		WARN_ON(!list_empty(&memcg->kmem_caches));
>  		static_branch_dec(&memcg_kmem_enabled_key);
> -		WARN_ON(page_counter_read(&memcg->kmem));
>  	}
>  }
>  #else
> @@ -4675,6 +4674,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  
>  	/* The following stuff does not apply to the root */
>  	if (!parent) {
> +#ifdef CONFIG_MEMCG_KMEM
> +		INIT_LIST_HEAD(&memcg->kmem_caches);
> +#endif
>  		root_mem_cgroup = memcg;
>  		return &memcg->css;
>  	}
> diff --git a/mm/slab.h b/mm/slab.h
> index 7ead47cb9338..34bf92382ecd 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -268,7 +268,7 @@ static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
>  
>  	s = READ_ONCE(page->slab_cache);
>  	if (s && !is_root_cache(s))
> -		return s->memcg_params.memcg;
> +		return rcu_dereference(s->memcg_params.memcg);

I guess it's worth updating the comment with a few words about the
cgroup lifetime.
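
One possible shape for such a comment, sketched against the accessor as
changed above (wording illustrative only):

        /*
         * Expects a pointer to a slab page. Note that after reparenting
         * memcg_params.memcg can switch from the offline child to its
         * parent at any time: the returned pointer is only stable under
         * rcu_read_lock() or with the slab_mutex held, and the memcg it
         * points to may already be offline.
         */
        static inline struct mem_cgroup *memcg_from_slab_page(struct page *page)
        {
                struct kmem_cache *s;

                s = READ_ONCE(page->slab_cache);
                if (s && !is_root_cache(s))
                        return rcu_dereference(s->memcg_params.memcg);
                return NULL;
        }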

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer
  2019-06-09 12:10   ` Vladimir Davydov
@ 2019-06-10 20:33     ` Johannes Weiner
  2019-06-10 20:38       ` Roman Gushchin
  0 siblings, 1 reply; 31+ messages in thread
From: Johannes Weiner @ 2019-06-10 20:33 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Roman Gushchin, Andrew Morton, linux-mm, linux-kernel,
	kernel-team, Shakeel Butt, Waiman Long

On Sun, Jun 09, 2019 at 03:10:52PM +0300, Vladimir Davydov wrote:
> On Tue, Jun 04, 2019 at 07:44:45PM -0700, Roman Gushchin wrote:
> > Johannes noticed that reading the memcg kmem_cache pointer in
> > cache_from_memcg_idx() is performed using READ_ONCE() macro,
> > which doesn't imply an SMP barrier, although one is required
> > by the logic.
> > 
> > Add a proper smp_rmb() to be paired with smp_wmb() in
> > memcg_create_kmem_cache().
> > 
> > The same applies to memcg_create_kmem_cache() itself,
> > which reads the same value without barriers or READ_ONCE().
> > 
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  mm/slab.h        | 1 +
> >  mm/slab_common.c | 3 ++-
> >  2 files changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 739099af6cbb..1176b61bb8fc 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -260,6 +260,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
> >  	 * memcg_caches issues a write barrier to match this (see
> >  	 * memcg_create_kmem_cache()).
> >  	 */
> > +	smp_rmb();
> >  	cachep = READ_ONCE(arr->entries[idx]);
> 
> Hmm, we used to have lockless_dereference() here, but it was replaced
> with READ_ONCE some time ago. The commit message claims that READ_ONCE
> has an implicit read barrier in it.

Thanks for catching this Vladimir. I wasn't aware of this change to
the memory model. Indeed, we don't need to change anything here.
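
For reference, the publication pattern being discussed, sketched with
the names from the quoted code (not the exact kernel source):

        /* writer side, as in memcg_create_kmem_cache(): */
        /* ... fully initialize the new kmem_cache 's' ... */
        smp_wmb();              /* order initialization before publication */
        arr->entries[idx] = s;

        /*
         * reader side, as in cache_from_memcg_idx(): on kernels where
         * READ_ONCE() absorbed lockless_dereference(), the load below
         * already carries the address-dependency ordering needed to see
         * a fully initialized cache, so no extra smp_rmb() is needed.
         */
        cachep = READ_ONCE(arr->entries[idx]);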

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer
  2019-06-10 20:33     ` Johannes Weiner
@ 2019-06-10 20:38       ` Roman Gushchin
  0 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-10 20:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Vladimir Davydov, Andrew Morton, linux-mm, linux-kernel,
	Kernel Team, Shakeel Butt, Waiman Long

On Mon, Jun 10, 2019 at 04:33:44PM -0400, Johannes Weiner wrote:
> On Sun, Jun 09, 2019 at 03:10:52PM +0300, Vladimir Davydov wrote:
> > On Tue, Jun 04, 2019 at 07:44:45PM -0700, Roman Gushchin wrote:
> > > Johannes noticed that reading the memcg kmem_cache pointer in
> > > cache_from_memcg_idx() is performed using READ_ONCE() macro,
> > > which doesn't imply an SMP barrier, although one is required
> > > by the logic.
> > > 
> > > Add a proper smp_rmb() to be paired with smp_wmb() in
> > > memcg_create_kmem_cache().
> > > 
> > > The same applies to memcg_create_kmem_cache() itself,
> > > which reads the same value without barriers or READ_ONCE().
> > > 
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > ---
> > >  mm/slab.h        | 1 +
> > >  mm/slab_common.c | 3 ++-
> > >  2 files changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/slab.h b/mm/slab.h
> > > index 739099af6cbb..1176b61bb8fc 100644
> > > --- a/mm/slab.h
> > > +++ b/mm/slab.h
> > > @@ -260,6 +260,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
> > >  	 * memcg_caches issues a write barrier to match this (see
> > >  	 * memcg_create_kmem_cache()).
> > >  	 */
> > > +	smp_rmb();
> > >  	cachep = READ_ONCE(arr->entries[idx]);
> > 
> > Hmm, we used to have lockless_dereference() here, but it was replaced
> > with READ_ONCE some time ago. The commit message claims that READ_ONCE
> > has an implicit read barrier in it.
> 
> Thanks for catching this Vladimir. I wasn't aware of this change to
> the memory model. Indeed, we don't need to change anything here.

Cool, I'm dropping this patch.

Thanks!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock
  2019-06-09 14:31   ` Vladimir Davydov
@ 2019-06-10 20:46     ` Roman Gushchin
  0 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2019-06-10 20:46 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, linux-mm, linux-kernel, Kernel Team,
	Johannes Weiner, Shakeel Butt, Waiman Long

On Sun, Jun 09, 2019 at 05:31:32PM +0300, Vladimir Davydov wrote:
> On Tue, Jun 04, 2019 at 07:44:51PM -0700, Roman Gushchin wrote:
> > Currently the memcg_params.dying flag and the corresponding
> > workqueue used for the asynchronous deactivation of kmem_caches
> > are synchronized using the slab_mutex.
> > 
> > This makes it impossible to check the flag from irq context,
> > which will be required in order to implement asynchronous release
> > of kmem_caches.
> > 
> > So let's switch over to spinlock-based synchronization, using the
> > irq-safe flavor of the lock.
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> > ---
> >  mm/slab_common.c | 19 +++++++++++++++----
> >  1 file changed, 15 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index 09b26673b63f..2914a8f0aa85 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -130,6 +130,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
> >  #ifdef CONFIG_MEMCG_KMEM
> >  
> >  LIST_HEAD(slab_root_caches);
> > +static DEFINE_SPINLOCK(memcg_kmem_wq_lock);
> >  
> >  void slab_init_memcg_params(struct kmem_cache *s)
> >  {
> > @@ -629,6 +630,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
> >  	struct memcg_cache_array *arr;
> >  	struct kmem_cache *s = NULL;
> >  	char *cache_name;
> > +	bool dying;
> >  	int idx;
> >  
> >  	get_online_cpus();
> > @@ -640,7 +642,13 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
> >  	 * The memory cgroup could have been offlined while the cache
> >  	 * creation work was pending.
> >  	 */
> > -	if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying)
> > +	if (memcg->kmem_state != KMEM_ONLINE)
> > +		goto out_unlock;
> > +
> > +	spin_lock_irq(&memcg_kmem_wq_lock);
> > +	dying = root_cache->memcg_params.dying;
> > +	spin_unlock_irq(&memcg_kmem_wq_lock);
> > +	if (dying)
> >  		goto out_unlock;
> 
> I do understand why we need to sync setting dying flag for a kmem cache
> about to be destroyed in flush_memcg_workqueue vs checking the flag in
> kmemcg_cache_deactivate: this is needed so that we don't schedule a new
> deactivation work after we flush RCU/workqueue. However, I don't think
> it's necessary to check the dying flag here, in memcg_create_kmem_cache:
> we can't schedule a new cache creation work after kmem_cache_destroy has
> started, because one mustn't allocate from a dead kmem cache; since we
> flush the queue before getting to actual destruction, no cache creation
> work can be pending. Yeah, it might happen that a cache creation work
> starts execution while flush_memcg_workqueue is in progress, but I don't
> see any point in optimizing this case - after all, cache destruction is
> a very cold path. Since checking the flag in memcg_create_kmem_cache
> raises questions, I suggest simply dropping this check.

Yeah, I came to the same conclusion (in a thread with Johannes)
that this check is not required. I'll drop it in a separate patch.

> 
> Anyway, it would be nice to see some comment in the code explaining why
> we check the dying flag under a spinlock in kmemcg_cache_deactivate().

Sure, will add some.
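
One possible shape for that comment, sketched against the spinlock
introduced in this patch (field and work-item names approximate):

        spin_lock_irq(&memcg_kmem_wq_lock);
        /*
         * The dying flag is set and checked under memcg_kmem_wq_lock, so
         * once flush_memcg_workqueue() has marked the root cache dying
         * and flushed the workqueue, no new deactivation work can be
         * queued for any of its child caches.
         */
        if (!s->memcg_params.root_cache->memcg_params.dying)
                schedule_work(&s->memcg_params.work);
        spin_unlock_irq(&memcg_kmem_wq_lock);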

Btw, thank you very much for reviewing the series!

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-06-10 20:47 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-05  2:44 [PATCH v6 00/10] mm: reparent slab memory on cgroup removal Roman Gushchin
2019-06-05  2:44 ` [PATCH v6 01/10] mm: add missing smp read barrier on getting memcg kmem_cache pointer Roman Gushchin
2019-06-05  4:35   ` Shakeel Butt
2019-06-05 17:14     ` Roman Gushchin
2019-06-05 19:51       ` Shakeel Butt
2019-06-05 16:42   ` Johannes Weiner
2019-06-09 12:10   ` Vladimir Davydov
2019-06-10 20:33     ` Johannes Weiner
2019-06-10 20:38       ` Roman Gushchin
2019-06-05  2:44 ` [PATCH v6 02/10] mm: postpone kmem_cache memcg pointer initialization to memcg_link_cache() Roman Gushchin
2019-06-05  2:44 ` [PATCH v6 03/10] mm: rename slab delayed deactivation functions and fields Roman Gushchin
2019-06-09 12:13   ` Vladimir Davydov
2019-06-05  2:44 ` [PATCH v6 04/10] mm: generalize postponed non-root kmem_cache deactivation Roman Gushchin
2019-06-09 12:23   ` Vladimir Davydov
2019-06-05  2:44 ` [PATCH v6 05/10] mm: introduce __memcg_kmem_uncharge_memcg() Roman Gushchin
2019-06-09 12:29   ` Vladimir Davydov
2019-06-05  2:44 ` [PATCH v6 06/10] mm: unify SLAB and SLUB page accounting Roman Gushchin
2019-06-05  2:44 ` [PATCH v6 07/10] mm: synchronize access to kmem_cache dying flag using a spinlock Roman Gushchin
2019-06-05 16:56   ` Johannes Weiner
2019-06-05 22:02     ` Roman Gushchin
2019-06-06  0:48       ` Roman Gushchin
2019-06-09 14:31   ` Vladimir Davydov
2019-06-10 20:46     ` Roman Gushchin
2019-06-05  2:44 ` [PATCH v6 08/10] mm: rework non-root kmem_cache lifecycle management Roman Gushchin
2019-06-09 17:09   ` Vladimir Davydov
2019-06-05  2:44 ` [PATCH v6 09/10] mm: stop setting page->mem_cgroup pointer for slab pages Roman Gushchin
2019-06-09 17:09   ` Vladimir Davydov
2019-06-05  2:44 ` [PATCH v6 10/10] mm: reparent slab memory on cgroup removal Roman Gushchin
2019-06-09 17:18   ` Vladimir Davydov
2019-06-05  4:14 ` [PATCH v6 00/10] " Andrew Morton
2019-06-05 20:45   ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).