* [PATCH 0/9] Shrink the list lru size on memory cgroup removal
@ 2021-04-28  9:49 Muchun Song
  2021-04-28  9:49 ` [PATCH 1/9] mm: list_lru: fix list_lru_count_one() return value Muchun Song
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

On one of our servers, we found a suspected memory leak: the kmalloc-32
slab cache consumes more than 6GB of memory, while other kmem_caches
consume less than 2GB.

After in-depth analysis, we found that the memory consumption of the
kmalloc-32 slab cache is caused by list_lru_one allocations.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumption of each
list_lru can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount registers two list lrus, one for inodes and another for
dentries. There are 952 super_blocks, so the total memory is 952 * 2 * 3
MB (~5.6GB). But the number of memory cgroups is less than 500, so I
guess more than 12286 containers have been deployed on this machine (I
do not know why there are so many containers; it may be a user's bug,
or the user really wants to do that). Now there are fewer than 500
containers in the system, yet memcg_nr_cache_ids has not been reduced
to a suitable value, which wastes a lot of memory. If we want to reduce
memcg_nr_cache_ids, we have to reboot the server, which is not what we
want. A worked version of this arithmetic is sketched below.
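
For illustration, here is a minimal user-space sketch of the arithmetic
above (the numbers are the values observed on this machine; this is
plain C, not kernel code):

  #include <stdio.h>

  int main(void)
  {
          long nodes = 4;        /* NUMA nodes on this machine */
          long nr_ids = 24574;   /* memcg_nr_cache_ids from the crash session */
          long obj_size = 32;    /* one kmalloc-32 object per id */
          long per_lru = nodes * nr_ids * obj_size;
          double total = 952.0 * 2 * per_lru; /* 952 supers, 2 lrus each */

          printf("per list_lru: ~%.1f MB\n", per_lru / (double)(1 << 20));
          printf("total:        ~%.1f GB\n", total / (double)(1 << 30));
          return 0;
  }

This prints ~3.0 MB per list_lru and ~5.6 GB in total, matching the
estimate above.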

So this patchset dynamically adjusts the value of memcg_nr_cache_ids
to keep memory consumption at a healthy level. With it, we may be able
to restore a healthy environment even after users have created and then
destroyed tens of thousands of memory cgroups. This patchset also
contains some code simplifications.

Muchun Song (9):
  mm: list_lru: fix list_lru_count_one() return value
  mm: memcontrol: remove kmemcg_id reparenting
  mm: list_lru: rename memcg_drain_all_list_lrus to
    memcg_reparent_list_lrus
  mm: memcontrol: remove the kmem states
  mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online()
  mm: list_lru: support for shrinking list lru
  ida: introduce ida_max() to return the maximum allocated ID
  mm: memcontrol: shrink the list lru size
  mm: memcontrol: rename memcg_{get,put}_cache_ids to
    memcg_list_lru_resize_{lock,unlock}

 include/linux/idr.h        |   1 +
 include/linux/list_lru.h   |   2 +-
 include/linux/memcontrol.h |  15 ++----
 lib/idr.c                  |  40 +++++++++++++++
 mm/list_lru.c              |  89 +++++++++++++++++++++++++--------
 mm/memcontrol.c            | 121 +++++++++++++++++++++++++--------------------
 6 files changed, 183 insertions(+), 85 deletions(-)

-- 
2.11.0


* [PATCH 1/9] mm: list_lru: fix list_lru_count_one() return value
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-28  9:49 ` [PATCH 2/9] mm: memcontrol: remove kmemcg_id reparenting Muchun Song
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

Since commit 2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id
on css offline"), ->nr_items can be negative during memory cgroup
reparenting. In that case, list_lru_count_one() can return an unusually
large value, because the negative count is converted to an unsigned
long. To avoid surprising callers, return zero when ->nr_items is
negative. A small stand-alone demonstration of the conversion follows
the patch.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/list_lru.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index cd58790d0fb3..4962d48d4410 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -176,13 +176,16 @@ unsigned long list_lru_count_one(struct list_lru *lru,
 {
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
-	unsigned long count;
+	long count;
 
 	rcu_read_lock();
 	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
 	count = READ_ONCE(l->nr_items);
 	rcu_read_unlock();
 
+	if (unlikely(count < 0))
+		count = 0;
+
 	return count;
 }
 EXPORT_SYMBOL_GPL(list_lru_count_one);
-- 
2.11.0
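
A small stand-alone demonstration of the conversion the fix avoids
(plain C, not kernel code):

  #include <stdio.h>

  int main(void)
  {
          long nr_items = -1;             /* transient state while reparenting */
          unsigned long count = nr_items; /* what the old code returned */

          printf("old: %lu\n", count);    /* 18446744073709551615 on 64-bit */
          printf("new: %lu\n", nr_items < 0 ? 0UL : (unsigned long)nr_items);
          return 0;
  }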


* [PATCH 2/9] mm: memcontrol: remove kmemcg_id reparenting
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
  2021-04-28  9:49 ` [PATCH 1/9] mm: list_lru: fix list_lru_count_one() return value Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-28  9:49 ` [PATCH 3/9] mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus Muchun Song
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

Since slab objects and kmem pages are charged to the object cgroup
instead of the memory cgroup, memcg_reparent_objcgs() reparents the
objcgs of this cgroup and all its descendants to the parent cgroup.
This means that mem_cgroup_from_obj(), which is called from
list_lru_from_kmem(), can return the new parent memory cgroup, so
further list_lru_add() calls add elements to the parent's list.
Therefore we do not need to change the kmemcg_id of an offline cgroup
to its parent's id; that just wastes CPU cycles. Remove the redundant
code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ada9e650a5..21e12312509c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3485,8 +3485,7 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 
 static void memcg_offline_kmem(struct mem_cgroup *memcg)
 {
-	struct cgroup_subsys_state *css;
-	struct mem_cgroup *parent, *child;
+	struct mem_cgroup *parent;
 	int kmemcg_id;
 
 	if (memcg->kmem_state != KMEM_ONLINE)
@@ -3503,22 +3502,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	kmemcg_id = memcg->kmemcg_id;
 	BUG_ON(kmemcg_id < 0);
 
-	/*
-	 * Change kmemcg_id of this cgroup and all its descendants to the
-	 * parent's id, and then move all entries from this cgroup's list_lrus
-	 * to ones of the parent. After we have finished, all list_lrus
-	 * corresponding to this cgroup are guaranteed to remain empty. The
-	 * ordering is imposed by list_lru_node->lock taken by
-	 * memcg_drain_all_list_lrus().
-	 */
-	rcu_read_lock(); /* can be called from css_free w/o cgroup_mutex */
-	css_for_each_descendant_pre(css, &memcg->css) {
-		child = mem_cgroup_from_css(css);
-		BUG_ON(child->kmemcg_id != kmemcg_id);
-		child->kmemcg_id = parent->kmemcg_id;
-	}
-	rcu_read_unlock();
-
+	/* memcg_reparent_objcgs() must be called before this. */
 	memcg_drain_all_list_lrus(kmemcg_id, parent);
 
 	memcg_free_cache_id(kmemcg_id);
-- 
2.11.0


* [PATCH 3/9] mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
  2021-04-28  9:49 ` [PATCH 1/9] mm: list_lru: fix list_lru_count_one() return value Muchun Song
  2021-04-28  9:49 ` [PATCH 2/9] mm: memcontrol: remove kmemcg_id reparenting Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-28  9:49 ` [PATCH 4/9] mm: memcontrol: remove the kmem states Muchun Song
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

Since we no longer change memcg->kmemcg_id before calling
memcg_drain_all_list_lrus(), we do not need to pass the kmemcg_id as a
parameter. The two parameters of memcg_drain_all_list_lrus() also look
odd: one is a kmemcg_id, the other a memcg. Passing the memcg instead
of the kmemcg_id is more consistent. And since the purpose of
memcg_drain_all_list_lrus() is list_lru reparenting, rename it to
memcg_reparent_list_lrus(), which matches memcg_reparent_objcgs().

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/list_lru.h |  2 +-
 mm/list_lru.c            | 23 ++++++++++++-----------
 mm/memcontrol.c          |  2 +-
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 9dcaa3e582c9..e8a5e3a2c0dd 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -70,7 +70,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
 	__list_lru_init((lru), true, NULL, shrinker)
 
 int memcg_update_all_list_lrus(int num_memcgs);
-void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg);
+void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent);
 
 /**
  * list_lru_add: add an element to the lru list's tail
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 4962d48d4410..d78dba5a6dab 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -523,11 +523,11 @@ int memcg_update_all_list_lrus(int new_size)
 	goto out;
 }
 
-static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
-				      int src_idx, struct mem_cgroup *dst_memcg)
+static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
+					 struct mem_cgroup *memcg,
+					 struct mem_cgroup *parent)
 {
 	struct list_lru_node *nlru = &lru->node[nid];
-	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
 
 	/*
@@ -536,22 +536,23 @@ static void memcg_drain_list_lru_node(struct list_lru *lru, int nid,
 	 */
 	spin_lock_irq(&nlru->lock);
 
-	src = list_lru_from_memcg_idx(nlru, src_idx);
-	dst = list_lru_from_memcg_idx(nlru, dst_idx);
+	src = list_lru_from_memcg_idx(nlru, memcg->kmemcg_id);
+	dst = list_lru_from_memcg_idx(nlru, parent->kmemcg_id);
 
 	list_splice_init(&src->list, &dst->list);
 
 	if (src->nr_items) {
 		dst->nr_items += src->nr_items;
-		set_shrinker_bit(dst_memcg, nid, lru_shrinker_id(lru));
+		set_shrinker_bit(parent, nid, lru_shrinker_id(lru));
 		src->nr_items = 0;
 	}
 
 	spin_unlock_irq(&nlru->lock);
 }
 
-static void memcg_drain_list_lru(struct list_lru *lru,
-				 int src_idx, struct mem_cgroup *dst_memcg)
+static void memcg_reparent_list_lru(struct list_lru *lru,
+				    struct mem_cgroup *memcg,
+				    struct mem_cgroup *parent)
 {
 	int i;
 
@@ -559,16 +560,16 @@ static void memcg_drain_list_lru(struct list_lru *lru,
 		return;
 
 	for_each_node(i)
-		memcg_drain_list_lru_node(lru, i, src_idx, dst_memcg);
+		memcg_reparent_list_lru_node(lru, i, memcg, parent);
 }
 
-void memcg_drain_all_list_lrus(int src_idx, struct mem_cgroup *dst_memcg)
+void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *parent)
 {
 	struct list_lru *lru;
 
 	mutex_lock(&list_lrus_mutex);
 	list_for_each_entry(lru, &list_lrus, list)
-		memcg_drain_list_lru(lru, src_idx, dst_memcg);
+		memcg_reparent_list_lru(lru, memcg, parent);
 	mutex_unlock(&list_lrus_mutex);
 }
 #else
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21e12312509c..c1ce4fdba028 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3503,7 +3503,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	BUG_ON(kmemcg_id < 0);
 
 	/* memcg_reparent_objcgs() must be called before this. */
-	memcg_drain_all_list_lrus(kmemcg_id, parent);
+	memcg_reparent_list_lrus(memcg, parent);
 
 	memcg_free_cache_id(kmemcg_id);
 }
-- 
2.11.0


* [PATCH 4/9] mm: memcontrol: remove the kmem states
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
                   ` (2 preceding siblings ...)
  2021-04-28  9:49 ` [PATCH 3/9] mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-28  9:49 ` [PATCH 5/9] mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online() Muchun Song
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

Now the kmem state is only used to indicate whether kmem is offline.
But we can use ->kmemcg_id for the same purpose. So remove the kmem
states to simplify the code.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/memcontrol.h |  7 -------
 mm/memcontrol.c            | 10 ++--------
 2 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c193be760709..6350c563c7b8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -182,12 +182,6 @@ struct mem_cgroup_thresholds {
 	struct mem_cgroup_threshold_ary *spare;
 };
 
-enum memcg_kmem_state {
-	KMEM_NONE,
-	KMEM_ALLOCATED,
-	KMEM_ONLINE,
-};
-
 #if defined(CONFIG_SMP)
 struct memcg_padding {
 	char x[0];
@@ -320,7 +314,6 @@ struct mem_cgroup {
 
 #ifdef CONFIG_MEMCG_KMEM
 	int kmemcg_id;
-	enum memcg_kmem_state kmem_state;
 	struct obj_cgroup __rcu *objcg;
 	struct list_head objcg_list; /* list of inherited objcgs */
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c1ce4fdba028..9b9a5368a3e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3461,7 +3461,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 		return 0;
 
 	BUG_ON(memcg->kmemcg_id >= 0);
-	BUG_ON(memcg->kmem_state);
 
 	memcg_id = memcg_alloc_cache_id();
 	if (memcg_id < 0)
@@ -3478,7 +3477,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	static_branch_enable(&memcg_kmem_enabled_key);
 
 	memcg->kmemcg_id = memcg_id;
-	memcg->kmem_state = KMEM_ONLINE;
 
 	return 0;
 }
@@ -3488,11 +3486,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	struct mem_cgroup *parent;
 	int kmemcg_id;
 
-	if (memcg->kmem_state != KMEM_ONLINE)
-		return;
-
-	memcg->kmem_state = KMEM_ALLOCATED;
-
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
@@ -3506,12 +3499,13 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	memcg_reparent_list_lrus(memcg, parent);
 
 	memcg_free_cache_id(kmemcg_id);
+	memcg->kmemcg_id = -1;
 }
 
 static void memcg_free_kmem(struct mem_cgroup *memcg)
 {
 	/* css_alloc() failed, offlining didn't happen */
-	if (unlikely(memcg->kmem_state == KMEM_ONLINE))
+	if (unlikely(memcg->kmemcg_id != -1))
 		memcg_offline_kmem(memcg);
 }
 #else
-- 
2.11.0


* [PATCH 5/9] mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online()
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
                   ` (3 preceding siblings ...)
  2021-04-28  9:49 ` [PATCH 4/9] mm: memcontrol: remove the kmem states Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-28  9:49 ` [PATCH 6/9] mm: list_lru: support for shrinking list lru Muchun Song
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

Move memcg_online_kmem() to mem_cgroup_css_online() to simplify the
code. With this change, memcg_free_kmem() can be removed.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 41 ++++++++++++++++-------------------------
 1 file changed, 16 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b9a5368a3e9..1610d501e7b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3460,6 +3460,9 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	if (cgroup_memory_nokmem)
 		return 0;
 
+	if (mem_cgroup_is_root(memcg))
+		return 0;
+
 	BUG_ON(memcg->kmemcg_id >= 0);
 
 	memcg_id = memcg_alloc_cache_id();
@@ -3486,6 +3489,9 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	struct mem_cgroup *parent;
 	int kmemcg_id;
 
+	if (mem_cgroup_is_root(memcg))
+		return;
+
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
@@ -3499,14 +3505,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 	memcg_reparent_list_lrus(memcg, parent);
 
 	memcg_free_cache_id(kmemcg_id);
-	memcg->kmemcg_id = -1;
-}
-
-static void memcg_free_kmem(struct mem_cgroup *memcg)
-{
-	/* css_alloc() failed, offlining didn't happen */
-	if (unlikely(memcg->kmemcg_id != -1))
-		memcg_offline_kmem(memcg);
 }
 #else
 static int memcg_online_kmem(struct mem_cgroup *memcg)
@@ -3516,9 +3514,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 static void memcg_offline_kmem(struct mem_cgroup *memcg)
 {
 }
-static void memcg_free_kmem(struct mem_cgroup *memcg)
-{
-}
 #endif /* CONFIG_MEMCG_KMEM */
 
 static int memcg_update_kmem_max(struct mem_cgroup *memcg,
@@ -5047,7 +5042,6 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct mem_cgroup *parent = mem_cgroup_from_css(parent_css);
 	struct mem_cgroup *memcg, *old_memcg;
-	long error = -ENOMEM;
 
 	old_memcg = set_active_memcg(parent);
 	memcg = mem_cgroup_alloc();
@@ -5077,38 +5071,36 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	}
 
 	/* The following stuff does not apply to the root */
-	error = memcg_online_kmem(memcg);
-	if (error)
-		goto fail;
-
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
 		static_branch_inc(&memcg_sockets_enabled_key);
 
 	return &memcg->css;
-fail:
-	mem_cgroup_id_remove(memcg);
-	mem_cgroup_free(memcg);
-	return ERR_PTR(error);
 }
 
 static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	if (memcg_online_kmem(memcg))
+		goto remove_id;
+
 	/*
 	 * A memcg must be visible for expand_shrinker_info()
 	 * by the time the maps are allocated. So, we allocate maps
 	 * here, when for_each_mem_cgroup() can't skip it.
 	 */
-	if (alloc_shrinker_info(memcg)) {
-		mem_cgroup_id_remove(memcg);
-		return -ENOMEM;
-	}
+	if (alloc_shrinker_info(memcg))
+		goto offline_kmem;
 
 	/* Online state pins memcg ID, memcg ID pins CSS */
 	refcount_set(&memcg->id.ref, 1);
 	css_get(css);
 	return 0;
+offline_kmem:
+	memcg_offline_kmem(memcg);
+remove_id:
+	mem_cgroup_id_remove(memcg);
+	return -ENOMEM;
 }
 
 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
@@ -5166,7 +5158,6 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
 	free_shrinker_info(memcg);
-	memcg_free_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
 
-- 
2.11.0


* [PATCH 6/9] mm: list_lru: support for shrinking list lru
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
                   ` (4 preceding siblings ...)
  2021-04-28  9:49 ` [PATCH 5/9] mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online() Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-28  9:49 ` [PATCH 7/9] ida: introduce ida_max() to return the maximum allocated ID Muchun Song
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

Now memcg_update_all_list_lrus() can only increase the size of all the
list lrus. This patch adds the ability to shrink them as well, which
helps save memory when the size needs to be reduced. A condensed sketch
of the resize pattern follows the patch.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/list_lru.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 49 insertions(+), 4 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index d78dba5a6dab..3ee5239922c9 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -383,13 +383,11 @@ static void memcg_destroy_list_lru_node(struct list_lru_node *nlru)
 	kvfree(memcg_lrus);
 }
 
-static int memcg_update_list_lru_node(struct list_lru_node *nlru,
-				      int old_size, int new_size)
+static int memcg_list_lru_node_inc(struct list_lru_node *nlru,
+				   int old_size, int new_size)
 {
 	struct list_lru_memcg *old, *new;
 
-	BUG_ON(old_size > new_size);
-
 	old = rcu_dereference_protected(nlru->memcg_lrus,
 					lockdep_is_held(&list_lrus_mutex));
 	new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
@@ -418,11 +416,58 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
 	return 0;
 }
 
+/* This function always returns 0. */
+static int memcg_list_lru_node_dec(struct list_lru_node *nlru,
+				   int old_size, int new_size)
+{
+	struct list_lru_memcg *old, *new;
+
+	old = rcu_dereference_protected(nlru->memcg_lrus,
+					lockdep_is_held(&list_lrus_mutex));
+	__memcg_destroy_list_lru_node(old, new_size, old_size);
+
+	/* Reuse the old array if the allocation fails here. */
+	new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
+	if (!new)
+		return 0;
+
+	memcpy(&new->lru, &old->lru, new_size * sizeof(void *));
+
+	/*
+	 * The locking below allows readers that hold nlru->lock to avoid
+	 * taking rcu_read_lock (see list_lru_from_memcg_idx).
+	 *
+	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
+	 * we have to use IRQ-safe primitives here to avoid deadlock.
+	 */
+	spin_lock_irq(&nlru->lock);
+	rcu_assign_pointer(nlru->memcg_lrus, new);
+	spin_unlock_irq(&nlru->lock);
+
+	kvfree_rcu(old, rcu);
+	return 0;
+}
+
+static int memcg_update_list_lru_node(struct list_lru_node *nlru,
+				      int old_size, int new_size)
+{
+	if (new_size > old_size)
+		return memcg_list_lru_node_inc(nlru, old_size, new_size);
+	else if (new_size < old_size)
+		return memcg_list_lru_node_dec(nlru, old_size, new_size);
+
+	return 0;
+}
+
 static void memcg_cancel_update_list_lru_node(struct list_lru_node *nlru,
 					      int old_size, int new_size)
 {
 	struct list_lru_memcg *memcg_lrus;
 
+	/* Nothing to do for the shrinking case. */
+	if (old_size >= new_size)
+		return;
+
 	memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus,
 					       lockdep_is_held(&list_lrus_mutex));
 	/* do not bother shrinking the array back to the old size, because we
-- 
2.11.0
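
The shrink path above follows a common RCU array-resize pattern. Below
is a condensed, generic sketch of the same idea; the structure and
function names are illustrative, not the kernel's:

  #include <linux/mm.h>
  #include <linux/overflow.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>
  #include <linux/spinlock.h>
  #include <linux/string.h>

  struct parray {
          struct rcu_head rcu;
          void *slot[];
  };

  /* Caller holds the resize mutex; lockless readers hold 'lock' or RCU. */
  static void shrink_parray(struct parray __rcu **pp, spinlock_t *lock,
                            int old_size, int new_size)
  {
          struct parray *old, *new;

          old = rcu_dereference_protected(*pp, true);
          /* Per-slot objects beyond new_size must be freed first (not shown). */

          new = kvmalloc(struct_size(new, slot, new_size), GFP_KERNEL);
          if (!new)
                  return; /* keep the oversized array: it is still safe to use */

          memcpy(new->slot, old->slot, new_size * sizeof(void *));

          /* Publish under the IRQ-safe lock used by lockless readers. */
          spin_lock_irq(lock);
          rcu_assign_pointer(*pp, new);
          spin_unlock_irq(lock);

          kvfree_rcu(old, rcu); /* free the old array after a grace period */
  }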


* [PATCH 7/9] ida: introduce ida_max() to return the maximum allocated ID
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
                   ` (5 preceding siblings ...)
  2021-04-28  9:49 ` [PATCH 6/9] mm: list_lru: support for shrinking list lru Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-29  6:47   ` Christoph Hellwig
  2021-04-28  9:49 ` [PATCH 8/9] mm: memcontrol: shrink the list lru size Muchun Song
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

Introduce ida_max() to return the maximum allocated ID. This will be
used by the memory cgroup code in a later patch. A usage sketch follows
the patch.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/idr.h |  1 +
 lib/idr.c           | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index a0dce14090a9..c3968a6348d1 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -255,6 +255,7 @@ struct ida {
 int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
 void ida_free(struct ida *, unsigned int id);
 void ida_destroy(struct ida *ida);
+int ida_max(struct ida *ida);
 
 /**
  * ida_alloc() - Allocate an unused ID.
diff --git a/lib/idr.c b/lib/idr.c
index f4ab4f4aa3c7..bcfcaae89aa7 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -553,6 +553,46 @@ void ida_destroy(struct ida *ida)
 }
 EXPORT_SYMBOL(ida_destroy);
 
+/**
+ * ida_max() - Return the maximum allocated ID.
+ * @ida: IDA handle.
+ *
+ * Context: Any context. It is safe to call this function without
+ * locking in your code.
+ *
+ * Return: The maximum allocated ID, or %-ENOSPC if the @ida is empty.
+ */
+int ida_max(struct ida *ida)
+{
+	XA_STATE(xas, &ida->xa, 0);
+	struct ida_bitmap *curr, *prev;
+	unsigned long flags;
+	unsigned int bit, index;
+
+	xas_lock_irqsave(&xas, flags);
+	if (ida_is_empty(ida)) {
+		xas_unlock_irqrestore(&xas, flags);
+		return -ENOSPC;
+	}
+
+	xas_for_each(&xas, curr, ULONG_MAX) {
+		prev = curr;
+		index = xas.xa_index;
+	}
+
+	if (xa_is_value(prev)) {
+		unsigned long val = xa_to_value(prev);
+
+		bit = find_last_bit(&val, BITS_PER_XA_VALUE);
+	} else {
+		bit = find_last_bit(prev->bitmap, IDA_BITMAP_BITS);
+	}
+
+	xas_unlock_irqrestore(&xas, flags);
+
+	return index * IDA_BITMAP_BITS + bit;
+}
+
 #ifndef __KERNEL__
 extern void xa_dump_index(unsigned long index, unsigned int shift);
 #define IDA_CHUNK_SHIFT		ilog2(IDA_BITMAP_BITS)
-- 
2.11.0
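
A minimal usage sketch for the new ida_max() (kernel context; it
assumes this patch is applied, and test_ida/ida_max_demo are
illustrative names):

  #include <linux/bug.h>
  #include <linux/errno.h>
  #include <linux/gfp.h>
  #include <linux/idr.h>

  static DEFINE_IDA(test_ida);

  static void ida_max_demo(void)
  {
          int a = ida_alloc(&test_ida, GFP_KERNEL); /* 0 on an empty IDA */
          int b = ida_alloc(&test_ida, GFP_KERNEL); /* then 1 */

          WARN_ON(ida_max(&test_ida) != b); /* highest allocated ID */

          ida_free(&test_ida, b);
          WARN_ON(ida_max(&test_ida) != a); /* drops back after the free */

          ida_free(&test_ida, a);
          WARN_ON(ida_max(&test_ida) != -ENOSPC); /* empty IDA again */
  }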


* [PATCH 8/9] mm: memcontrol: shrink the list lru size
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
                   ` (6 preceding siblings ...)
  2021-04-28  9:49 ` [PATCH 7/9] ida: introduce ida_max() to return the maximum allocated ID Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-28  9:49 ` [PATCH 9/9] mm: memcontrol: rename memcg_{get,put}_cache_ids to memcg_list_lru_resize_{lock,unlock} Muchun Song
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

On one of our servers, we found a suspected memory leak: the kmalloc-32
slab cache consumes more than 6GB of memory, while other kmem_caches
consume less than 2GB.

After in-depth analysis, we found that the memory consumption of the
kmalloc-32 slab cache is caused by list_lru_one allocations.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumption of each
list_lru can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount registers two list lrus, one for inodes and another for
dentries. There are 952 super_blocks, so the total memory is 952 * 2 * 3
MB (~5.6GB). But the number of memory cgroups is less than 500, so I
guess more than 12286 containers have been deployed on this machine (I
do not know why there are so many containers; it may be a user's bug,
or the user really wants to do that). Now there are fewer than 500
containers in the system, yet memcg_nr_cache_ids has not been reduced
to a suitable value, which wastes a lot of memory. If we want to reduce
memcg_nr_cache_ids, we have to reboot the server, which is not what we
want. So this patch dynamically adjusts the value of memcg_nr_cache_ids
to keep memory consumption at a healthy level. With it, we may be able
to restore a healthy environment even after users have created tens of
thousands of memory cgroups.

This patch also adjusts the calculation of memcg_nr_cache_ids from
"size = 2 * (id + 1)" to "size = 2 * id" in memcg_alloc_cache_id(),
because this makes things simpler when shrinking the list lru size. A
user-space sketch of the shrink hysteresis follows the patch.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memcontrol.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 49 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1610d501e7b5..f8cdd87cf693 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -362,6 +362,8 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
 static DEFINE_IDA(memcg_cache_ida);
 int memcg_nr_cache_ids;
 
+static int kmemcg_max_id;
+
 /* Protects memcg_nr_cache_ids */
 static DECLARE_RWSEM(memcg_cache_ids_sem);
 
@@ -2856,8 +2858,11 @@ static int memcg_alloc_cache_id(void)
 	if (id < 0)
 		return id;
 
-	if (id < memcg_nr_cache_ids)
+	if (id < memcg_nr_cache_ids) {
+		if (id > kmemcg_max_id)
+			kmemcg_max_id = id;
 		return id;
+	}
 
 	/*
 	 * There's no space for the new id in memcg_caches arrays,
@@ -2865,15 +2870,17 @@ static int memcg_alloc_cache_id(void)
 	 */
 	down_write(&memcg_cache_ids_sem);
 
-	size = 2 * (id + 1);
+	size = 2 * id;
 	if (size < MEMCG_CACHES_MIN_SIZE)
 		size = MEMCG_CACHES_MIN_SIZE;
 	else if (size > MEMCG_CACHES_MAX_SIZE)
 		size = MEMCG_CACHES_MAX_SIZE;
 
 	err = memcg_update_all_list_lrus(size);
-	if (!err)
+	if (!err) {
 		memcg_nr_cache_ids = size;
+		kmemcg_max_id = id;
+	}
 
 	up_write(&memcg_cache_ids_sem);
 
@@ -2884,9 +2891,48 @@ static int memcg_alloc_cache_id(void)
 	return id;
 }
 
+static inline int nearest_fit_id(int id)
+{
+	if (unlikely(id < MEMCG_CACHES_MIN_SIZE))
+		return MEMCG_CACHES_MIN_SIZE;
+
+	return 1 << (__fls(id) + 1);
+}
+
+/*
+ * memcg_alloc_cache_id() and memcg_free_cache_id() are serialized by
+ * cgroup_mutex. So there is no race on kmemcg_max_id.
+ */
 static void memcg_free_cache_id(int id)
 {
 	ida_simple_remove(&memcg_cache_ida, id);
+
+	if (kmemcg_max_id == id) {
+		/*
+		 * In order to avoid @memcg_nr_cache_ids bouncing between
+		 * @memcg_nr_cache_ids / 2 and @memcg_nr_cache_ids, we only
+		 * shrink the list lru size when @kmemcg_max_id is smaller
+		 * than @memcg_nr_cache_ids / 3.
+		 */
+		int size = memcg_nr_cache_ids / 3;
+
+		kmemcg_max_id = ida_max(&memcg_cache_ida);
+		if (kmemcg_max_id < size) {
+			/*
+			 * Find the first value greater than @kmemcg_max_id
+			 * which can fit our need. And shrink the list lru
+			 * to this size.
+			 */
+			size = nearest_fit_id(kmemcg_max_id);
+
+			down_write(&memcg_cache_ids_sem);
+			if (size != memcg_nr_cache_ids) {
+				memcg_update_all_list_lrus(size);
+				memcg_nr_cache_ids = size;
+			}
+			up_write(&memcg_cache_ids_sem);
+		}
+	}
 }
 
 /*
-- 
2.11.0
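
A user-space sketch of the shrink hysteresis above, with illustrative
numbers (MIN_SIZE stands in for MEMCG_CACHES_MIN_SIZE; this is plain C,
not kernel code):

  #include <stdio.h>

  #define MIN_SIZE 4 /* stand-in for MEMCG_CACHES_MIN_SIZE */

  static int fls_of(unsigned int x) /* highest set bit, like __fls() */
  {
          int bit = -1;

          while (x) {
                  bit++;
                  x >>= 1;
          }
          return bit;
  }

  static int nearest_fit_id(int id)
  {
          if (id < MIN_SIZE)
                  return MIN_SIZE;
          return 1 << (fls_of(id) + 1);
  }

  int main(void)
  {
          int nr_cache_ids = 24576; /* illustrative current array size */
          int max_id = 100;         /* highest ID still allocated */

          /* Shrink only when the max ID falls below a third of the size,
           * so the size does not bounce between size/2 and size. */
          if (max_id < nr_cache_ids / 3)
                  nr_cache_ids = nearest_fit_id(max_id); /* -> 128 */
          printf("new size: %d\n", nr_cache_ids);
          return 0;
  }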


* [PATCH 9/9] mm: memcontrol: rename memcg_{get,put}_cache_ids to memcg_list_lru_resize_{lock,unlock}
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
                   ` (7 preceding siblings ...)
  2021-04-28  9:49 ` [PATCH 8/9] mm: memcontrol: shrink the list lru size Muchun Song
@ 2021-04-28  9:49 ` Muchun Song
  2021-04-28 23:32   ` Shakeel Butt
  2021-04-30  0:49 ` Dave Chinner
  10 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-28  9:49 UTC (permalink / raw)
  To: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang
  Cc: linux-fsdevel, linux-kernel, linux-mm, Muchun Song

The rwsem is held for writing during list lru array relocation and
memcg_nr_cache_ids updates. Therefore taking memcg_get_cache_ids()
implies that memcg_nr_cache_ids cannot be updated: it acts as a lock
primitive, not a reference count. So rename it to a more suitable name.
A condensed sketch of the locking protocol follows the patch.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/memcontrol.h | 8 ++++----
 mm/list_lru.c              | 8 ++++----
 mm/memcontrol.c            | 4 ++--
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6350c563c7b8..e8ba6ee1b369 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1635,8 +1635,8 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
 extern struct static_key_false memcg_kmem_enabled_key;
 
 extern int memcg_nr_cache_ids;
-void memcg_get_cache_ids(void);
-void memcg_put_cache_ids(void);
+void memcg_list_lru_resize_lock(void);
+void memcg_list_lru_resize_unlock(void);
 
 /*
  * Helper macro to loop through all memcg-specific caches. Callers must still
@@ -1711,11 +1711,11 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return -1;
 }
 
-static inline void memcg_get_cache_ids(void)
+static inline void memcg_list_lru_resize_lock(void)
 {
 }
 
-static inline void memcg_put_cache_ids(void)
+static inline void memcg_list_lru_resize_unlock(void)
 {
 }
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 3ee5239922c9..e0ba0641b4e1 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -640,7 +640,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
 	else
 		lru->shrinker_id = -1;
 #endif
-	memcg_get_cache_ids();
+	memcg_list_lru_resize_lock();
 
 	lru->node = kcalloc(nr_node_ids, sizeof(*lru->node), GFP_KERNEL);
 	if (!lru->node)
@@ -663,7 +663,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
 
 	list_lru_register(lru);
 out:
-	memcg_put_cache_ids();
+	memcg_list_lru_resize_unlock();
 	return err;
 }
 EXPORT_SYMBOL_GPL(__list_lru_init);
@@ -674,7 +674,7 @@ void list_lru_destroy(struct list_lru *lru)
 	if (!lru->node)
 		return;
 
-	memcg_get_cache_ids();
+	memcg_list_lru_resize_lock();
 
 	list_lru_unregister(lru);
 
@@ -685,6 +685,6 @@ void list_lru_destroy(struct list_lru *lru)
 #ifdef CONFIG_MEMCG_KMEM
 	lru->shrinker_id = -1;
 #endif
-	memcg_put_cache_ids();
+	memcg_list_lru_resize_unlock();
 }
 EXPORT_SYMBOL_GPL(list_lru_destroy);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f8cdd87cf693..437465611845 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -367,12 +367,12 @@ static int kmemcg_max_id;
 /* Protects memcg_nr_cache_ids */
 static DECLARE_RWSEM(memcg_cache_ids_sem);
 
-void memcg_get_cache_ids(void)
+void memcg_list_lru_resize_lock(void)
 {
 	down_read(&memcg_cache_ids_sem);
 }
 
-void memcg_put_cache_ids(void)
+void memcg_list_lru_resize_unlock(void)
 {
 	up_read(&memcg_cache_ids_sem);
 }
-- 
2.11.0
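
A condensed sketch of the locking protocol the new names describe
(generic names; resize_sem stands in for memcg_cache_ids_sem):

  #include <linux/rwsem.h>

  static DECLARE_RWSEM(resize_sem);
  static int nr_ids; /* stands in for memcg_nr_cache_ids */

  static void lru_init_or_destroy_path(void)
  {
          down_read(&resize_sem);  /* memcg_list_lru_resize_lock() */
          /* nr_ids cannot change here: safe to size per-memcg arrays. */
          up_read(&resize_sem);    /* memcg_list_lru_resize_unlock() */
  }

  static void resize_path(int new_size)
  {
          down_write(&resize_sem); /* excludes all init/destroy paths */
          nr_ids = new_size;
          up_write(&resize_sem);
  }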


* Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
@ 2021-04-28 23:32   ` Shakeel Butt
  2021-04-28  9:49 ` [PATCH 2/9] mm: memcontrol: remove kmemcg_id reparenting Muchun Song
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Shakeel Butt @ 2021-04-28 23:32 UTC (permalink / raw)
  To: Muchun Song
  Cc: Matthew Wilcox, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Roman Gushchin, Yang Shi, alexs,
	Alexander Duyck, Wei Yang, linux-fsdevel, LKML, Linux MM

On Wed, Apr 28, 2021 at 2:54 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> In our server, we found a suspected memory leak problem. The kmalloc-32
> consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> memory.
>
> After our in-depth analysis, the memory consumption of kmalloc-32 slab
> cache is the cause of list_lru_one allocation.
>
>   crash> p memcg_nr_cache_ids
>   memcg_nr_cache_ids = $2 = 24574
>
> memcg_nr_cache_ids is very large and memory consumption of each list_lru
> can be calculated with the following formula.
>
>   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
>
> There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
>
>   crash> list super_blocks | wc -l
>   952
>
> Every mount will register 2 list lrus, one is for inode, another is for
> dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3
> MB (~5.6GB). But the number of memory cgroup is less than 500. So I
> guess more than 12286 containers have been deployed on this machine (I
> do not know why there are so many containers, it may be a user's bug or
> the user really want to do that). But now there are less than 500
> containers in the system. And memcg_nr_cache_ids has not been reduced
> to a suitable value. This can waste a lot of memory. If we want to reduce
> memcg_nr_cache_ids, we have to reboot the server. This is not what we
> want.
>
> So this patchset will dynamically adjust the value of memcg_nr_cache_ids
> to keep healthy memory consumption. In this case, we may be able to restore
> a healthy environment even if the users have created tens of thousands of
> memory cgroups and then destroyed those memory cgroups. This patchset also
> contains some code simplification.
>

There was a recent discussion [1] on the same issue. Did you get the
chance to take a look at that? I have not gone through this patch
series yet but will do in the next couple of weeks.

[1] https://lore.kernel.org/linux-fsdevel/20210405054848.GA1077931@in.ibm.com/


* Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-28 23:32   ` Shakeel Butt
@ 2021-04-29  3:05     ` Muchun Song
  -1 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-29  3:05 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Matthew Wilcox, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Roman Gushchin, Yang Shi, alexs,
	Alexander Duyck, Wei Yang, linux-fsdevel, LKML, Linux MM

On Thu, Apr 29, 2021 at 7:32 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Apr 28, 2021 at 2:54 AM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > In our server, we found a suspected memory leak problem. The kmalloc-32
> > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > memory.
> >
> > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > cache is the cause of list_lru_one allocation.
> >
> >   crash> p memcg_nr_cache_ids
> >   memcg_nr_cache_ids = $2 = 24574
> >
> > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > can be calculated with the following formula.
> >
> >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> >
> > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> >
> >   crash> list super_blocks | wc -l
> >   952
> >
> > Every mount will register 2 list lrus, one is for inode, another is for
> > dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3
> > MB (~5.6GB). But the number of memory cgroup is less than 500. So I
> > guess more than 12286 containers have been deployed on this machine (I
> > do not know why there are so many containers, it may be a user's bug or
> > the user really want to do that). But now there are less than 500
> > containers in the system. And memcg_nr_cache_ids has not been reduced
> > to a suitable value. This can waste a lot of memory. If we want to reduce
> > memcg_nr_cache_ids, we have to reboot the server. This is not what we
> > want.
> >
> > So this patchset will dynamically adjust the value of memcg_nr_cache_ids
> > to keep healthy memory consumption. In this case, we may be able to restore
> > a healthy environment even if the users have created tens of thousands of
> > memory cgroups and then destroyed those memory cgroups. This patchset also
> > contains some code simplification.
> >
>
> There was a recent discussion [1] on the same issue. Did you get the
> chance to take a look at that. I have not gone through this patch
> series yet but will do in the next couple of weeks.
>
> [1] https://lore.kernel.org/linux-fsdevel/20210405054848.GA1077931@in.ibm.com/

Thanks for your reminder.

No, I hadn't, but I have looked at it now. The issue is very similar to
mine, but Bharata seems to want to run 10k containers and to optimize
the memory consumption of list_lru_one for that case. That is not what
I am doing: I want to shrink the size of the list lrus when the number
of memcgs drops from tens of thousands to hundreds.

Thanks.


* Re: [PATCH 7/9] ida: introduce ida_max() to return the maximum allocated ID
  2021-04-28  9:49 ` [PATCH 7/9] ida: introduce ida_max() to return the maximum allocated ID Muchun Song
@ 2021-04-29  6:47   ` Christoph Hellwig
  2021-04-29  7:36       ` Muchun Song
  0 siblings, 1 reply; 30+ messages in thread
From: Christoph Hellwig @ 2021-04-29  6:47 UTC (permalink / raw)
  To: Muchun Song
  Cc: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang,
	linux-fsdevel, linux-kernel, linux-mm

On Wed, Apr 28, 2021 at 05:49:47PM +0800, Muchun Song wrote:
> Introduce ida_max() to return the maximum allocated ID. This will be
> used by memory cgroup in the later patch.

Please also add a lower-level xa_max, as this functionality would also
come in handy at the xarray level in a few places.

* Re: [External] Re: [PATCH 7/9] ida: introduce ida_max() to return the maximum allocated ID
  2021-04-29  6:47   ` Christoph Hellwig
@ 2021-04-29  7:36       ` Muchun Song
  0 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-29  7:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Shakeel Butt, Roman Gushchin, Yang Shi, alexs,
	Alexander Duyck, Wei Yang, linux-fsdevel, LKML,
	Linux Memory Management List

On Thu, Apr 29, 2021 at 2:49 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Apr 28, 2021 at 05:49:47PM +0800, Muchun Song wrote:
> > Introduce ida_max() to return the maximum allocated ID. This will be
> > used by memory cgroup in the later patch.
>
> Please also add a lower-level xa_max as this funcationalty would also
> come in handy at the xarray level in a few places.

Good suggestion. This way is also efficient. Thanks.


* Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
                   ` (9 preceding siblings ...)
  2021-04-28 23:32   ` Shakeel Butt
@ 2021-04-30  0:49 ` Dave Chinner
  2021-04-30  1:39   ` Roman Gushchin
  10 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2021-04-30  0:49 UTC (permalink / raw)
  To: Muchun Song
  Cc: willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb, guro,
	shy828301, alexs, alexander.h.duyck, richard.weiyang,
	linux-fsdevel, linux-kernel, linux-mm

On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> In our server, we found a suspected memory leak problem. The kmalloc-32
> consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> memory.
> 
> After our in-depth analysis, the memory consumption of kmalloc-32 slab
> cache is the cause of list_lru_one allocation.
> 
>   crash> p memcg_nr_cache_ids
>   memcg_nr_cache_ids = $2 = 24574
> 
> memcg_nr_cache_ids is very large and memory consumption of each list_lru
> can be calculated with the following formula.
> 
>   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> 
> There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> 
>   crash> list super_blocks | wc -l
>   952

The more I see people trying to work around this, the more I think
that the way memcgs have been grafted into the list_lru is back to
front.

We currently allocate scope for every memcg to be tracked on every
superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where memcgs are
confined to just a small subset of the total number of superblocks
that are instantiated at any given point in time.

IOWs, for these systems with huge container counts, list_lru does
not need the capability of tracking every memcg on every superblock.

What it comes down to is that the list_lru is only needed for a
given memcg if that memcg is instantiating and freeing objects on a
given list_lru.

Which makes me think we should be moving more towards an "add the memcg
to the list_lru at the first insert" model rather than "instantiate
all at memcg init time just in case", as sketched below. The model we
originally came up with for supporting memcgs is really starting to
show its limits, and we should address those limitations rather than
hack more complexity into a system that does nothing to remove the
limitations causing the problems in the first place.
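
A hedged sketch of what such a "first insert" model might look like
(purely illustrative; every type and name below is hypothetical and
not part of this series):

  #include <linux/atomic.h>
  #include <linux/list.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct lru_one {
          struct list_head list;
          long nr_items;
  };

  struct lru_array {
          struct rcu_head rcu;
          struct lru_one *lru[];
  };

  /* Allocate the per-memcg list on first insert instead of up front. */
  static struct lru_one *lru_one_get_lazy(struct lru_array *arr, int idx)
  {
          struct lru_one *l = READ_ONCE(arr->lru[idx]);

          if (l)
                  return l;

          l = kzalloc(sizeof(*l), GFP_NOWAIT); /* may run under a lock */
          if (!l)
                  return NULL;
          INIT_LIST_HEAD(&l->list);

          /* Publish with cmpxchg; if we lose the race, free our copy. */
          if (cmpxchg(&arr->lru[idx], NULL, l)) {
                  kfree(l);
                  l = READ_ONCE(arr->lru[idx]);
          }
          return l;
  }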

> Every mount will register 2 list lrus, one is for inode, another is for
> dentry. There are 952 super_blocks. So the total memory is 952 * 2 * 3
> MB (~5.6GB).  But the number of memory cgroup is less than 500. So I
> guess more than 12286 containers have been deployed on this machine (I
> do not know why there are so many containers, it may be a user's bug or
> the user really want to do that). But now there are less than 500
> containers in the system. And memcg_nr_cache_ids has not been reduced
> to a suitable value. This can waste a lot of memory. If we want to reduce
> memcg_nr_cache_ids, we have to reboot the server. This is not what we
> want.

Exactly my point. This model is broken and doesn't scale to large
counts of either memcgs or superblocks.

We need a different model for dynamically adding, removing and
mapping memcgs to the LRU lists they are actually using, so that we can
efficiently scale to tens of thousands of memcg instances along
with tens of thousands of unique superblock instances. That's the
real problem that needs solving here.

> So this patchset will dynamically adjust the value of memcg_nr_cache_ids
> to keep healthy memory consumption.

Gigabytes of RAM for largely unused memcg list_lrus on every
superblock is not healthy. It's highly inefficient because the
infrastructure we currently have was never designed to scale to
these numbers of unique containers and superblocks...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-30  0:49 ` Dave Chinner
@ 2021-04-30  1:39   ` Roman Gushchin
  2021-04-30  3:27     ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2021-04-30  1:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Muchun Song, willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb,
	shy828301, alexs, alexander.h.duyck, richard.weiyang,
	linux-fsdevel, linux-kernel, linux-mm

On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > In our server, we found a suspected memory leak problem. The kmalloc-32
> > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > memory.
> > 
> > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > cache is the cause of list_lru_one allocation.
> > 
> >   crash> p memcg_nr_cache_ids
> >   memcg_nr_cache_ids = $2 = 24574
> > 
> > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > can be calculated with the following formula.
> > 
> >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > 
> > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > 
> >   crash> list super_blocks | wc -l
> >   952
> 
> The more I see people trying to work around this, the more I think
> that the way memcgs have been grafted into the list_lru is back to
> front.
> 
> We currently allocate scope for every memcg to be able to tracked on
> every not on every superblock instantiated in the system, regardless
> of whether that superblock is even accessible to that memcg.
> 
> These huge memcg counts come from container hosts where memcgs are
> confined to just a small subset of the total number of superblocks
> that instantiated at any given point in time.
> 
> IOWs, for these systems with huge container counts, list_lru does
> not need the capability of tracking every memcg on every superblock.
> 
> What it comes down to is that the list_lru is only needed for a
> given memcg if that memcg is instatiating and freeing objects on a
> given list_lru.
> 
> Which makes me think we should be moving more towards "add the memcg
> to the list_lru at the first insert" model rather than "instantiate
> all at memcg init time just in case". The model we originally came
> up with for supprting memcgs is really starting to show it's limits,
> and we should address those limitations rahter than hack more
> complexity into the system that does nothing to remove the
> limitations that are causing the problems in the first place.

I totally agree.

It looks like the initial implementation of the whole kernel memory accounting
and memcg-aware shrinkers was based on the idea that the number of memory
cgroups is relatively small and stable. With systemd creating a separate cgroup
for everything, including short-lived processes, that is simply not true anymore.

Thanks!

* Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-30  1:39   ` Roman Gushchin
@ 2021-04-30  3:27     ` Dave Chinner
  2021-04-30  8:32         ` Muchun Song
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2021-04-30  3:27 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Muchun Song, willy, akpm, hannes, mhocko, vdavydov.dev, shakeelb,
	shy828301, alexs, alexander.h.duyck, richard.weiyang,
	linux-fsdevel, linux-kernel, linux-mm

On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > In our server, we found a suspected memory leak problem. The kmalloc-32
> > > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > > memory.
> > > 
> > > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > > cache is the cause of list_lru_one allocation.
> > > 
> > >   crash> p memcg_nr_cache_ids
> > >   memcg_nr_cache_ids = $2 = 24574
> > > 
> > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > can be calculated with the following formula.
> > > 
> > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > 
> > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > 
> > >   crash> list super_blocks | wc -l
> > >   952
> > 
> > The more I see people trying to work around this, the more I think
> > that the way memcgs have been grafted into the list_lru is back to
> > front.
> > 
> > We currently allocate scope for every memcg to be able to be tracked on
> > every node on every superblock instantiated in the system, regardless
> > of whether that superblock is even accessible to that memcg.
> > 
> > These huge memcg counts come from container hosts where memcgs are
> > confined to just a small subset of the total number of superblocks
> > that are instantiated at any given point in time.
> > 
> > IOWs, for these systems with huge container counts, list_lru does
> > not need the capability of tracking every memcg on every superblock.
> > 
> > What it comes down to is that the list_lru is only needed for a
> > given memcg if that memcg is instantiating and freeing objects on a
> > given list_lru.
> > 
> > Which makes me think we should be moving more towards "add the memcg
> > to the list_lru at the first insert" model rather than "instantiate
> > all at memcg init time just in case". The model we originally came
> > up with for supporting memcgs is really starting to show its limits,
> > and we should address those limitations rather than hack more
> > complexity into the system that does nothing to remove the
> > limitations that are causing the problems in the first place.
> 
> I totally agree.
> 
> It looks like the initial implementation of the whole kernel memory accounting
> and memcg-aware shrinkers was based on the idea that the number of memory
> cgroups is relatively small and stable.

Yes, that was one of the original assumptions - tens to maybe low
hundreds of memcgs at most. The other was that memcgs weren't NUMA
aware, and so would only need a single LRU list per memcg. Hence the
total overhead, even with "lots" of memcgs and superblocks, wasn't
that great.

Then came "memcgs need to be NUMA aware" because of the size of the
machines they were being used for resource management in, and that
greatly increased the per-memcg, per LRU overhead. Now we're talking
about needing to support a couple of orders of magnitude more memcgs
and superblocks than were originally designed for.

So, really, we're way beyond the original design scope of this
subsystem now.

> With systemd creating a separate cgroup
> for everything, including short-lived processes, it is simply not true anymore.

Yeah, that too. Everything is much more dynamic these days...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-30  3:27     ` Dave Chinner
@ 2021-04-30  8:32         ` Muchun Song
  0 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-04-30  8:32 UTC (permalink / raw)
  To: Dave Chinner, Roman Gushchin
  Cc: Matthew Wilcox, Andrew Morton, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Shakeel Butt, Yang Shi, alexs, Alexander Duyck,
	Wei Yang, linux-fsdevel, LKML, Linux Memory Management List

On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > In our server, we found a suspected memory leak problem. The kmalloc-32
> > > > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > > > memory.
> > > >
> > > > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > > > cache is the cause of list_lru_one allocation.
> > > >
> > > >   crash> p memcg_nr_cache_ids
> > > >   memcg_nr_cache_ids = $2 = 24574
> > > >
> > > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > > can be calculated with the following formula.
> > > >
> > > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > >
> > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > >
> > > >   crash> list super_blocks | wc -l
> > > >   952
> > >
> > > The more I see people trying to work around this, the more I think
> > > that the way memcgs have been grafted into the list_lru is back to
> > > front.
> > >
> > > We currently allocate scope for every memcg to be able to be tracked on
> > > every node on every superblock instantiated in the system, regardless
> > > of whether that superblock is even accessible to that memcg.
> > >
> > > These huge memcg counts come from container hosts where memcgs are
> > > confined to just a small subset of the total number of superblocks
> > > that are instantiated at any given point in time.
> > >
> > > IOWs, for these systems with huge container counts, list_lru does
> > > not need the capability of tracking every memcg on every superblock.
> > >
> > > What it comes down to is that the list_lru is only needed for a
> > > given memcg if that memcg is instantiating and freeing objects on a
> > > given list_lru.
> > >
> > > Which makes me think we should be moving more towards "add the memcg
> > > to the list_lru at the first insert" model rather than "instantiate
> > > all at memcg init time just in case". The model we originally came
> > > up with for supporting memcgs is really starting to show its limits,
> > > and we should address those limitations rather than hack more
> > > complexity into the system that does nothing to remove the
> > > limitations that are causing the problems in the first place.
> >
> > I totally agree.
> >
> > It looks like the initial implementation of the whole kernel memory accounting
> > and memcg-aware shrinkers was based on the idea that the number of memory
> > cgroups is relatively small and stable.
>
> Yes, that was one of the original assumptions - tens to maybe low
> hundreds of memcgs at most. The other was that memcgs weren't NUMA
> aware, and so would only need a single LRU list per memcg. Hence the
> total overhead, even with "lots" of memcgs and superblocks, wasn't
> that great.
>
> Then came "memcgs need to be NUMA aware" because of the size of the
> machines they were being used for resource management in, and that
> greatly increased the per-memcg, per LRU overhead. Now we're talking
> about needing to support a couple of orders of magnitude more memcgs
> and superblocks than were originally designed for.
>
> So, really, we're way beyond the original design scope of this
> subsystem now.

Got it. So it is better to allocate the list_lru_node structures
dynamically. We should only allocate them when they are really needed.
But allocating memory with GFP_ATOMIC in list_lru_add() is
not a good idea, so we should allocate the memory outside of
list_lru_add(). I can propose an approach that may work.

Before we start, we should know the following rules of list lrus.

- Only objects allocated with __GFP_ACCOUNT need to allocate
  the struct list_lru_node.
- The caller allocating the memory must know which list_lru the
  object will be inserted into.

So we can allocate the struct list_lru_node when allocating the
object instead of allocating it in list_lru_add(). It is easy, because
we already know the list_lru and the memcg the object belongs
to. So we can introduce a new helper that allocates the object and
the list_lru_node, like below.

void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
                                gfp_t gfpflags)
{
        void *ret = kmem_cache_alloc(s, gfpflags);

        if (ret && (gfpflags & __GFP_ACCOUNT)) {
                struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);

                if (mem_cgroup_is_root(memcg))
                        return ret;

                /*
                 * Allocate the per-memcg list_lru_node; if it is
                 * already allocated, do nothing.
                 */
                memcg_list_lru_node_alloc(lru, memcg,
                                          page_to_nid(virt_to_page(ret)),
                                          gfpflags);
        }

        return ret;
}

If the user wants to insert the allocated object into its lru list in
the future, the user should use list_lru_kmem_cache_alloc() instead of
kmem_cache_alloc().
I have looked at the code closely. There are 3 different kmem_caches that
need to use this new API to allocate memory. They are inode_cachep,
dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.
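
For illustration, a dentry allocation site might then change like this
(a hypothetical sketch, not existing code; note that, as written above,
the helper only acts on allocations passing __GFP_ACCOUNT, so the call
site has to request it explicitly):

	/* before */
	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);

	/* after: pass the lru so its list_lru_node can be preallocated */
	dentry = list_lru_kmem_cache_alloc(&sb->s_dentry_lru, dentry_cache,
					   GFP_KERNEL | __GFP_ACCOUNT);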

Hi Roman and Dave,

What do you think about this approach? If there is no problem, I can provide
a preliminary patchset within a week.

Thanks.

>
> > With systemd creating a separate cgroup
> > for everything, including short-lived processes, it is simply not true anymore.
>
> Yeah, that too. Everything is much more dynamic these days...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-30  8:32         ` Muchun Song
  (?)
@ 2021-05-01  3:10         ` Roman Gushchin
  -1 siblings, 0 replies; 30+ messages in thread
From: Roman Gushchin @ 2021-05-01  3:10 UTC (permalink / raw)
  To: Muchun Song
  Cc: Dave Chinner, Matthew Wilcox, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Shakeel Butt, Yang Shi, alexs,
	Alexander Duyck, Wei Yang, linux-fsdevel, LKML,
	Linux Memory Management List

On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > > On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > > In our server, we found a suspected memory leak problem. The kmalloc-32
> > > > > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > > > > memory.
> > > > >
> > > > > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > > > > cache is the cause of list_lru_one allocation.
> > > > >
> > > > >   crash> p memcg_nr_cache_ids
> > > > >   memcg_nr_cache_ids = $2 = 24574
> > > > >
> > > > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > > > can be calculated with the following formula.
> > > > >
> > > > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > > >
> > > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > > >
> > > > >   crash> list super_blocks | wc -l
> > > > >   952
> > > >
> > > > The more I see people trying to work around this, the more I think
> > > > that the way memcgs have been grafted into the list_lru is back to
> > > > front.
> > > >
> > > > We currently allocate scope for every memcg to be able to be tracked on
> > > > every node on every superblock instantiated in the system, regardless
> > > > of whether that superblock is even accessible to that memcg.
> > > >
> > > > These huge memcg counts come from container hosts where memcgs are
> > > > confined to just a small subset of the total number of superblocks
> > > > that are instantiated at any given point in time.
> > > >
> > > > IOWs, for these systems with huge container counts, list_lru does
> > > > not need the capability of tracking every memcg on every superblock.
> > > >
> > > > What it comes down to is that the list_lru is only needed for a
> > > > given memcg if that memcg is instantiating and freeing objects on a
> > > > given list_lru.
> > > >
> > > > Which makes me think we should be moving more towards "add the memcg
> > > > to the list_lru at the first insert" model rather than "instantiate
> > > > all at memcg init time just in case". The model we originally came
> > > > up with for supporting memcgs is really starting to show its limits,
> > > > and we should address those limitations rather than hack more
> > > > complexity into the system that does nothing to remove the
> > > > limitations that are causing the problems in the first place.
> > >
> > > I totally agree.
> > >
> > > It looks like the initial implementation of the whole kernel memory accounting
> > > and memcg-aware shrinkers was based on the idea that the number of memory
> > > cgroups is relatively small and stable.
> >
> > Yes, that was one of the original assumptions - tens to maybe low
> > hundreds of memcgs at most. The other was that memcgs weren't NUMA
> > aware, and so would only need a single LRU list per memcg. Hence the
> > total overhead, even with "lots" of memcgs and superblocks, wasn't
> > that great.
> >
> > Then came "memcgs need to be NUMA aware" because of the size of the
> > machines they were being used for resource management in, and that
> > greatly increased the per-memcg, per LRU overhead. Now we're talking
> > about needing to support a couple of orders of magnitude more memcgs
> > and superblocks than were originally designed for.
> >
> > So, really, we're way beyond the original design scope of this
> > subsystem now.
> 
> Got it. So it is better to allocate the list_lru_node structures
> dynamically. We should only allocate them when they are really needed.
> But allocating memory with GFP_ATOMIC in list_lru_add() is
> not a good idea, so we should allocate the memory outside of
> list_lru_add(). I can propose an approach that may work.
> 
> Before we start, we should know the following rules of list lrus.
> 
> - Only objects allocated with __GFP_ACCOUNT need to allocate
>   the struct list_lru_node.
> - The caller allocating the memory must know which list_lru the
>   object will be inserted into.
> 
> So we can allocate the struct list_lru_node when allocating the
> object instead of allocating it in list_lru_add(). It is easy, because
> we already know the list_lru and the memcg the object belongs
> to. So we can introduce a new helper that allocates the object and
> the list_lru_node, like below.
> 
> void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
>                                 gfp_t gfpflags)
> {
>         void *ret = kmem_cache_alloc(s, gfpflags);
> 
>         if (ret && (gfpflags & __GFP_ACCOUNT)) {
>                 struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);
> 
>                 if (mem_cgroup_is_root(memcg))
>                         return ret;
> 
>                 /*
>                  * Allocate the per-memcg list_lru_node; if it is
>                  * already allocated, do nothing.
>                  */
>                 memcg_list_lru_node_alloc(lru, memcg,
>                                           page_to_nid(virt_to_page(ret)),
>                                           gfpflags);
>         }
> 
>         return ret;
> }
> 
> If the user wants to insert the allocated object into its lru list in
> the future, the user should use list_lru_kmem_cache_alloc() instead of
> kmem_cache_alloc().
> I have looked at the code closely. There are 3 different kmem_caches that
> need to use this new API to allocate memory. They are inode_cachep,
> dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.
> 
> Hi Roman and Dave,
> 
> What do you think about this approach? If there is no problem, I can provide
> a preliminary patchset within a week.

At a very first glance it looks similar to what Bharata proposed, but with some
additional tricks. It would be nice to find common ground here. In general,
I think it's the right direction.

That said, I believe we might need some more fundamental changes, but I don't
have a specific recipe yet. I need to think more about it.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-30  8:32         ` Muchun Song
  (?)
  (?)
@ 2021-05-01  3:27         ` Matthew Wilcox
  -1 siblings, 0 replies; 30+ messages in thread
From: Matthew Wilcox @ 2021-05-01  3:27 UTC (permalink / raw)
  To: Muchun Song
  Cc: Dave Chinner, Roman Gushchin, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Shakeel Butt, Yang Shi, alexs,
	Alexander Duyck, Wei Yang, linux-fsdevel, LKML,
	Linux Memory Management List

On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> Before we start, we should know the following rules of list lrus.
> 
> - Only objects allocated with __GFP_ACCOUNT need to allocate
>   the struct list_lru_node.
> - The caller allocating the memory must know which list_lru the
>   object will be inserted into.
> 
> So we can allocate the struct list_lru_node when allocating the
> object instead of allocating it in list_lru_add(). It is easy, because
> we already know the list_lru and the memcg the object belongs
> to. So we can introduce a new helper that allocates the object and
> the list_lru_node, like below.

I feel like there may be a simpler solution, although I'm not really
familiar with the list_lru situation.  The three caches you mention:

> I have looked at the code closely. There are 3 different kmem_caches that
> need to use this new API to allocate memory. They are inode_cachep,
> dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.

are all filesystem caches.  So if there's a way of knowing which
filesystems are exposed to each container, we can allocate the list_lru
structures at "mount" time rather than at the first allocation for a
given cache/lru/memcg combination.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-04-30  8:32         ` Muchun Song
                           ` (2 preceding siblings ...)
  (?)
@ 2021-05-02 23:58         ` Dave Chinner
  2021-05-03  6:33             ` Muchun Song
  -1 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2021-05-02 23:58 UTC (permalink / raw)
  To: Muchun Song
  Cc: Roman Gushchin, Matthew Wilcox, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Shakeel Butt, Yang Shi, alexs,
	Alexander Duyck, Wei Yang, linux-fsdevel, LKML,
	Linux Memory Management List

On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > > On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > > In our server, we found a suspected memory leak problem. The kmalloc-32
> > > > > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > > > > memory.
> > > > >
> > > > > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > > > > cache is the cause of list_lru_one allocation.
> > > > >
> > > > >   crash> p memcg_nr_cache_ids
> > > > >   memcg_nr_cache_ids = $2 = 24574
> > > > >
> > > > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > > > can be calculated with the following formula.
> > > > >
> > > > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > > >
> > > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > > >
> > > > >   crash> list super_blocks | wc -l
> > > > >   952
> > > >
> > > > The more I see people trying to work around this, the more I think
> > > > that the way memcgs have been grafted into the list_lru is back to
> > > > front.
> > > >
> > > > We currently allocate scope for every memcg to be able to be tracked on
> > > > every node on every superblock instantiated in the system, regardless
> > > > of whether that superblock is even accessible to that memcg.
> > > >
> > > > These huge memcg counts come from container hosts where memcgs are
> > > > confined to just a small subset of the total number of superblocks
> > > > that are instantiated at any given point in time.
> > > >
> > > > IOWs, for these systems with huge container counts, list_lru does
> > > > not need the capability of tracking every memcg on every superblock.
> > > >
> > > > What it comes down to is that the list_lru is only needed for a
> > > > given memcg if that memcg is instantiating and freeing objects on a
> > > > given list_lru.
> > > >
> > > > Which makes me think we should be moving more towards "add the memcg
> > > > to the list_lru at the first insert" model rather than "instantiate
> > > > all at memcg init time just in case". The model we originally came
> > > > up with for supporting memcgs is really starting to show its limits,
> > > > and we should address those limitations rather than hack more
> > > > complexity into the system that does nothing to remove the
> > > > limitations that are causing the problems in the first place.
> > >
> > > I totally agree.
> > >
> > > It looks like the initial implementation of the whole kernel memory accounting
> > > and memcg-aware shrinkers was based on the idea that the number of memory
> > > cgroups is relatively small and stable.
> >
> > Yes, that was one of the original assumptions - tens to maybe low
> > hundreds of memcgs at most. The other was that memcgs weren't NUMA
> > aware, and so would only need a single LRU list per memcg. Hence the
> > total overhead, even with "lots" of memcgs and superblocks, wasn't
> > that great.
> >
> > Then came "memcgs need to be NUMA aware" because of the size of the
> > machines they were being used for resource management in, and that
> > greatly increased the per-memcg, per LRU overhead. Now we're talking
> > about needing to support a couple of orders of magnitude more memcgs
> > and superblocks than were originally designed for.
> >
> > So, really, we're way beyond the original design scope of this
> > subsystem now.
> 
> Got it. So it is better to allocate the list_lru_node structures
> dynamically. We should only allocate them when they are really needed.
> But allocating memory with GFP_ATOMIC in list_lru_add() is
> not a good idea, so we should allocate the memory outside of
> list_lru_add(). I can propose an approach that may work.
> 
> Before we start, we should know the following rules of list lrus.
> 
> - Only objects allocated with __GFP_ACCOUNT need to allocate
>   the struct list_lru_node.

This seems .... misguided. inode and dentry caches are already
marked as accounted, so individual calls to allocate from these
slabs do not need this annotation.
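
For reference, this is roughly how the dentry cache is created in
fs/dcache.c (the exact flag set varies a little between kernel
versions):

	dentry_cache = KMEM_CACHE_USERCOPY(dentry,
				SLAB_RECLAIM_ACCOUNT | SLAB_PANIC |
				SLAB_MEM_SPREAD | SLAB_ACCOUNT,
				d_iname);

Because the cache is created with SLAB_ACCOUNT, every object allocated
from it is charged to a memcg without the caller having to pass
__GFP_ACCOUNT.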

> - The caller allocating the memory must know which list_lru the
>   object will be inserted into.
> 
> So we can allocate the struct list_lru_node when allocating the
> object instead of allocating it in list_lru_add(). It is easy, because
> we already know the list_lru and the memcg the object belongs
> to. So we can introduce a new helper that allocates the object and
> the list_lru_node, like below.
> 
> void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
>                                 gfp_t gfpflags)
> {
>         void *ret = kmem_cache_alloc(s, gfpflags);
> 
>         if (ret && (gfpflags & __GFP_ACCOUNT)) {
>                 struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);
> 
>                 if (mem_cgroup_is_root(memcg))
>                         return ret;
> 
>                 /*
>                  * Allocate the per-memcg list_lru_node; if it is
>                  * already allocated, do nothing.
>                  */
>                 memcg_list_lru_node_alloc(lru, memcg,
>                                           page_to_nid(virt_to_page(ret)),
>                                           gfpflags);

If we are allowing kmem_cache_alloc() to fail, then we can allow
memcg_list_lru_node_alloc() to fail, too.

Also, why put this outside kmem_cache_alloc()? Node id and memcg are
already known internally to kmem_cache_alloc() when allocating from
a slab, so why not associate the slab allocation with the LRU
directly when doing the memcg accounting and so avoid doing costly
duplicate work on every allocation?

i.e. the list-lru was moved inside the mm/ dir because "it's a mm
specific construct only", so why not actually make use of that
designation to internalise this entire memcg management issue into
the slab allocation routines? i.e.  an API like
kmem_cache_alloc_lru(cache, lru, gfpflags) allows this to be
completely internalised and efficiently implemented with minimal
change to callers. It also means that memory allocation callers
don't need to know anything about memcg management, which is always
a win....
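
A rough sketch of the shape such an API might take (hypothetical;
memcg_list_lru_alloc() is an assumed helper that would ensure the
per-memcg, per-node list exists, tolerating root or unaccounted
objects):

	void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
				   gfp_t gfpflags)
	{
		void *ret = kmem_cache_alloc(s, gfpflags);

		/*
		 * Ensure the lru has a list_lru_one for this object's
		 * memcg and node before the caller can list_lru_add()
		 * it; all of the memcg details stay internal to mm/.
		 */
		if (ret && lru &&
		    memcg_list_lru_alloc(mem_cgroup_from_obj(ret), lru,
					 gfpflags)) {
			kmem_cache_free(s, ret);
			ret = NULL;
		}
		return ret;
	}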

>         }
> 
>         return ret;
> }
> 
> If the user wants to insert the allocated object into its lru list in
> the future, the user should use list_lru_kmem_cache_alloc() instead of
> kmem_cache_alloc().
> I have looked at the code closely. There are 3 different kmem_caches that
> need to use this new API to allocate memory. They are inode_cachep,
> dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.

It might work, but I think you may have overlooked the complexity
of inode allocation for filesystems. i.e.  alloc_inode() calls out
to filesystem allocation functions more often than it allocates
directly from the inode_cachep.  i.e.  Most filesystems provide
their own ->alloc_inode superblock operation, and they allocate
inodes out of their own specific slab caches, not the inode_cachep.

And then you have filesystems like XFS, where alloc_inode() will
never be called, and implement ->alloc_inode as:

/* Catch misguided souls that try to use this interface on XFS */
STATIC struct inode *
xfs_fs_alloc_inode(
        struct super_block      *sb)
{
	BUG();
	return NULL;
}

Because all the inode caching and allocation is internal to XFS and
VFS inode management interfaces are not used.

So I suspect that an external wrapper function is not the way to go
here - either internalising the LRU management into the slab
allocation or adding the memcg code to alloc_inode() and filesystem
specific routines would make a lot more sense to me.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-05-02 23:58         ` Dave Chinner
@ 2021-05-03  6:33             ` Muchun Song
  0 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-05-03  6:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, Matthew Wilcox, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Shakeel Butt, Yang Shi, alexs,
	Alexander Duyck, Wei Yang, linux-fsdevel, LKML,
	Linux Memory Management List

On Mon, May 3, 2021 at 7:58 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Apr 30, 2021 at 04:32:39PM +0800, Muchun Song wrote:
> > On Fri, Apr 30, 2021 at 11:27 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> > > > On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > > > > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > > > > In our server, we found a suspected memory leak problem. The kmalloc-32
> > > > > > consumes more than 6GB of memory. Other kmem_caches consume less than 2GB
> > > > > > memory.
> > > > > >
> > > > > > After our in-depth analysis, the memory consumption of kmalloc-32 slab
> > > > > > cache is the cause of list_lru_one allocation.
> > > > > >
> > > > > >   crash> p memcg_nr_cache_ids
> > > > > >   memcg_nr_cache_ids = $2 = 24574
> > > > > >
> > > > > > memcg_nr_cache_ids is very large and memory consumption of each list_lru
> > > > > > can be calculated with the following formula.
> > > > > >
> > > > > >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > > > > >
> > > > > > There are 4 numa nodes in our system, so each list_lru consumes ~3MB.
> > > > > >
> > > > > >   crash> list super_blocks | wc -l
> > > > > >   952
> > > > >
> > > > > The more I see people trying to work around this, the more I think
> > > > > that the way memcgs have been grafted into the list_lru is back to
> > > > > front.
> > > > >
> > > > > We currently allocate scope for every memcg to be able to be tracked on
> > > > > every node on every superblock instantiated in the system, regardless
> > > > > of whether that superblock is even accessible to that memcg.
> > > > >
> > > > > These huge memcg counts come from container hosts where memcgs are
> > > > > confined to just a small subset of the total number of superblocks
> > > > > that are instantiated at any given point in time.
> > > > >
> > > > > IOWs, for these systems with huge container counts, list_lru does
> > > > > not need the capability of tracking every memcg on every superblock.
> > > > >
> > > > > What it comes down to is that the list_lru is only needed for a
> > > > > given memcg if that memcg is instantiating and freeing objects on a
> > > > > given list_lru.
> > > > >
> > > > > Which makes me think we should be moving more towards "add the memcg
> > > > > to the list_lru at the first insert" model rather than "instantiate
> > > > > all at memcg init time just in case". The model we originally came
> > > > > up with for supporting memcgs is really starting to show its limits,
> > > > > and we should address those limitations rather than hack more
> > > > > complexity into the system that does nothing to remove the
> > > > > limitations that are causing the problems in the first place.
> > > >
> > > > I totally agree.
> > > >
> > > > It looks like the initial implementation of the whole kernel memory accounting
> > > > and memcg-aware shrinkers was based on the idea that the number of memory
> > > > cgroups is relatively small and stable.
> > >
> > > Yes, that was one of the original assumptions - tens to maybe low
> > > hundreds of memcgs at most. The other was that memcgs weren't NUMA
> > > aware, and so would only need a single LRU list per memcg. Hence the
> > > total overhead, even with "lots" of memcgs and superblocks, wasn't
> > > that great.
> > >
> > > Then came "memcgs need to be NUMA aware" because of the size of the
> > > machines they were being used for resource management in, and that
> > > greatly increased the per-memcg, per LRU overhead. Now we're talking
> > > about needing to support a couple of orders of magnitude more memcgs
> > > and superblocks than were originally designed for.
> > >
> > > So, really, we're way beyond the original design scope of this
> > > subsystem now.
> >
> > Got it. So it is better to allocate the list_lru_node structures
> > dynamically. We should only allocate them when they are really needed.
> > But allocating memory with GFP_ATOMIC in list_lru_add() is
> > not a good idea, so we should allocate the memory outside of
> > list_lru_add(). I can propose an approach that may work.
> >
> > Before we start, we should know the following rules of list lrus.
> >
> > - Only objects allocated with __GFP_ACCOUNT need to allocate
> >   the struct list_lru_node.
>
> This seems .... misguided. inode and dentry caches are already
> marked as accounted, so individual calls to allocate from these
> slabs do not need this annotation.

Sorry for the confusion. You are right.

>
> > - The caller allocating the memory must know which list_lru the
> >   object will be inserted into.
> >
> > So we can allocate the struct list_lru_node when allocating the
> > object instead of allocating it in list_lru_add(). It is easy, because
> > we already know the list_lru and the memcg the object belongs
> > to. So we can introduce a new helper that allocates the object and
> > the list_lru_node, like below.
> >
> > void *list_lru_kmem_cache_alloc(struct list_lru *lru, struct kmem_cache *s,
> >                                 gfp_t gfpflags)
> > {
> >         void *ret = kmem_cache_alloc(s, gfpflags);
> >
> >         if (ret && (gfpflags & __GFP_ACCOUNT)) {
> >                 struct mem_cgroup *memcg = mem_cgroup_from_obj(ret);
> >
> >                 if (mem_cgroup_is_root(memcg))
> >                         return ret;
> >
> >                 /*
> >                  * Allocate the per-memcg list_lru_node; if it is
> >                  * already allocated, do nothing.
> >                  */
> >                 memcg_list_lru_node_alloc(lru, memcg,
> >                                           page_to_nid(virt_to_page(ret)),
> >                                           gfpflags);
>
> If we are allowing kmem_cache_alloc() to fail, then we can allow
> memcg_list_lru_node_alloc() to fail, too.
>
> Also, why put this outside kmem_cache_alloc()? Node id and memcg are
> already known internally to kmem_cache_alloc() when allocating from
> a slab, so why not associate the slab allocation with the LRU
> directly when doing the memcg accounting and so avoid doing costly
> duplicate work on every allocation?
>
> i.e. the list-lru was moved inside the mm/ dir because "it's a mm
> specific construct only", so why not actually make use of that
> designation to internalise this entire memcg management issue into
> the slab allocation routines? i.e.  an API like

Yeah, we can.

> kmem_cache_alloc_lru(cache, lru, gfpflags) allows this to be
> completely internalised and efficiently implemented with minimal
> change to callers. It also means that memory allocation callers
> don't need to know anything about memcg management, which is always
> a win....

Great idea. It's efficient. I'll give it a try.

>
> >         }
> >
> >         return ret;
> > }
> >
> > If the user wants to insert the allocated object into its lru list in
> > the future, the user should use list_lru_kmem_cache_alloc() instead of
> > kmem_cache_alloc().
> > I have looked at the code closely. There are 3 different kmem_caches that
> > need to use this new API to allocate memory. They are inode_cachep,
> > dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.
>
> It might work, but I think you may have overlooked the complexity
> of inode allocation for filesystems. i.e.  alloc_inode() calls out
> to filesystem allocation functions more often than it allocates
> directly from the inode_cachep.  i.e.  Most filesystems provide
> their own ->alloc_inode superblock operation, and they allocate
> inodes out of their own specific slab caches, not the inode_cachep.

I didn't realize this before. You are right. Most filesystems
have their own kmem_cache instead of inode_cachep.
A lot of filesystem-specific code would need to be changed.
Thanks for the reminder.

>
> And then you have filesystems like XFS, where alloc_inode() will
> never be called, and implement ->alloc_inode as:
>
> /* Catch misguided souls that try to use this interface on XFS */
> STATIC struct inode *
> xfs_fs_alloc_inode(
>         struct super_block      *sb)
> {
>         BUG();
>         return NULL;
> }
>
> Because all the inode caching and allocation is internal to XFS and
> VFS inode management interfaces are not used.
>
> So I suspect that an external wrapper function is not the way to go
> here - either internalising the LRU management into the slab
> allocation or adding the memcg code to alloc_inode() and filesystem
> specific routines would make a lot more sense to me.

Sure. If we introduce kmem_cache_alloc_lru, all filesystems
need to migrate to kmem_cache_alloc_lru. I cannot figure out
an approach that does not need to change filesystem code.

Thanks.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-05-03  6:33             ` Muchun Song
  (?)
@ 2021-05-05  1:13             ` Dave Chinner
  2021-05-07  5:45                 ` Muchun Song
  -1 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2021-05-05  1:13 UTC (permalink / raw)
  To: Muchun Song
  Cc: Roman Gushchin, Matthew Wilcox, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Shakeel Butt, Yang Shi, alexs,
	Alexander Duyck, Wei Yang, linux-fsdevel, LKML,
	Linux Memory Management List

On Mon, May 03, 2021 at 02:33:21PM +0800, Muchun Song wrote:
> On Mon, May 3, 2021 at 7:58 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > If the user wants to insert the allocated object into its lru list in
> > > > the future, the user should use list_lru_kmem_cache_alloc() instead of
> > > > kmem_cache_alloc().
> > > I have looked at the code closely. There are 3 different kmem_caches that
> > > need to use this new API to allocate memory. They are inode_cachep,
> > > dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.
> >
> > It might work, but I think you may have overlooked the complexity
> > of inode allocation for filesystems. i.e.  alloc_inode() calls out
> > to filesystem allocation functions more often than it allocates
> > directly from the inode_cachep.  i.e.  Most filesystems provide
> > their own ->alloc_inode superblock operation, and they allocate
> > inodes out of their own specific slab caches, not the inode_cachep.
> 
> I didn't realize this before. You are right. Most filesystems
> have their own kmem_cache instead of inode_cachep.
> A lot of filesystem-specific code would need to be changed.
> Thanks for the reminder.
> 
> >
> > And then you have filesystems like XFS, where alloc_inode() will
> > never be called, and implement ->alloc_inode as:
> >
> > /* Catch misguided souls that try to use this interface on XFS */
> > STATIC struct inode *
> > xfs_fs_alloc_inode(
> >         struct super_block      *sb)
> > {
> >         BUG();
> >         return NULL;
> > }
> >
> > Because all the inode caching and allocation is internal to XFS and
> > VFS inode management interfaces are not used.
> >
> > So I suspect that an external wrapper function is not the way to go
> > here - either internalising the LRU management into the slab
> > allocation or adding the memcg code to alloc_inode() and filesystem
> > specific routines would make a lot more sense to me.
> 
> Sure. If we introduce kmem_cache_alloc_lru, all filesystems
> need to migrate to kmem_cache_alloc_lru. I cannot figure out
> an approach that does not need to change filesystem code.

Right, I don't think there's a way to avoid changing all the
filesystem code if we are touching the cache allocation routines.
However, if we hide it all inside the allocation routine, then
the change to each filesystem is effectively just a one-liner like:

-	inode = kmem_cache_alloc(inode_cache, GFP_NOFS);
+	inode = kmem_cache_alloc_lru(inode_cache, &sb->s_inode_lru, GFP_NOFS);

Or perhaps, define a generic wrapper function like:

static inline void *
alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
{
	return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
}

And then each filesystem ends up with:

-	inode = kmem_cache_alloc(inode_cache, GFP_NOFS);
+	inode = alloc_inode_sb(sb, inode_cache, GFP_NOFS);

so that all the superblock LRU stuff is also hidden from the
filesystems...
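
Even a filesystem like XFS, which never calls alloc_inode(), could
presumably use the same wrapper in its internal inode allocation path,
something like (hypothetical sketch; xfs_inode_zone was the name of
XFS's internal inode cache at the time):

	ip = alloc_inode_sb(mp->m_super, xfs_inode_zone,
			    GFP_KERNEL | __GFP_NOFAIL);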

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
  2021-05-05  1:13             ` Dave Chinner
@ 2021-05-07  5:45                 ` Muchun Song
  0 siblings, 0 replies; 30+ messages in thread
From: Muchun Song @ 2021-05-07  5:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Roman Gushchin, Matthew Wilcox, Andrew Morton, Johannes Weiner,
	Michal Hocko, Vladimir Davydov, Shakeel Butt, Yang Shi, alexs,
	Wei Yang, linux-fsdevel, LKML, Linux Memory Management List

On Wed, May 5, 2021 at 9:13 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, May 03, 2021 at 02:33:21PM +0800, Muchun Song wrote:
> > On Mon, May 3, 2021 at 7:58 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > If the user wants to insert the allocated object into its lru list in
> > > > the future, the user should use list_lru_kmem_cache_alloc() instead of
> > > > kmem_cache_alloc().
> > > > I have looked at the code closely. There are 3 different kmem_caches that
> > > > need to use this new API to allocate memory. They are inode_cachep,
> > > > dentry_cache and radix_tree_node_cachep. I think that it is easy to migrate.
> > >
> > > It might work, but I think you may have overlooked the complexity
> > > of inode allocation for filesystems. i.e.  alloc_inode() calls out
> > > to filesystem allocation functions more often than it allocates
> > > directly from the inode_cachep.  i.e.  Most filesystems provide
> > > their own ->alloc_inode superblock operation, and they allocate
> > > inodes out of their own specific slab caches, not the inode_cachep.
> >
> > I didn't realize this before. You are right. Most filesystems
> > have their own kmem_cache instead of inode_cachep.
> > A lot of filesystem-specific code would need to be changed.
> > Thanks for the reminder.
> >
> > >
> > > And then you have filesystems like XFS, where alloc_inode() will
> > > never be called, and implement ->alloc_inode as:
> > >
> > > /* Catch misguided souls that try to use this interface on XFS */
> > > STATIC struct inode *
> > > xfs_fs_alloc_inode(
> > >         struct super_block      *sb)
> > > {
> > >         BUG();
> > >         return NULL;
> > > }
> > >
> > > Because all the inode caching and allocation is internal to XFS and
> > > VFS inode management interfaces are not used.
> > >
> > > So I suspect that an external wrapper function is not the way to go
> > > here - either internalising the LRU management into the slab
> > > allocation or adding the memcg code to alloc_inode() and filesystem
> > > specific routines would make a lot more sense to me.
> >
> > Sure. If we introduce kmem_cache_alloc_lru, all filesystems
> > need to migrate to kmem_cache_alloc_lru. I cannot figure out
> > an approach that does not need to change filesystem code.
>
> Right, I don't think there's a way to avoid changing all the
> filesystem code if we are touching the cache allocation routines.
> However, if we hide it all inside the allocation routine, then
> the change to each filesystem is effectively just a one-liner like:
>
> -       inode = kmem_cache_alloc(inode_cache, GFP_NOFS);
> +       inode = kmem_cache_alloc_lru(inode_cache, sb->s_inode_lru, GFP_NOFS);
>
> Or perhaps, define a generic wrapper function like:
>
> static inline void *
> alloc_inode_sb(struct super_block *sb, struct kmem_cache *cache, gfp_t gfp)
> {
>         return kmem_cache_alloc_lru(cache, sb->s_inode_lru, gfp);
> }

Good idea. I am doing this. A preliminary patch is expected next week.
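
For reference, with that wrapper the per-filesystem conversion really
is a one-liner. A hypothetical ext4 version (purely illustrative, the
actual patch may differ) would look like:

static struct inode *
ext4_alloc_inode(struct super_block *sb)
{
	struct ext4_inode_info *ei;

	/* was: kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS) */
	ei = alloc_inode_sb(sb, ext4_inode_cachep, GFP_NOFS);
	if (!ei)
		return NULL;
	/* ... the ext4-specific initialisation stays unchanged ... */
	return &ei->vfs_inode;
}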

Thanks.

>
> And then each filesystem ends up with:
>
> -       inode = kmem_cache_alloc(inode_cache, GFP_NOFS);
> +       inode = alloc_inode_sb(sb, inode_cache, GFP_NOFS);
>
> so that all the superblock LRU stuff is also hidden from the
> filesystems...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2021-05-07  5:46 UTC | newest]

Thread overview: 30+ messages
2021-04-28  9:49 [PATCH 0/9] Shrink the list lru size on memory cgroup removal Muchun Song
2021-04-28  9:49 ` [PATCH 1/9] mm: list_lru: fix list_lru_count_one() return value Muchun Song
2021-04-28  9:49 ` [PATCH 2/9] mm: memcontrol: remove kmemcg_id reparenting Muchun Song
2021-04-28  9:49 ` [PATCH 3/9] mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus Muchun Song
2021-04-28  9:49 ` [PATCH 4/9] mm: memcontrol: remove the kmem states Muchun Song
2021-04-28  9:49 ` [PATCH 5/9] mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online() Muchun Song
2021-04-28  9:49 ` [PATCH 6/9] mm: list_lru: support for shrinking list lru Muchun Song
2021-04-28  9:49 ` [PATCH 7/9] ida: introduce ida_max() to return the maximum allocated ID Muchun Song
2021-04-29  6:47   ` Christoph Hellwig
2021-04-29  7:36     ` [External] " Muchun Song
2021-04-28  9:49 ` [PATCH 8/9] mm: memcontrol: shrink the list lru size Muchun Song
2021-04-28  9:49 ` [PATCH 9/9] mm: memcontrol: rename memcg_{get,put}_cache_ids to memcg_list_lru_resize_{lock,unlock} Muchun Song
2021-04-28 23:32 ` [PATCH 0/9] Shrink the list lru size on memory cgroup removal Shakeel Butt
2021-04-29  3:05   ` [External] " Muchun Song
2021-04-30  0:49 ` Dave Chinner
2021-04-30  1:39   ` Roman Gushchin
2021-04-30  3:27     ` Dave Chinner
2021-04-30  8:32       ` [External] " Muchun Song
2021-05-01  3:10         ` Roman Gushchin
2021-05-01  3:27         ` Matthew Wilcox
2021-05-02 23:58         ` Dave Chinner
2021-05-03  6:33           ` Muchun Song
2021-05-05  1:13             ` Dave Chinner
2021-05-07  5:45               ` Muchun Song
