* [PATCH v2 0/5] protect page cache from freeing inode
@ 2019-12-24  7:53 Yafang Shao
  2019-12-24  7:53 ` [PATCH v2 1/5] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Yafang Shao @ 2019-12-24  7:53 UTC (permalink / raw)
  To: hannes, david, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao


On my server there are some running MEMCGs protected by memory.{min, low},
but I found that the usage of these MEMCGs abruptly became very small, far
less than the protection limit. It confused me, and I finally found that
the cause was inode stealing.

Once an inode is freed, all of its page cache will be dropped as well, no
matter how much page cache it has. So if we intend to protect the page
cache in a memcg, we must protect its host (the inode) first. Otherwise
the memcg protection can easily be bypassed by freeing the inode,
especially if there are big files in this memcg.

The inherent mismatch between memcg and inode is troublesome. One inode
can be shared by different MEMCGs, but that is a very rare case. If an
inode is shared, its page cache may be charged to different MEMCGs.
Currently there's no perfect solution for this kind of issue, but the
inode majority-writer ownership switching can help more or less.
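
For reference, the bypass path described above, i.e. how freeing an inode
drops its page cache, looks roughly like this (simplified; only the
relevant calls are shown):

    prune_icache_sb()
        list_lru_shrink_walk(inode_lru_isolate) <<<< isolates unused inodes
                                                     onto a freeable list
        dispose_list()
            evict()
                truncate_inode_pages_final()    <<<< drops all page cache
                                                     attached to the inode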

This patchset contains five patches:
Patches 1-3: minor optimizations, which also prepare for Patch 5
Patch 4: preparation for Patch 5
Patch 5: the real issue I want to fix

- Changes since v1:
Use the memcg passed in via the shrink_control, instead of getting it from
the inode itself, as suggested by Dave. That makes the layering better.

Yafang Shao (5):
  mm, memcg: reduce size of struct mem_cgroup by using bit field
  mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case
  mm, memcg: reset memcg's memory.{min, low} for reclaiming itself
  mm: make memcg visible to lru walker isolation function
  memcg, inode: protect page cache from freeing inode

 fs/inode.c                 | 25 ++++++++++++++--
 include/linux/memcontrol.h | 54 ++++++++++++++++++++++++++++-------
 mm/list_lru.c              | 22 +++++++-------
 mm/memcontrol.c            | 71 +++++++++++++++++++++++++++++++++++-----------
 mm/vmscan.c                | 11 +++++++
 5 files changed, 144 insertions(+), 39 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2 1/5] mm, memcg: reduce size of struct mem_cgroup by using bit field
  2019-12-24  7:53 [PATCH v2 0/5] protect page cache from freeing inode Yafang Shao
@ 2019-12-24  7:53 ` Yafang Shao
  2019-12-26 21:23   ` Roman Gushchin
  2019-12-24  7:53 ` [PATCH v2 2/5] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case Yafang Shao
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Yafang Shao @ 2019-12-24  7:53 UTC (permalink / raw)
  To: hannes, david, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao

There are some members in struct mem_cgroup that can be either 0 (false)
or 1 (true), so we can define them as bit fields to reduce the size. With
this patch, the size of struct mem_cgroup can be reduced by 64 bytes in
theory, but as there are some MEMCG_PADDING()s, the real number may
differ, depending on the cacheline size. In any case, this patch reduces
the size of struct mem_cgroup more or less.
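
For illustration only (not part of the patch), a minimal userspace sketch
of the same packing effect; the two toy structs and the printed sizes are
made up for this example, only the member names mirror the patch:

#include <stdio.h>

/* One flag per bool/int member, as before the patch. */
struct flags_separate {
	_Bool use_hierarchy;
	_Bool oom_group;
	_Bool oom_lock;
	int   oom_kill_disable;
	_Bool tcpmem_active;
	int   tcpmem_pressure;
};

/* All flags packed into a single word, as after the patch. */
struct flags_packed {
	unsigned int use_hierarchy : 1;
	unsigned int oom_group : 1;
	unsigned int oom_lock : 1;
	unsigned int oom_kill_disable : 1;
	unsigned int tcpmem_active : 1;
	unsigned int tcpmem_pressure : 1;
};

int main(void)
{
	/* Typically prints 16 vs 4 on x86-64; exact values depend on the ABI. */
	printf("%zu %zu\n", sizeof(struct flags_separate),
	       sizeof(struct flags_packed));
	return 0;
}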

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7a0a1a5..612a457 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -229,20 +229,26 @@ struct mem_cgroup {
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
-	bool use_hierarchy;
+	unsigned int use_hierarchy : 1;
 
 	/*
 	 * Should the OOM killer kill all belonging tasks, had it kill one?
 	 */
-	bool oom_group;
+	unsigned int  oom_group : 1;
 
 	/* protected by memcg_oom_lock */
-	bool		oom_lock;
-	int		under_oom;
+	unsigned int oom_lock : 1;
 
-	int	swappiness;
 	/* OOM-Killer disable */
-	int		oom_kill_disable;
+	unsigned int oom_kill_disable : 1;
+
+	/* Legacy tcp memory accounting */
+	unsigned int tcpmem_active : 1;
+	unsigned int tcpmem_pressure : 1;
+
+	int under_oom;
+
+	int	swappiness;
 
 	/* memory.events and memory.events.local */
 	struct cgroup_file events_file;
@@ -297,9 +303,6 @@ struct mem_cgroup {
 
 	unsigned long		socket_pressure;
 
-	/* Legacy tcp memory accounting */
-	bool			tcpmem_active;
-	int			tcpmem_pressure;
 
 #ifdef CONFIG_MEMCG_KMEM
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 2/5] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case
  2019-12-24  7:53 [PATCH v2 0/5] protect page cache from freeing inode Yafang Shao
  2019-12-24  7:53 ` [PATCH v2 1/5] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
@ 2019-12-24  7:53 ` Yafang Shao
  2019-12-26 21:36   ` Roman Gushchin
  2019-12-24  7:53 ` [PATCH v2 3/5] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself Yafang Shao
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Yafang Shao @ 2019-12-24  7:53 UTC (permalink / raw)
  To: hannes, david, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao, Roman Gushchin

If the usage of a memcg is zero, there is no need to do the useless work
of scanning it. This is a minor optimization.

Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 1 +
 mm/memcontrol.c            | 2 +-
 mm/vmscan.c                | 6 ++++++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 612a457..1a315c7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -54,6 +54,7 @@ enum mem_cgroup_protection {
 	MEMCG_PROT_NONE,
 	MEMCG_PROT_LOW,
 	MEMCG_PROT_MIN,
+	MEMCG_PROT_SKIP,	/* For zero usage case */
 };
 
 struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74..f35fcca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6292,7 +6292,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 
 	usage = page_counter_read(&memcg->memory);
 	if (!usage)
-		return MEMCG_PROT_NONE;
+		return MEMCG_PROT_SKIP;
 
 	emin = memcg->memory.min;
 	elow = memcg->memory.low;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a6445e..3c4c2da 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2677,6 +2677,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * thresholds (see get_scan_count).
 			 */
 			break;
+		case MEMCG_PROT_SKIP:
+			/*
+			 * Skip scanning this memcg if the usage of it is
+			 * zero.
+			 */
+			continue;
 		}
 
 		reclaimed = sc->nr_reclaimed;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 3/5] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself
  2019-12-24  7:53 [PATCH v2 0/5] protect page cache from freeing inode Yafang Shao
  2019-12-24  7:53 ` [PATCH v2 1/5] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
  2019-12-24  7:53 ` [PATCH v2 2/5] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case Yafang Shao
@ 2019-12-24  7:53 ` Yafang Shao
  2019-12-26 21:45   ` Roman Gushchin
  2019-12-24  7:53 ` [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function Yafang Shao
  2019-12-24  7:53 ` [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode Yafang Shao
  4 siblings, 1 reply; 36+ messages in thread
From: Yafang Shao @ 2019-12-24  7:53 UTC (permalink / raw)
  To: hannes, david, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao, Chris Down

memory.{emin, elow} are set in mem_cgroup_protected(), and their values
won't change until the next recalculation in this function. After either
or both of them are set, the next reclaimer to reclaim this memcg may be a
different one, e.g. one for which this memcg is the root of the reclaim.
In that case mem_cgroup_protection(), called from get_scan_count(), will
use the stale values to calculate the scan count, which is not correct. We
should reset them to zero in this case.

Here's an example of this issue.

    root_mem_cgroup
         /
        A   memory.max=1024M memory.min=512M memory.current=800M

Once kswapd is woken up, it will try to scan all MEMCGs, including A,
and it will set memory.emin of A to 512M.
After that, A may reach its hard limit (memory.max) and then start
memcg reclaim. Because A is the root of this reclaim, its memory.emin
will not be recalculated. So memory.emin keeps the stale value of 512M,
and that stale value will then be used by mem_cgroup_protection() in
get_scan_count() to compute the scan count. That is not correct.
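
For context, the consumer of the stale value looks roughly like this (a
paraphrase of mem_cgroup_protection() in this kernel, not the exact code):

static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
						  bool in_low_reclaim)
{
	if (in_low_reclaim)
		return memcg->memory.emin;

	/* When A reclaims itself, the stale emin (512M) is still returned. */
	return max(memcg->memory.emin, memcg->memory.elow);
}

get_scan_count() then scales the scan target down in proportion to this
protection, so A is reclaimed less aggressively than it should be while
breaching its own memory.max.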

Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/memcontrol.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f35fcca..2e78931 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6287,8 +6287,17 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 
 	if (!root)
 		root = root_mem_cgroup;
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * Reset memory.(emin, elow) for reclaiming the memcg
+		 * itself.
+		 */
+		if (memcg != root_mem_cgroup) {
+			memcg->memory.emin = 0;
+			memcg->memory.elow = 0;
+		}
 		return MEMCG_PROT_NONE;
+	}
 
 	usage = page_counter_read(&memcg->memory);
 	if (!usage)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2019-12-24  7:53 [PATCH v2 0/5] protect page cache from freeing inode Yafang Shao
                   ` (2 preceding siblings ...)
  2019-12-24  7:53 ` [PATCH v2 3/5] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself Yafang Shao
@ 2019-12-24  7:53 ` Yafang Shao
  2020-01-04  3:35   ` Dave Chinner
  2019-12-24  7:53 ` [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode Yafang Shao
  4 siblings, 1 reply; 36+ messages in thread
From: Yafang Shao @ 2019-12-24  7:53 UTC (permalink / raw)
  To: hannes, david, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao, Dave Chinner

The lru walker isolation function may use the memcg to do something, e.g.
the inode isolation function will use the memcg to do inode protection in
a follow-up patch. So make the memcg visible to the lru walker isolation
function.

One thing that should be emphasized in this patch is that it replaces
for_each_memcg_cache_index() with for_each_mem_cgroup() in
list_lru_walk_node(). There is a gap between these two macros:
for_each_mem_cgroup() depends on CONFIG_MEMCG while the other one depends
on CONFIG_MEMCG_KMEM. But as list_lru_memcg_aware() returns false if
CONFIG_MEMCG_KMEM is not configured, this replacement is safe.
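
For reference, the gap looks roughly like this (paraphrased from
include/linux/memcontrol.h and mm/list_lru.c of this era; not the exact
definitions):

#ifdef CONFIG_MEMCG_KMEM
#define for_each_memcg_cache_index(_idx)	\
	for ((_idx) = 0; (_idx) < memcg_nr_cache_ids; (_idx)++)

static inline bool list_lru_memcg_aware(struct list_lru *lru)
{
	return lru->memcg_aware;
}
#else
#define for_each_memcg_cache_index(_idx)	\
	for (; NULL; )			/* never iterates */

static inline bool list_lru_memcg_aware(struct list_lru *lru)
{
	return false;	/* the per-memcg branch below is never taken */
}
#endif

for_each_mem_cgroup(), in contrast, is available whenever CONFIG_MEMCG is
set; but with CONFIG_MEMCG_KMEM disabled, list_lru_memcg_aware() returns
false and list_lru_walk_node() falls back to the non-memcg path, so the
wider iterator is never reached.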

Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 21 +++++++++++++++++++++
 mm/list_lru.c              | 22 ++++++++++++----------
 mm/memcontrol.c            | 15 ---------------
 3 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a315c7..f36ada9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -449,6 +449,21 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
+/*
+ * Iteration constructs for visiting all cgroups (under a tree).  If
+ * loops are exited prematurely (break), mem_cgroup_iter_break() must
+ * be used for reference counting.
+ */
+#define for_each_mem_cgroup_tree(iter, root)		\
+	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(root, iter, NULL))
+
+#define for_each_mem_cgroup(iter)			\
+	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(NULL, iter, NULL))
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
@@ -949,6 +964,12 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
+#define for_each_mem_cgroup_tree(iter)		\
+	for (iter = NULL; iter; )
+
+#define for_each_mem_cgroup(iter)		\
+	for (iter = NULL; iter; )
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0f1f6b0..536830d 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -207,11 +207,11 @@ unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 EXPORT_SYMBOL_GPL(list_lru_count_node);
 
 static unsigned long
-__list_lru_walk_one(struct list_lru_node *nlru, int memcg_idx,
+__list_lru_walk_one(struct list_lru_node *nlru, struct mem_cgroup *memcg,
 		    list_lru_walk_cb isolate, void *cb_arg,
 		    unsigned long *nr_to_walk)
 {
-
+	int memcg_idx = memcg_cache_id(memcg);
 	struct list_lru_one *l;
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
@@ -273,7 +273,7 @@ unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 	unsigned long ret;
 
 	spin_lock(&nlru->lock);
-	ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg,
+	ret = __list_lru_walk_one(nlru, memcg, isolate, cb_arg,
 				  nr_to_walk);
 	spin_unlock(&nlru->lock);
 	return ret;
@@ -289,7 +289,7 @@ unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 	unsigned long ret;
 
 	spin_lock_irq(&nlru->lock);
-	ret = __list_lru_walk_one(nlru, memcg_cache_id(memcg), isolate, cb_arg,
+	ret = __list_lru_walk_one(nlru, memcg, isolate, cb_arg,
 				  nr_to_walk);
 	spin_unlock_irq(&nlru->lock);
 	return ret;
@@ -299,17 +299,15 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 				 list_lru_walk_cb isolate, void *cb_arg,
 				 unsigned long *nr_to_walk)
 {
+	struct mem_cgroup *memcg;
 	long isolated = 0;
-	int memcg_idx;
 
-	isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
-				      nr_to_walk);
-	if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
-		for_each_memcg_cache_index(memcg_idx) {
+	if (list_lru_memcg_aware(lru)) {
+		for_each_mem_cgroup(memcg) {
 			struct list_lru_node *nlru = &lru->node[nid];
 
 			spin_lock(&nlru->lock);
-			isolated += __list_lru_walk_one(nlru, memcg_idx,
+			isolated += __list_lru_walk_one(nlru, memcg,
 							isolate, cb_arg,
 							nr_to_walk);
 			spin_unlock(&nlru->lock);
@@ -317,7 +315,11 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 			if (*nr_to_walk <= 0)
 				break;
 		}
+	} else {
+		isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
+					      nr_to_walk);
 	}
+
 	return isolated;
 }
 EXPORT_SYMBOL_GPL(list_lru_walk_node);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2e78931..2fc2bf4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -222,21 +222,6 @@ enum res_type {
 /* Used for OOM nofiier */
 #define OOM_CONTROL		(0)
 
-/*
- * Iteration constructs for visiting all cgroups (under a tree).  If
- * loops are exited prematurely (break), mem_cgroup_iter_break() must
- * be used for reference counting.
- */
-#define for_each_mem_cgroup_tree(iter, root)		\
-	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(root, iter, NULL))
-
-#define for_each_mem_cgroup(iter)			\
-	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(NULL, iter, NULL))
-
 static inline bool should_force_charge(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode
  2019-12-24  7:53 [PATCH v2 0/5] protect page cache from freeing inode Yafang Shao
                   ` (3 preceding siblings ...)
  2019-12-24  7:53 ` [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function Yafang Shao
@ 2019-12-24  7:53 ` Yafang Shao
  2019-12-25 13:01     ` kbuild test robot
                     ` (2 more replies)
  4 siblings, 3 replies; 36+ messages in thread
From: Yafang Shao @ 2019-12-24  7:53 UTC (permalink / raw)
  To: hannes, david, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao, Roman Gushchin, Chris Down,
	Dave Chinner

On my server there are some running MEMCGs protected by memory.{min, low},
but I found that the usage of these MEMCGs abruptly became very small, far
less than the protection limit. It confused me, and I finally found that
the cause was inode stealing.

Once an inode is freed, all of its page cache will be dropped as well, no
matter how much page cache it has. So if we intend to protect the page
cache in a memcg, we must protect its host (the inode) first. Otherwise
the memcg protection can easily be bypassed by freeing the inode,
especially if there are big files in this memcg.

Suppose we have a memcg whose stats are:
	memory.current = 1024M
	memory.min = 512M
and in this memcg there is an inode with 800M of page cache.
Once this memcg is scanned by kswapd or another regular reclaimer:
    kswapd <<<< It can be any of the regular reclaimers.
        shrink_node_memcgs
	    switch (mem_cgroup_protected()) <<<< Not protected
		case MEMCG_PROT_NONE:  <<<< Will scan this memcg
			break;
            shrink_lruvec() <<<< Reclaims the page cache
            shrink_slab()   <<<< May free this inode and drop all of its
                                 page cache (800M).
So we must protect the inode first if we want to protect the page cache.
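
Applying the check introduced by this patch (memcg_can_reclaim_inode() in
the diff below) to the numbers above:

	nrpages     = 800M    (page cache attached to the inode)
	protection  = 512M    (memory.min)
	cgroup_size = 1024M   (memory.current)

	nrpages + protection = 1312M > cgroup_size (1024M)
		=> memcg_can_reclaim_inode() returns false
		=> inode_lru_isolate() returns LRU_ROTATE, so the inode and
		   its 800M of page cache survive this reclaim pass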

The inherent mismatch between memcg and inode is troublesome. One inode
can be shared by different MEMCGs, but that is a very rare case. If an
inode is shared, its page cache may be charged to different MEMCGs.
Currently there's no perfect solution for this kind of issue, but the
inode majority-writer ownership switching can help more or less.

Cc: Roman Gushchin <guro@fb.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/inode.c                 | 25 +++++++++++++++++++++++--
 include/linux/memcontrol.h | 11 ++++++++++-
 mm/memcontrol.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |  5 +++++
 4 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fef457a..4f4b2f3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -54,6 +54,13 @@
  *   inode_hash_lock
  */
 
+struct inode_head {
+	struct list_head *freeable;
+#ifdef CONFIG_MEMCG_KMEM
+	struct mem_cgroup *memcg;
+#endif
+};
+
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
@@ -724,8 +731,10 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 static enum lru_status inode_lru_isolate(struct list_head *item,
 		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
 {
-	struct list_head *freeable = arg;
+	struct inode_head *ihead = (struct inode_head *)arg;
+	struct list_head *freeable = ihead->freeable;
 	struct inode	*inode = container_of(item, struct inode, i_lru);
+	struct mem_cgroup *memcg = NULL;
 
 	/*
 	 * we are inverting the lru lock/inode->i_lock here, so use a trylock.
@@ -734,6 +743,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+#ifdef CONFIG_MEMCG_KMEM
+	memcg = ihead->memcg;
+#endif
+	if (memcg && inode->i_data.nrpages &&
+	    !(memcg_can_reclaim_inode(memcg, inode))) {
+		spin_unlock(&inode->i_lock);
+		return LRU_ROTATE;
+	}
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
@@ -789,11 +807,14 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
  */
 long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 {
+	struct inode_head ihead;
 	LIST_HEAD(freeable);
 	long freed;
 
+	ihead.freeable = &freeable;
+	ihead.memcg = sc->memcg;
 	freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
-				     inode_lru_isolate, &freeable);
+				     inode_lru_isolate, &ihead);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f36ada9..d1d4175 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -247,6 +247,9 @@ struct mem_cgroup {
 	unsigned int tcpmem_active : 1;
 	unsigned int tcpmem_pressure : 1;
 
+	/* Soft protection will be ignored if it's true */
+	unsigned int in_low_reclaim : 1;
+
 	int under_oom;
 
 	int	swappiness;
@@ -363,7 +366,7 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg);
-
+bool memcg_can_reclaim_inode(struct mem_cgroup *memcg, struct inode *inode);
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
 			  bool compound);
@@ -865,6 +868,12 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(
 	return MEMCG_PROT_NONE;
 }
 
+static inline bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
+					   struct inode *memcg)
+{
+	return true;
+}
+
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
 					struct mem_cgroup **memcgp,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2fc2bf4..c3498fd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6340,6 +6340,49 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 }
 
 /**
+ * Once an inode is freed, all its belonging page caches will be dropped as
+ * well, even if there're lots of page caches. So if we intend to protect
+ * page caches in a memcg, we must protect their host(the inode) first.
+ * Otherwise the memcg protection can be easily bypassed with freeing inode,
+ * especially if there're big files in this memcg.
+ * Note that it may happen that the page caches are already charged to the
+ * memcg, but the inode hasn't been added to this memcg yet. In this case,
+ * this inode is not protected.
+ * The inherent mismatch between memcg and inode is a trouble. One inode
+ * can be shared by different MEMCGs, but it is a very rare case. If
+ * an inode is shared, its belonging page caches may be charged to
+ * different MEMCGs. Currently there's no perfect solution to fix this
+ * kind of issue, but the inode majority-writer ownership switching can
+ * help it more or less.
+ */
+bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
+			     struct inode *inode)
+{
+	unsigned long cgroup_size;
+	unsigned long protection;
+	bool reclaimable = true;
+
+	if (memcg == root_mem_cgroup)
+		goto out;
+
+	protection = mem_cgroup_protection(memcg, memcg->in_low_reclaim);
+	if (!protection)
+		goto out;
+
+	/*
+	 * Don't protect this inode if the usage of this memcg is still
+	 * above the protection after reclaiming this inode and all its
+	 * belonging page caches.
+	 */
+	cgroup_size = mem_cgroup_size(memcg);
+	if (inode->i_data.nrpages + protection > cgroup_size)
+		reclaimable = false;
+
+out:
+	return reclaimable;
+}
+
+/**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
  * @mm: mm context of the victim
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3c4c2da..ecc5c1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2666,6 +2666,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				sc->memcg_low_skipped = 1;
 				continue;
 			}
+
+			memcg->in_low_reclaim = 1;
 			memcg_memory_event(memcg, MEMCG_LOW);
 			break;
 		case MEMCG_PROT_NONE:
@@ -2693,6 +2695,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
 			    sc->priority);
 
+		if (memcg->in_low_reclaim)
+			memcg->in_low_reclaim = 0;
+
 		/* Record the group's reclaim efficiency */
 		vmpressure(sc->gfp_mask, memcg, false,
 			   sc->nr_scanned - scanned,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode
  2019-12-24  7:53 ` [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode Yafang Shao
@ 2019-12-25 13:01     ` kbuild test robot
  2019-12-25 13:18     ` kbuild test robot
  2020-01-04  3:55   ` Dave Chinner
  2 siblings, 0 replies; 36+ messages in thread
From: kbuild test robot @ 2019-12-25 13:01 UTC (permalink / raw)
  To: Yafang Shao
  Cc: kbuild-all, hannes, david, mhocko, vdavydov.dev, akpm, viro,
	linux-mm, linux-fsdevel, Yafang Shao, Roman Gushchin, Chris Down,
	Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 4639 bytes --]

Hi Yafang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.5-rc3 next-20191219]
[cannot apply to mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Yafang-Shao/protect-page-cache-from-freeing-inode/20191225-193636
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 46cf053efec6a3a5f343fead837777efe8252a46
config: um-x86_64_defconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-3) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=um SUBARCH=x86_64

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/linux/swap.h:9:0,
                    from mm/vmscan.c:22:
   include/linux/memcontrol.h:872:23: error: conflicting types for 'memcg'
            struct inode *memcg)
                          ^~~~~
   include/linux/memcontrol.h:871:63: note: previous definition of 'memcg' was here
    static inline bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
                                                                  ^~~~~
   mm/vmscan.c: In function 'shrink_node_memcgs':
>> mm/vmscan.c:2670:9: error: dereferencing pointer to incomplete type 'struct mem_cgroup'
       memcg->in_low_reclaim = 1;
            ^~
--
   In file included from include/linux/swap.h:9:0,
                    from fs/inode.c:11:
   include/linux/memcontrol.h:872:23: error: conflicting types for 'memcg'
            struct inode *memcg)
                          ^~~~~
   include/linux/memcontrol.h:871:63: note: previous definition of 'memcg' was here
    static inline bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
                                                                  ^~~~~
   fs/inode.c: In function 'prune_icache_sb':
>> fs/inode.c:822:7: error: 'struct inode_head' has no member named 'memcg'
     ihead.memcg = sc->memcg;
          ^

vim +2670 mm/vmscan.c

  2639	
  2640	static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
  2641	{
  2642		struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
  2643		struct mem_cgroup *memcg;
  2644	
  2645		memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
  2646		do {
  2647			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
  2648			unsigned long reclaimed;
  2649			unsigned long scanned;
  2650	
  2651			switch (mem_cgroup_protected(target_memcg, memcg)) {
  2652			case MEMCG_PROT_MIN:
  2653				/*
  2654				 * Hard protection.
  2655				 * If there is no reclaimable memory, OOM.
  2656				 */
  2657				continue;
  2658			case MEMCG_PROT_LOW:
  2659				/*
  2660				 * Soft protection.
  2661				 * Respect the protection only as long as
  2662				 * there is an unprotected supply
  2663				 * of reclaimable memory from other cgroups.
  2664				 */
  2665				if (!sc->memcg_low_reclaim) {
  2666					sc->memcg_low_skipped = 1;
  2667					continue;
  2668				}
  2669	
> 2670				memcg->in_low_reclaim = 1;
  2671				memcg_memory_event(memcg, MEMCG_LOW);
  2672				break;
  2673			case MEMCG_PROT_NONE:
  2674				/*
  2675				 * All protection thresholds breached. We may
  2676				 * still choose to vary the scan pressure
  2677				 * applied based on by how much the cgroup in
  2678				 * question has exceeded its protection
  2679				 * thresholds (see get_scan_count).
  2680				 */
  2681				break;
  2682			case MEMCG_PROT_SKIP:
  2683				/*
  2684				 * Skip scanning this memcg if the usage of it is
  2685				 * zero.
  2686				 */
  2687				continue;
  2688			}
  2689	
  2690			reclaimed = sc->nr_reclaimed;
  2691			scanned = sc->nr_scanned;
  2692	
  2693			shrink_lruvec(lruvec, sc);
  2694	
  2695			shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
  2696				    sc->priority);
  2697	
  2698			if (memcg->in_low_reclaim)
  2699				memcg->in_low_reclaim = 0;
  2700	
  2701			/* Record the group's reclaim efficiency */
  2702			vmpressure(sc->gfp_mask, memcg, false,
  2703				   sc->nr_scanned - scanned,
  2704				   sc->nr_reclaimed - reclaimed);
  2705	
  2706		} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
  2707	}
  2708	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 8440 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode
  2019-12-24  7:53 ` [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode Yafang Shao
@ 2019-12-25 13:18     ` kbuild test robot
  2019-12-25 13:18     ` kbuild test robot
  2020-01-04  3:55   ` Dave Chinner
  2 siblings, 0 replies; 36+ messages in thread
From: kbuild test robot @ 2019-12-25 13:18 UTC (permalink / raw)
  To: Yafang Shao
  Cc: kbuild-all, hannes, david, mhocko, vdavydov.dev, akpm, viro,
	linux-mm, linux-fsdevel, Yafang Shao, Roman Gushchin, Chris Down,
	Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 2165 bytes --]

Hi Yafang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.5-rc3 next-20191220]
[cannot apply to mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Yafang-Shao/protect-page-cache-from-freeing-inode/20191225-193636
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 46cf053efec6a3a5f343fead837777efe8252a46
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-3) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from include/linux/swap.h:9:0,
                    from include/linux/suspend.h:5,
                    from arch/x86/kernel/asm-offsets.c:13:
>> include/linux/memcontrol.h:872:23: error: conflicting types for 'memcg'
            struct inode *memcg)
                          ^~~~~
   include/linux/memcontrol.h:871:63: note: previous definition of 'memcg' was here
    static inline bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
                                                                  ^~~~~
   make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2
   9 real  4 user  3 sys  93.40% cpu 	make prepare

vim +/memcg +872 include/linux/memcontrol.h

   870	
   871	static inline bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
 > 872						   struct inode *memcg)
   873	{
   874		return true;
   875	}
   876	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 7283 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode
  2019-12-25 13:18     ` kbuild test robot
@ 2019-12-26  5:09       ` Yafang Shao
  -1 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2019-12-26  5:09 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Johannes Weiner, Dave Chinner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Al Viro, Linux MM,
	linux-fsdevel, Roman Gushchin, Chris Down, Dave Chinner

On Wed, Dec 25, 2019 at 9:19 PM kbuild test robot <lkp@intel.com> wrote:
>
> Hi Yafang,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linus/master]
> [also build test ERROR on v5.5-rc3 next-20191220]
> [cannot apply to mmotm/master]
> [if your patch is applied to the wrong git tree, please drop us a note to help
> improve the system. BTW, we also suggest to use '--base' option to specify the
> base tree in git format-patch, please see https://stackoverflow.com/a/37406982]
>
> url:    https://github.com/0day-ci/linux/commits/Yafang-Shao/protect-page-cache-from-freeing-inode/20191225-193636
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 46cf053efec6a3a5f343fead837777efe8252a46
> config: i386-tinyconfig (attached as .config)
> compiler: gcc-7 (Debian 7.5.0-3) 7.5.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>    In file included from include/linux/swap.h:9:0,
>                     from include/linux/suspend.h:5,
>                     from arch/x86/kernel/asm-offsets.c:13:
> >> include/linux/memcontrol.h:872:23: error: conflicting types for 'memcg'
>             struct inode *memcg)
>                           ^~~~~
>    include/linux/memcontrol.h:871:63: note: previous definition of 'memcg' was here
>     static inline bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
>                                                                   ^~~~~
>    make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
>    make[2]: Target '__build' not remade because of errors.
>    make[1]: *** [prepare0] Error 2
>    make[1]: Target 'prepare' not remade because of errors.
>    make: *** [sub-make] Error 2
>    9 real  4 user  3 sys  93.40% cpu    make prepare
>
> vim +/memcg +872 include/linux/memcontrol.h
>
>    870
>    871  static inline bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
>  > 872                                             struct inode *memcg)
>    873  {
>    874          return true;
>    875  }
>    876
>

Will fix this build error (when CONFIG_MEMCG_KMEM is not set) in the next
version. Thanks, kbuild test robot.

Thanks
Yafang

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/5] mm, memcg: reduce size of struct mem_cgroup by using bit field
  2019-12-24  7:53 ` [PATCH v2 1/5] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
@ 2019-12-26 21:23   ` Roman Gushchin
  2019-12-27  1:03       ` Yafang Shao
  0 siblings, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2019-12-26 21:23 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, david, mhocko, vdavydov.dev, akpm, viro, linux-mm, linux-fsdevel

On Tue, Dec 24, 2019 at 02:53:22AM -0500, Yafang Shao wrote:
> There are some members in struct mem_group can be either 0(false) or
> 1(true), so we can define them using bit field to reduce size. With this
> patch, the size of struct mem_cgroup can be reduced by 64 bytes in theory,
> but as there're some MEMCG_PADDING()s, the real number may be different,
> which is relate with the cacheline size. Anyway, this patch could reduce
> the size of struct mem_cgroup more or less.

It seems it's not really related to the rest of the patchset, isn't it?

Can you, please, post it separately?

Also, I'd move the tcp-related stuff up, so that all oom-related fields
would be together.

Otherwise lgtm.

Thanks!

> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/memcontrol.h | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a7a0a1a5..612a457 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -229,20 +229,26 @@ struct mem_cgroup {
>  	/*
>  	 * Should the accounting and control be hierarchical, per subtree?
>  	 */
> -	bool use_hierarchy;
> +	unsigned int use_hierarchy : 1;
>  
>  	/*
>  	 * Should the OOM killer kill all belonging tasks, had it kill one?
>  	 */
> -	bool oom_group;
> +	unsigned int  oom_group : 1;
>  
>  	/* protected by memcg_oom_lock */
> -	bool		oom_lock;
> -	int		under_oom;
> +	unsigned int oom_lock : 1;
>  
> -	int	swappiness;
>  	/* OOM-Killer disable */
> -	int		oom_kill_disable;
> +	unsigned int oom_kill_disable : 1;
> +
> +	/* Legacy tcp memory accounting */
> +	unsigned int tcpmem_active : 1;
> +	unsigned int tcpmem_pressure : 1;
> +
> +	int under_oom;
> +
> +	int	swappiness;
>  
>  	/* memory.events and memory.events.local */
>  	struct cgroup_file events_file;
> @@ -297,9 +303,6 @@ struct mem_cgroup {
>  
>  	unsigned long		socket_pressure;
>  
> -	/* Legacy tcp memory accounting */
> -	bool			tcpmem_active;
> -	int			tcpmem_pressure;
>  
>  #ifdef CONFIG_MEMCG_KMEM
>          /* Index in the kmem_cache->memcg_params.memcg_caches array */
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/5] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case
  2019-12-24  7:53 ` [PATCH v2 2/5] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case Yafang Shao
@ 2019-12-26 21:36   ` Roman Gushchin
  2019-12-27  1:09       ` Yafang Shao
  0 siblings, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2019-12-26 21:36 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, david, mhocko, vdavydov.dev, akpm, viro, linux-mm, linux-fsdevel

On Tue, Dec 24, 2019 at 02:53:23AM -0500, Yafang Shao wrote:
> If the usage of a memcg is zero, we don't need to do useless work to scan
> it. That is a minor optimization.

The optimization isn't really related to the main idea of the patchset,
so I'd prefer to treat it separately.

> 
> Cc: Roman Gushchin <guro@fb.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/memcontrol.h | 1 +
>  mm/memcontrol.c            | 2 +-
>  mm/vmscan.c                | 6 ++++++
>  3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 612a457..1a315c7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -54,6 +54,7 @@ enum mem_cgroup_protection {
>  	MEMCG_PROT_NONE,
>  	MEMCG_PROT_LOW,
>  	MEMCG_PROT_MIN,
> +	MEMCG_PROT_SKIP,	/* For zero usage case */
>  };
>  
>  struct mem_cgroup_reclaim_cookie {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c5b5f74..f35fcca 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6292,7 +6292,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  
>  	usage = page_counter_read(&memcg->memory);
>  	if (!usage)
> -		return MEMCG_PROT_NONE;
> +		return MEMCG_PROT_SKIP;

I'm concerned that it might lead to a regression with the scraping of
last pages from a memcg. Charge is batched using percpu stocks, so the
value of the page counter is approximate. Skipping the cgroup entirely
we're losing all chances to reclaim these few pages.

Idk how serious the problem could be in the real life, and maybe it's OK
to skip if the cgroup is online, but I'd triple check here.

Also, because this optimization isn't really related to protection,
why not check the page counter first, e.g.:

	memcg = mem_cgroup_iter(root, NULL, NULL);
 	do {
		unsigned long reclaimed;
		unsigned long scanned;

		if (!page_counter_read(&memcg->memory))
			continue;

		switch (mem_cgroup_protected(root, memcg)) {
		case MEMCG_PROT_MIN:
			/*
			 * Hard protection.
			 * If there is no reclaimable memory, OOM.
			 */
			continue;
		case MEMCG_PROT_LOW:

--

Thank you!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 3/5] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself
  2019-12-24  7:53 ` [PATCH v2 3/5] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself Yafang Shao
@ 2019-12-26 21:45   ` Roman Gushchin
  2019-12-27  1:11       ` Yafang Shao
  0 siblings, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2019-12-26 21:45 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, david, mhocko, vdavydov.dev, akpm, viro, linux-mm,
	linux-fsdevel, Chris Down

On Tue, Dec 24, 2019 at 02:53:24AM -0500, Yafang Shao wrote:
> memory.{emin, elow} are set in mem_cgroup_protected(), and the values of
> them won't be changed until next recalculation in this function. After
> either or both of them are set, the next reclaimer to relcaim this memcg
> may be a different reclaimer, e.g. this memcg is also the root memcg of
> the new reclaimer, and then in mem_cgroup_protection() in get_scan_count()
> the old values of them will be used to calculate scan count, that is not
> proper. We should reset them to zero in this case.
> 
> Here's an example of this issue.
> 
>     root_mem_cgroup
>          /
>         A   memory.max=1024M memory.min=512M memory.current=800M
> 
> Once kswapd is waked up, it will try to scan all MEMCGs, including
> this A, and it will assign memory.emin of A with 512M.
> After that, A may reach its hard limit(memory.max), and then it will
> do memcg reclaim. Because A is the root of this reclaimer, so it will
> not calculate its memory.emin. So the memory.emin is the old value
> 512M, and then this old value will be used in
> mem_cgroup_protection() in get_scan_count() to get the scan count.
> That is not proper.

Good catch!

But it seems to be a bug introduced with the implementation of the proportional
reclaim. So I'd remove it from the patchset, add the "Fixes" tag and cc stable@.
Then it will have chances to be backported to stable trees.

Thank you!

> 
> Cc: Chris Down <chris@chrisdown.name>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/memcontrol.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f35fcca..2e78931 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6287,8 +6287,17 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  
>  	if (!root)
>  		root = root_mem_cgroup;
> -	if (memcg == root)
> +	if (memcg == root) {
> +		/*
> +		 * Reset memory.(emin, elow) for reclaiming the memcg
> +		 * itself.
> +		 */
> +		if (memcg != root_mem_cgroup) {
> +			memcg->memory.emin = 0;
> +			memcg->memory.elow = 0;
> +		}
>  		return MEMCG_PROT_NONE;
> +	}
>  
>  	usage = page_counter_read(&memcg->memory);
>  	if (!usage)
> -- 
> 1.8.3.1
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/5] mm, memcg: reduce size of struct mem_cgroup by using bit field
  2019-12-26 21:23   ` Roman Gushchin
@ 2019-12-27  1:03       ` Yafang Shao
  0 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2019-12-27  1:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: hannes, david, mhocko, vdavydov.dev, akpm, viro, linux-mm, linux-fsdevel

On Fri, Dec 27, 2019 at 5:23 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Dec 24, 2019 at 02:53:22AM -0500, Yafang Shao wrote:
> > There are some members in struct mem_group can be either 0(false) or
> > 1(true), so we can define them using bit field to reduce size. With this
> > patch, the size of struct mem_cgroup can be reduced by 64 bytes in theory,
> > but as there're some MEMCG_PADDING()s, the real number may be different,
> > which is relate with the cacheline size. Anyway, this patch could reduce
> > the size of struct mem_cgroup more or less.
>
> It seems it's not really related to the rest of the patchset, isn't it?
>

Right.

> Can you, please, post it separately?
>

Sure. Will post it separately.

> Also, I'd move the tcp-related stuff up, so that all oom-related fields
> would be together.

OK. Thanks for your comment.

> Otherwise lgtm.
>
> Thanks!
>
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/memcontrol.h | 21 ++++++++++++---------
> >  1 file changed, 12 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index a7a0a1a5..612a457 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -229,20 +229,26 @@ struct mem_cgroup {
> >       /*
> >        * Should the accounting and control be hierarchical, per subtree?
> >        */
> > -     bool use_hierarchy;
> > +     unsigned int use_hierarchy : 1;
> >
> >       /*
> >        * Should the OOM killer kill all belonging tasks, had it kill one?
> >        */
> > -     bool oom_group;
> > +     unsigned int  oom_group : 1;
> >
> >       /* protected by memcg_oom_lock */
> > -     bool            oom_lock;
> > -     int             under_oom;
> > +     unsigned int oom_lock : 1;
> >
> > -     int     swappiness;
> >       /* OOM-Killer disable */
> > -     int             oom_kill_disable;
> > +     unsigned int oom_kill_disable : 1;
> > +
> > +     /* Legacy tcp memory accounting */
> > +     unsigned int tcpmem_active : 1;
> > +     unsigned int tcpmem_pressure : 1;
> > +
> > +     int under_oom;
> > +
> > +     int     swappiness;
> >
> >       /* memory.events and memory.events.local */
> >       struct cgroup_file events_file;
> > @@ -297,9 +303,6 @@ struct mem_cgroup {
> >
> >       unsigned long           socket_pressure;
> >
> > -     /* Legacy tcp memory accounting */
> > -     bool                    tcpmem_active;
> > -     int                     tcpmem_pressure;
> >
> >  #ifdef CONFIG_MEMCG_KMEM
> >          /* Index in the kmem_cache->memcg_params.memcg_caches array */
> > --
> > 1.8.3.1
> >
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/5] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case
  2019-12-26 21:36   ` Roman Gushchin
@ 2019-12-27  1:09       ` Yafang Shao
  0 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2019-12-27  1:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: hannes, david, mhocko, vdavydov.dev, akpm, viro, linux-mm, linux-fsdevel

On Fri, Dec 27, 2019 at 5:36 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Dec 24, 2019 at 02:53:23AM -0500, Yafang Shao wrote:
> > If the usage of a memcg is zero, we don't need to do useless work to scan
> > it. That is a minor optimization.
>
> The optimization isn't really related to the main idea of the patchset,
> so I'd prefer to treat it separately.
>

Sure.

> >
> > Cc: Roman Gushchin <guro@fb.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/memcontrol.h | 1 +
> >  mm/memcontrol.c            | 2 +-
> >  mm/vmscan.c                | 6 ++++++
> >  3 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 612a457..1a315c7 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -54,6 +54,7 @@ enum mem_cgroup_protection {
> >       MEMCG_PROT_NONE,
> >       MEMCG_PROT_LOW,
> >       MEMCG_PROT_MIN,
> > +     MEMCG_PROT_SKIP,        /* For zero usage case */
> >  };
> >
> >  struct mem_cgroup_reclaim_cookie {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c5b5f74..f35fcca 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6292,7 +6292,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> >
> >       usage = page_counter_read(&memcg->memory);
> >       if (!usage)
> > -             return MEMCG_PROT_NONE;
> > +             return MEMCG_PROT_SKIP;
>
> I'm concerned that it might lead to a regression with the scraping of
> last pages from a memcg. Charge is batched using percpu stocks, so the
> value of the page counter is approximate. Skipping the cgroup entirely
> we're losing all chances to reclaim these few pages.
>

Agreed. It may lose the chance to reclaim these last few pages.
I will think about it.

> Idk how serious the problem could be in the real life, and maybe it's OK
> to skip if the cgroup is online, but I'd triple check here.
>
> Also, because this optimization isn't really related to protection,
> why not check the page counter first, e.g.:
>
>         memcg = mem_cgroup_iter(root, NULL, NULL);
>         do {
>                 unsigned long reclaimed;
>                 unsigned long scanned;
>
>                 if (!page_counter_read(&memcg->memory))
>                         continue;
>

Seems better. Thanks for your suggestion.

>                 switch (mem_cgroup_protected(root, memcg)) {
>                 case MEMCG_PROT_MIN:
>                         /*
>                          * Hard protection.
>                          * If there is no reclaimable memory, OOM.
>                          */
>                         continue;
>                 case MEMCG_PROT_LOW:
>
> --
>
> Thank you!

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 3/5] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself
  2019-12-26 21:45   ` Roman Gushchin
@ 2019-12-27  1:11       ` Yafang Shao
  0 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2019-12-27  1:11 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: hannes, david, mhocko, vdavydov.dev, akpm, viro, linux-mm,
	linux-fsdevel, Chris Down

On Fri, Dec 27, 2019 at 5:46 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Dec 24, 2019 at 02:53:24AM -0500, Yafang Shao wrote:
> > memory.{emin, elow} are set in mem_cgroup_protected(), and the values of
> > them won't be changed until next recalculation in this function. After
> > either or both of them are set, the next reclaimer to relcaim this memcg
> > may be a different reclaimer, e.g. this memcg is also the root memcg of
> > the new reclaimer, and then in mem_cgroup_protection() in get_scan_count()
> > the old values of them will be used to calculate scan count, that is not
> > proper. We should reset them to zero in this case.
> >
> > Here's an example of this issue.
> >
> >     root_mem_cgroup
> >          /
> >         A   memory.max=1024M memory.min=512M memory.current=800M
> >
> > Once kswapd is waked up, it will try to scan all MEMCGs, including
> > this A, and it will assign memory.emin of A with 512M.
> > After that, A may reach its hard limit(memory.max), and then it will
> > do memcg reclaim. Because A is the root of this reclaimer, so it will
> > not calculate its memory.emin. So the memory.emin is the old value
> > 512M, and then this old value will be used in
> > mem_cgroup_protection() in get_scan_count() to get the scan count.
> > That is not proper.
>
> Good catch!
>
> But it seems to be a bug introduced with the implementation of the proportional
> reclaim. So I'd remove it from the patchset, add the "Fixes" tag and cc stable@.
> Then it will have chances to be backported to stable trees.
>

Sure, will do it.

Thanks!

>
> >
> > Cc: Chris Down <chris@chrisdown.name>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  mm/memcontrol.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f35fcca..2e78931 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6287,8 +6287,17 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> >
> >       if (!root)
> >               root = root_mem_cgroup;
> > -     if (memcg == root)
> > +     if (memcg == root) {
> > +             /*
> > +              * Reset memory.(emin, elow) for reclaiming the memcg
> > +              * itself.
> > +              */
> > +             if (memcg != root_mem_cgroup) {
> > +                     memcg->memory.emin = 0;
> > +                     memcg->memory.elow = 0;
> > +             }
> >               return MEMCG_PROT_NONE;
> > +     }
> >
> >       usage = page_counter_read(&memcg->memory);
> >       if (!usage)
> > --
> > 1.8.3.1
> >
> >

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2019-12-24  7:53 ` [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function Yafang Shao
@ 2020-01-04  3:35   ` Dave Chinner
  2020-01-04  7:26       ` Yafang Shao
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2020-01-04  3:35 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, mhocko, vdavydov.dev, akpm, viro, linux-mm,
	linux-fsdevel, Dave Chinner

On Tue, Dec 24, 2019 at 02:53:25AM -0500, Yafang Shao wrote:
> The lru walker isolation function may use this memcg to do something, e.g.
> the inode isolatation function will use the memcg to do inode protection in
> followup patch. So make memcg visible to the lru walker isolation function.
> 
> Something should be emphasized in this patch is it replaces
> for_each_memcg_cache_index() with for_each_mem_cgroup() in
> list_lru_walk_node(). Because there's a gap between these two MACROs that
> for_each_mem_cgroup() depends on CONFIG_MEMCG while the other one depends
> on CONFIG_MEMCG_KMEM. But as list_lru_memcg_aware() returns false if
> CONFIG_MEMCG_KMEM is not configured, it is safe to this replacement.
> 
> Cc: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

....

> @@ -299,17 +299,15 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
>  				 list_lru_walk_cb isolate, void *cb_arg,
>  				 unsigned long *nr_to_walk)
>  {
> +	struct mem_cgroup *memcg;
>  	long isolated = 0;
> -	int memcg_idx;
>  
> -	isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> -				      nr_to_walk);
> -	if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
> -		for_each_memcg_cache_index(memcg_idx) {
> +	if (list_lru_memcg_aware(lru)) {
> +		for_each_mem_cgroup(memcg) {
>  			struct list_lru_node *nlru = &lru->node[nid];
>  
>  			spin_lock(&nlru->lock);
> -			isolated += __list_lru_walk_one(nlru, memcg_idx,
> +			isolated += __list_lru_walk_one(nlru, memcg,
>  							isolate, cb_arg,
>  							nr_to_walk);
>  			spin_unlock(&nlru->lock);
> @@ -317,7 +315,11 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
>  			if (*nr_to_walk <= 0)
>  				break;
>  		}
> +	} else {
> +		isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> +					      nr_to_walk);
>  	}
> +

That's a change of behaviour. The old code always runs per-node
reclaim, then if the LRU is memcg aware it also runs the memcg
aware reclaim. The new code never runs global per-node reclaim
if the list is memcg aware, so shrinkers that are initialised
with the flags SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE seem
likely to have reclaim problems with mixed memcg/global memory
pressure scenarios.

e.g. if all the memory is in the per-node lists, and the memcg needs
to reclaim memory because of a global shortage, it is now unable to
reclaim global memory.....
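
The superblock shrinker is the obvious example of such a shrinker -
roughly what alloc_super() sets up (paraphrasing fs/super.c from
memory, so treat this as a sketch):

        s->s_shrink.count_objects = super_cache_count;
        s->s_shrink.scan_objects = super_cache_scan;
        s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;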

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode
  2019-12-24  7:53 ` [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode Yafang Shao
  2019-12-25 13:01     ` kbuild test robot
  2019-12-25 13:18     ` kbuild test robot
@ 2020-01-04  3:55   ` Dave Chinner
  2020-01-04  7:42       ` Yafang Shao
  2 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2020-01-04  3:55 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, mhocko, vdavydov.dev, akpm, viro, linux-mm,
	linux-fsdevel, Roman Gushchin, Chris Down, Dave Chinner

On Tue, Dec 24, 2019 at 02:53:26AM -0500, Yafang Shao wrote:
> On my server there're some running MEMCGs protected by memory.{min, low},
> but I found the usage of these MEMCGs abruptly became very small, which
> were far less than the protect limit. It confused me and finally I
> found that was because of inode stealing.
> Once an inode is freed, all its belonging page caches will be dropped as
> well, no matter how may page caches it has. So if we intend to protect the
> page caches in a memcg, we must protect their host (the inode) first.
> Otherwise the memcg protection can be easily bypassed with freeing inode,
> especially if there're big files in this memcg.
> 
> Supposes we have a memcg, and the stat of this memcg is,
> 	memory.current = 1024M
> 	memory.min = 512M
> And in this memcg there's a inode with 800M page caches.
> Once this memcg is scanned by kswapd or other regular reclaimers,
>     kswapd <<<< It can be either of the regular reclaimers.
>         shrink_node_memcgs
> 	    switch (mem_cgroup_protected()) <<<< Not protected
> 		case MEMCG_PROT_NONE:  <<<< Will scan this memcg
> 			beak;
>             shrink_lruvec() <<<< Reclaim the page caches
>             shrink_slab()   <<<< It may free this inode and drop all its
>                                  page caches(800M).
> So we must protect the inode first if we want to protect page caches.
> 
> The inherent mismatch between memcg and inode is a trouble. One inode can
> be shared by different MEMCGs, but it is a very rare case. If an inode is
> shared, its belonging page caches may be charged to different MEMCGs.
> Currently there's no perfect solution to fix this kind of issue, but the
> inode majority-writer ownership switching can help it more or less.

There's multiple changes in this patch set. Yes, there's some inode
cache futzing to deal with memcgs, but it also adds some weird
undocumented "in low reclaim" heuristic that does something magical
with "protection" that you don't describe or document at all. PLease
separate that out into a new patch and document it clearly (both in
the commit message and the code, please).

> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a..4f4b2f3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -54,6 +54,13 @@
>   *   inode_hash_lock
>   */
>  
> +struct inode_head {

not an "inode head" structure. They are inode isolation arguments.
i.e. struct inode_isolation_args {...};

> +	struct list_head *freeable;
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct mem_cgroup *memcg;
> +#endif
> +};

These defines are awful, too, and completely unnecessary, because
the struct shrink_control unconditionally defines sc->memcg and
so we can just code it throughout without caring whether memcgs are
enabled or not.
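
i.e. something like this (a sketch only; the field comments are just
how I'd expect them to be used):

        struct inode_isolation_args {
                struct list_head *freeable;     /* inodes to hand to dispose_list() */
                struct mem_cgroup *memcg;       /* from sc->memcg, may be NULL */
        };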

> +
>  static unsigned int i_hash_mask __read_mostly;
>  static unsigned int i_hash_shift __read_mostly;
>  static struct hlist_head *inode_hashtable __read_mostly;
> @@ -724,8 +731,10 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  static enum lru_status inode_lru_isolate(struct list_head *item,
>  		struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
>  {
> -	struct list_head *freeable = arg;
> +	struct inode_head *ihead = (struct inode_head *)arg;

No need for a cast of a void *.

> +	struct list_head *freeable = ihead->freeable;
>  	struct inode	*inode = container_of(item, struct inode, i_lru);
> +	struct mem_cgroup *memcg = NULL;

	struct inode_isolation_args *iargs = arg;
	struct list_head *freeable = iargs->freeable;
	struct mem_cgroup *memcg = iargs->memcg;
	struct inode	*inode = container_of(item, struct inode, i_lru);

> @@ -734,6 +743,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  	if (!spin_trylock(&inode->i_lock))
>  		return LRU_SKIP;
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	memcg = ihead->memcg;
> +#endif

This is no longer necessary.

> +	if (memcg && inode->i_data.nrpages &&
> +	    !(memcg_can_reclaim_inode(memcg, inode))) {
> +		spin_unlock(&inode->i_lock);
> +		return LRU_ROTATE;
> +	}

I'd argue that both the memcg and the inode->i_data.nrpages check
should be inside memcg_can_reclaim_inode(), that way this entire
chunk of code disappears when CONFIG_MEMCG_KMEM=N.
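
Something along these lines (a sketch only - the CONFIG_MEMCG_KMEM=y
side is whatever check the patch already implements, it just also
returns true for the !memcg and !inode->i_data.nrpages cases):

        #ifdef CONFIG_MEMCG_KMEM
        bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
                                     struct inode *inode);
        #else
        static inline bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
                                                   struct inode *inode)
        {
                return true;
        }
        #endif

and the isolate callback then becomes simply:

        if (!memcg_can_reclaim_inode(memcg, inode)) {
                spin_unlock(&inode->i_lock);
                return LRU_ROTATE;
        }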

> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f36ada9..d1d4175 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -247,6 +247,9 @@ struct mem_cgroup {
>  	unsigned int tcpmem_active : 1;
>  	unsigned int tcpmem_pressure : 1;
>  
> +	/* Soft protection will be ignored if it's true */
> +	unsigned int in_low_reclaim : 1;

This is the stuff that has nothing to do with the changes to the
inode reclaim shrinker...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2020-01-04  3:35   ` Dave Chinner
@ 2020-01-04  7:26       ` Yafang Shao
  0 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2020-01-04  7:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Dave Chinner

On Sat, Jan 4, 2020 at 11:36 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 24, 2019 at 02:53:25AM -0500, Yafang Shao wrote:
> > The lru walker isolation function may use this memcg to do something, e.g.
> > the inode isolatation function will use the memcg to do inode protection in
> > followup patch. So make memcg visible to the lru walker isolation function.
> >
> > Something should be emphasized in this patch is it replaces
> > for_each_memcg_cache_index() with for_each_mem_cgroup() in
> > list_lru_walk_node(). Because there's a gap between these two MACROs that
> > for_each_mem_cgroup() depends on CONFIG_MEMCG while the other one depends
> > on CONFIG_MEMCG_KMEM. But as list_lru_memcg_aware() returns false if
> > CONFIG_MEMCG_KMEM is not configured, it is safe to this replacement.
> >
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>
> ....
>
> > @@ -299,17 +299,15 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> >                                list_lru_walk_cb isolate, void *cb_arg,
> >                                unsigned long *nr_to_walk)
> >  {
> > +     struct mem_cgroup *memcg;
> >       long isolated = 0;
> > -     int memcg_idx;
> >
> > -     isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > -                                   nr_to_walk);
> > -     if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
> > -             for_each_memcg_cache_index(memcg_idx) {
> > +     if (list_lru_memcg_aware(lru)) {
> > +             for_each_mem_cgroup(memcg) {
> >                       struct list_lru_node *nlru = &lru->node[nid];
> >
> >                       spin_lock(&nlru->lock);
> > -                     isolated += __list_lru_walk_one(nlru, memcg_idx,
> > +                     isolated += __list_lru_walk_one(nlru, memcg,
> >                                                       isolate, cb_arg,
> >                                                       nr_to_walk);
> >                       spin_unlock(&nlru->lock);
> > @@ -317,7 +315,11 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> >                       if (*nr_to_walk <= 0)
> >                               break;
> >               }
> > +     } else {
> > +             isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > +                                           nr_to_walk);
> >       }
> > +
>
> That's a change of behaviour. The old code always runs per-node
> reclaim, then if the LRU is memcg aware it also runs the memcg
> aware reclaim. The new code never runs global per-node reclaim
> if the list is memcg aware, so shrinkers that are initialised
> with the flags SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE seem
> likely to have reclaim problems with mixed memcg/global memory
> pressure scenarios.
>
> e.g. if all the memory is in the per-node lists, and the memcg needs
> to reclaim memory because of a global shortage, it is now unable to
> reclaim global memory.....
>

Hi Dave,

Thanks for your detailed explanation.
But I have a different understanding.
The difference between for_each_mem_cgroup(memcg) and
for_each_memcg_cache_index(memcg_idx) is that the
for_each_mem_cgroup() includes the root_mem_cgroup while the
for_each_memcg_cache_index() excludes the root_mem_cgroup because the
memcg_idx of it is -1.
So it can reclaim global memory even if the list is memcg aware.
Is that right ?

Thanks
Yafang

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode
  2020-01-04  3:55   ` Dave Chinner
@ 2020-01-04  7:42       ` Yafang Shao
  0 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2020-01-04  7:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Roman Gushchin, Chris Down,
	Dave Chinner

On Sat, Jan 4, 2020 at 11:55 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 24, 2019 at 02:53:26AM -0500, Yafang Shao wrote:
> > On my server there're some running MEMCGs protected by memory.{min, low},
> > but I found the usage of these MEMCGs abruptly became very small, which
> > were far less than the protect limit. It confused me and finally I
> > found that was because of inode stealing.
> > Once an inode is freed, all its belonging page caches will be dropped as
> > well, no matter how may page caches it has. So if we intend to protect the
> > page caches in a memcg, we must protect their host (the inode) first.
> > Otherwise the memcg protection can be easily bypassed with freeing inode,
> > especially if there're big files in this memcg.
> >
> > Supposes we have a memcg, and the stat of this memcg is,
> >       memory.current = 1024M
> >       memory.min = 512M
> > And in this memcg there's a inode with 800M page caches.
> > Once this memcg is scanned by kswapd or other regular reclaimers,
> >     kswapd <<<< It can be either of the regular reclaimers.
> >         shrink_node_memcgs
> >           switch (mem_cgroup_protected()) <<<< Not protected
> >               case MEMCG_PROT_NONE:  <<<< Will scan this memcg
> >                       beak;
> >             shrink_lruvec() <<<< Reclaim the page caches
> >             shrink_slab()   <<<< It may free this inode and drop all its
> >                                  page caches(800M).
> > So we must protect the inode first if we want to protect page caches.
> >
> > The inherent mismatch between memcg and inode is a trouble. One inode can
> > be shared by different MEMCGs, but it is a very rare case. If an inode is
> > shared, its belonging page caches may be charged to different MEMCGs.
> > Currently there's no perfect solution to fix this kind of issue, but the
> > inode majority-writer ownership switching can help it more or less.
>
> There's multiple changes in this patch set. Yes, there's some inode
> cache futzing to deal with memcgs, but it also adds some weird
> undocumented "in low reclaim" heuristic that does something magical
> with "protection" that you don't describe or document at all. PLease
> separate that out into a new patch and document it clearly (both in
> the commit message and the code, please).
>

Sure, I will separate it and document it in the next version.

> > diff --git a/fs/inode.c b/fs/inode.c
> > index fef457a..4f4b2f3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -54,6 +54,13 @@
> >   *   inode_hash_lock
> >   */
> >
> > +struct inode_head {
>
> not an "inode head" structure. They are inode isolation arguments.
> i.e. struct inode_isolation_args {...};
>

I agree that inode_isolation_args is better.
But I have a different idea: what about using inode_isolation_control?
scan_control : to scan all MEMCGs
        v
shrink_control: to shrink all slabs in one memcg
        v
inode_isolation_control: to shrink one slab (the inode)

And in the future we may introduce dentry_isolation_control and some others.
Anyway that's a minor issue.

> > +     struct list_head *freeable;
> > +#ifdef CONFIG_MEMCG_KMEM
> > +     struct mem_cgroup *memcg;
> > +#endif
> > +};
>
> These defines are awful, too, and completely unnecessary, because
> the struct shrink_control unconditionally defines sc->memcg and
> so we can just code it throughout without caring whether memcgs are
> enabled or not.
>

Sure, will change it in the next version.

> > +
> >  static unsigned int i_hash_mask __read_mostly;
> >  static unsigned int i_hash_shift __read_mostly;
> >  static struct hlist_head *inode_hashtable __read_mostly;
> > @@ -724,8 +731,10 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >  static enum lru_status inode_lru_isolate(struct list_head *item,
> >               struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
> >  {
> > -     struct list_head *freeable = arg;
> > +     struct inode_head *ihead = (struct inode_head *)arg;
>
> No need for a cast of a void *.
>

OK.

> > +     struct list_head *freeable = ihead->freeable;
> >       struct inode    *inode = container_of(item, struct inode, i_lru);
> > +     struct mem_cgroup *memcg = NULL;
>
>         struct inode_isolation_args *iargs = arg;
>         struct list_head *freeable = iargs->freeable;
>         struct mem_cgroup *memcg = iargs->memcg;
>         struct inode    *inode = container_of(item, struct inode, i_lru);
>

Thanks. That looks better.

> > @@ -734,6 +743,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >       if (!spin_trylock(&inode->i_lock))
> >               return LRU_SKIP;
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> > +     memcg = ihead->memcg;
> > +#endif
>
> This is no longer necessary.
>

Thanks.

> > +     if (memcg && inode->i_data.nrpages &&
> > +         !(memcg_can_reclaim_inode(memcg, inode))) {
> > +             spin_unlock(&inode->i_lock);
> > +             return LRU_ROTATE;
> > +     }
>
> I'd argue that both the memcg and the inode->i_data.nrpages check
> should be inside memcg_can_reclaim_inode(), that way this entire
> chunk of code disappears when CONFIG_MEMCG_KMEM=N.
>

I will think about it and make the code better when CONFIG_MEMCG_KMEM=N.

> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index f36ada9..d1d4175 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -247,6 +247,9 @@ struct mem_cgroup {
> >       unsigned int tcpmem_active : 1;
> >       unsigned int tcpmem_pressure : 1;
> >
> > +     /* Soft protection will be ignored if it's true */
> > +     unsigned int in_low_reclaim : 1;
>
> This is the stuff that has nothing to do with the changes to the
> inode reclaim shrinker...
>

In the next version I will drop this in_low_reclaim and use the
memcg_low_reclaim passed from struct scan_control.
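
Just as an illustration of what I mean (both the extra shrink_slab()
argument and the new shrink_control field below are hypothetical,
nothing like that exists today):

        /* in the caller of shrink_slab(), where sc is the scan_control */
        shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority,
                    sc->memcg_low_reclaim);

        /* shrink_slab() would then carry it in the shrink_control that
         * reaches the list_lru isolate callback
         */
        struct shrink_control shrinkctl = {
                .gfp_mask = gfp_mask,
                .nid = nid,
                .memcg = memcg,
                .memcg_low_reclaim = memcg_low_reclaim, /* hypothetical field */
        };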


Thanks
Yafang

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2020-01-04  7:26       ` Yafang Shao
  (?)
@ 2020-01-04 21:23       ` Dave Chinner
  2020-01-05  1:43           ` Yafang Shao
  -1 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2020-01-04 21:23 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Dave Chinner

On Sat, Jan 04, 2020 at 03:26:13PM +0800, Yafang Shao wrote:
> On Sat, Jan 4, 2020 at 11:36 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, Dec 24, 2019 at 02:53:25AM -0500, Yafang Shao wrote:
> > > The lru walker isolation function may use this memcg to do something, e.g.
> > > the inode isolatation function will use the memcg to do inode protection in
> > > followup patch. So make memcg visible to the lru walker isolation function.
> > >
> > > Something should be emphasized in this patch is it replaces
> > > for_each_memcg_cache_index() with for_each_mem_cgroup() in
> > > list_lru_walk_node(). Because there's a gap between these two MACROs that
> > > for_each_mem_cgroup() depends on CONFIG_MEMCG while the other one depends
> > > on CONFIG_MEMCG_KMEM. But as list_lru_memcg_aware() returns false if
> > > CONFIG_MEMCG_KMEM is not configured, it is safe to this replacement.
> > >
> > > Cc: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >
> > ....
> >
> > > @@ -299,17 +299,15 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > >                                list_lru_walk_cb isolate, void *cb_arg,
> > >                                unsigned long *nr_to_walk)
> > >  {
> > > +     struct mem_cgroup *memcg;
> > >       long isolated = 0;
> > > -     int memcg_idx;
> > >
> > > -     isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > -                                   nr_to_walk);
> > > -     if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
> > > -             for_each_memcg_cache_index(memcg_idx) {
> > > +     if (list_lru_memcg_aware(lru)) {
> > > +             for_each_mem_cgroup(memcg) {
> > >                       struct list_lru_node *nlru = &lru->node[nid];
> > >
> > >                       spin_lock(&nlru->lock);
> > > -                     isolated += __list_lru_walk_one(nlru, memcg_idx,
> > > +                     isolated += __list_lru_walk_one(nlru, memcg,
> > >                                                       isolate, cb_arg,
> > >                                                       nr_to_walk);
> > >                       spin_unlock(&nlru->lock);
> > > @@ -317,7 +315,11 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > >                       if (*nr_to_walk <= 0)
> > >                               break;
> > >               }
> > > +     } else {
> > > +             isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > +                                           nr_to_walk);
> > >       }
> > > +
> >
> > That's a change of behaviour. The old code always runs per-node
> > reclaim, then if the LRU is memcg aware it also runs the memcg
> > aware reclaim. The new code never runs global per-node reclaim
> > if the list is memcg aware, so shrinkers that are initialised
> > with the flags SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE seem
> > likely to have reclaim problems with mixed memcg/global memory
> > pressure scenarios.
> >
> > e.g. if all the memory is in the per-node lists, and the memcg needs
> > to reclaim memory because of a global shortage, it is now unable to
> > reclaim global memory.....
> >
> 
> Hi Dave,
> 
> Thanks for your detailed explanation.
> But I have different understanding.
> The difference between for_each_mem_cgroup(memcg) and
> for_each_memcg_cache_index(memcg_idx) is that the
> for_each_mem_cgroup() includes the root_mem_cgroup while the
> for_each_memcg_cache_index() excludes the root_mem_cgroup because the
> memcg_idx of it is -1.

Except that the "root" memcg that for_each_mem_cgroup() is not the
"global root" memcg - it is whatever memcg that is passed down in
the shrink_control, whereever that sits in the cgroup tree heirarchy.
do_shrink_slab() only ever passes down the global root memcg to the
shrinkers when the global root memcg is passed to shrink_slab(), and
that does not iterate the memcg heirarchy - it just wants to
reclaim from global caches an non-memcg aware shrinkers.

IOWs, there are multiple changes in behaviour here - memcg specific
reclaim won't do global reclaim, and global reclaim will now iterate
all memcgs instead of just the global root memcg.
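
For reference, this is roughly the gate in shrink_slab() I'm talking
about (paraphrased from mm/vmscan.c, details trimmed):

        if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
                return shrink_slab_memcg(gfp_mask, nid, memcg, priority);

        /* only the root memcg (or NULL, when memcg is disabled) falls
         * through to the walk of the global shrinker_list below
         */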

> So it can reclaim global memory even if the list is memcg aware.
> Is that right ?

If the memcg passed to this function is the root memcg, then yes,
it will behave as you suggest. But for the majority of memcg-context
reclaim, the memcg is not the root memcg and so they will not do
global reclaim anymore...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2020-01-04 21:23       ` Dave Chinner
@ 2020-01-05  1:43           ` Yafang Shao
  0 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2020-01-05  1:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Dave Chinner

On Sun, Jan 5, 2020 at 5:23 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Jan 04, 2020 at 03:26:13PM +0800, Yafang Shao wrote:
> > On Sat, Jan 4, 2020 at 11:36 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Tue, Dec 24, 2019 at 02:53:25AM -0500, Yafang Shao wrote:
> > > > The lru walker isolation function may use this memcg to do something, e.g.
> > > > the inode isolatation function will use the memcg to do inode protection in
> > > > followup patch. So make memcg visible to the lru walker isolation function.
> > > >
> > > > Something should be emphasized in this patch is it replaces
> > > > for_each_memcg_cache_index() with for_each_mem_cgroup() in
> > > > list_lru_walk_node(). Because there's a gap between these two MACROs that
> > > > for_each_mem_cgroup() depends on CONFIG_MEMCG while the other one depends
> > > > on CONFIG_MEMCG_KMEM. But as list_lru_memcg_aware() returns false if
> > > > CONFIG_MEMCG_KMEM is not configured, it is safe to this replacement.
> > > >
> > > > Cc: Dave Chinner <dchinner@redhat.com>
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > >
> > > ....
> > >
> > > > @@ -299,17 +299,15 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > > >                                list_lru_walk_cb isolate, void *cb_arg,
> > > >                                unsigned long *nr_to_walk)
> > > >  {
> > > > +     struct mem_cgroup *memcg;
> > > >       long isolated = 0;
> > > > -     int memcg_idx;
> > > >
> > > > -     isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > > -                                   nr_to_walk);
> > > > -     if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
> > > > -             for_each_memcg_cache_index(memcg_idx) {
> > > > +     if (list_lru_memcg_aware(lru)) {
> > > > +             for_each_mem_cgroup(memcg) {
> > > >                       struct list_lru_node *nlru = &lru->node[nid];
> > > >
> > > >                       spin_lock(&nlru->lock);
> > > > -                     isolated += __list_lru_walk_one(nlru, memcg_idx,
> > > > +                     isolated += __list_lru_walk_one(nlru, memcg,
> > > >                                                       isolate, cb_arg,
> > > >                                                       nr_to_walk);
> > > >                       spin_unlock(&nlru->lock);
> > > > @@ -317,7 +315,11 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > > >                       if (*nr_to_walk <= 0)
> > > >                               break;
> > > >               }
> > > > +     } else {
> > > > +             isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > > +                                           nr_to_walk);
> > > >       }
> > > > +
> > >
> > > That's a change of behaviour. The old code always runs per-node
> > > reclaim, then if the LRU is memcg aware it also runs the memcg
> > > aware reclaim. The new code never runs global per-node reclaim
> > > if the list is memcg aware, so shrinkers that are initialised
> > > with the flags SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE seem
> > > likely to have reclaim problems with mixed memcg/global memory
> > > pressure scenarios.
> > >
> > > e.g. if all the memory is in the per-node lists, and the memcg needs
> > > to reclaim memory because of a global shortage, it is now unable to
> > > reclaim global memory.....
> > >
> >
> > Hi Dave,
> >
> > Thanks for your detailed explanation.
> > But I have different understanding.
> > The difference between for_each_mem_cgroup(memcg) and
> > for_each_memcg_cache_index(memcg_idx) is that the
> > for_each_mem_cgroup() includes the root_mem_cgroup while the
> > for_each_memcg_cache_index() excludes the root_mem_cgroup because the
> > memcg_idx of it is -1.
>
> Except that the "root" memcg that for_each_mem_cgroup() is not the
> "global root" memcg - it is whatever memcg that is passed down in
> the shrink_control, whereever that sits in the cgroup tree heirarchy.
> do_shrink_slab() only ever passes down the global root memcg to the
> shrinkers when the global root memcg is passed to shrink_slab(), and
> that does not iterate the memcg heirarchy - it just wants to
> reclaim from global caches an non-memcg aware shrinkers.
>
> IOWs, there are multiple changes in behaviour here - memcg specific
> reclaim won't do global reclaim, and global reclaim will now iterate
> all memcgs instead of just the global root memcg.
>
> > So it can reclaim global memory even if the list is memcg aware.
> > Is that right ?
>
> If the memcg passed to this fucntion is the root memcg, then yes,
> it will behave as you suggest. But for the majority of memcg-context
> reclaim, the memcg is not the root memcg and so they will not do
> global reclaim anymore...
>

Thanks for your reply.
But I have to clarify that this change is in list_lru_walk_node(), and
the memcg is not passed to this function from shrink_control.
To make it clearer, I have pasted both versions of the function here.

- The new function
unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
                                 list_lru_walk_cb isolate, void *cb_arg,
                                 unsigned long *nr_to_walk)
{
        struct mem_cgroup *memcg;   <<<< A local variable
        long isolated = 0;

        if (list_lru_memcg_aware(lru)) {
                <<<< scan all MEMCGs, including root_mem_cgroup
                for_each_mem_cgroup(memcg) {
                        struct list_lru_node *nlru = &lru->node[nid];

                        spin_lock(&nlru->lock);
                        isolated += __list_lru_walk_one(nlru, memcg,
                                                        isolate, cb_arg,
                                                        nr_to_walk);
                        spin_unlock(&nlru->lock);

                        if (*nr_to_walk <= 0)
                                break;
                }
        } else {  <<<< scan global memory only (root_mem_cgroup)
                isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
                                              nr_to_walk);
        }

        return isolated;
}

- While the original function is,

unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
                                 list_lru_walk_cb isolate, void *cb_arg,
                                 unsigned long *nr_to_walk)
{
        long isolated = 0;
        int memcg_idx;

        isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
                                      nr_to_walk);
            <<<<  scan global memory only (root_mem_cgroup)
        if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
                for_each_memcg_cache_index(memcg_idx) {  <<<< scan all MEMCGs, excluding root_mem_cgroup
                        struct list_lru_node *nlru = &lru->node[nid];

                        spin_lock(&nlru->lock);
                        isolated += __list_lru_walk_one(nlru, memcg_idx,
                                                        isolate, cb_arg,
                                                        nr_to_walk);
                        spin_unlock(&nlru->lock);

                        if (*nr_to_walk <= 0)
                                break;
                }
        }

        return isolated;
}

Is that right?

Thanks
Yafang

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2020-01-05  1:43           ` Yafang Shao
@ 2020-01-06  0:17           ` Dave Chinner
  2020-01-06 14:41               ` Yafang Shao
  -1 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2020-01-06  0:17 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Dave Chinner

On Sun, Jan 05, 2020 at 09:43:11AM +0800, Yafang Shao wrote:
> On Sun, Jan 5, 2020 at 5:23 AM Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Jan 04, 2020 at 03:26:13PM +0800, Yafang Shao wrote:
> > > On Sat, Jan 4, 2020 at 11:36 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Tue, Dec 24, 2019 at 02:53:25AM -0500, Yafang Shao wrote:
> > > > > The lru walker isolation function may use this memcg to do something, e.g.
> > > > > the inode isolatation function will use the memcg to do inode protection in
> > > > > followup patch. So make memcg visible to the lru walker isolation function.
> > > > >
> > > > > Something should be emphasized in this patch is it replaces
> > > > > for_each_memcg_cache_index() with for_each_mem_cgroup() in
> > > > > list_lru_walk_node(). Because there's a gap between these two MACROs that
> > > > > for_each_mem_cgroup() depends on CONFIG_MEMCG while the other one depends
> > > > > on CONFIG_MEMCG_KMEM. But as list_lru_memcg_aware() returns false if
> > > > > CONFIG_MEMCG_KMEM is not configured, it is safe to this replacement.
> > > > >
> > > > > Cc: Dave Chinner <dchinner@redhat.com>
> > > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > >
> > > > ....
> > > >
> > > > > @@ -299,17 +299,15 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > > > >                                list_lru_walk_cb isolate, void *cb_arg,
> > > > >                                unsigned long *nr_to_walk)
> > > > >  {
> > > > > +     struct mem_cgroup *memcg;
> > > > >       long isolated = 0;
> > > > > -     int memcg_idx;
> > > > >
> > > > > -     isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > > > -                                   nr_to_walk);
> > > > > -     if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
> > > > > -             for_each_memcg_cache_index(memcg_idx) {
> > > > > +     if (list_lru_memcg_aware(lru)) {
> > > > > +             for_each_mem_cgroup(memcg) {
> > > > >                       struct list_lru_node *nlru = &lru->node[nid];
> > > > >
> > > > >                       spin_lock(&nlru->lock);
> > > > > -                     isolated += __list_lru_walk_one(nlru, memcg_idx,
> > > > > +                     isolated += __list_lru_walk_one(nlru, memcg,
> > > > >                                                       isolate, cb_arg,
> > > > >                                                       nr_to_walk);
> > > > >                       spin_unlock(&nlru->lock);
> > > > > @@ -317,7 +315,11 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > > > >                       if (*nr_to_walk <= 0)
> > > > >                               break;
> > > > >               }
> > > > > +     } else {
> > > > > +             isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > > > +                                           nr_to_walk);
> > > > >       }
> > > > > +
> > > >
> > > > That's a change of behaviour. The old code always runs per-node
> > > > reclaim, then if the LRU is memcg aware it also runs the memcg
> > > > aware reclaim. The new code never runs global per-node reclaim
> > > > if the list is memcg aware, so shrinkers that are initialised
> > > > with the flags SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE seem
> > > > likely to have reclaim problems with mixed memcg/global memory
> > > > pressure scenarios.
> > > >
> > > > e.g. if all the memory is in the per-node lists, and the memcg needs
> > > > to reclaim memory because of a global shortage, it is now unable to
> > > > reclaim global memory.....
> > > >
> > >
> > > Hi Dave,
> > >
> > > Thanks for your detailed explanation.
> > > But I have different understanding.
> > > The difference between for_each_mem_cgroup(memcg) and
> > > for_each_memcg_cache_index(memcg_idx) is that the
> > > for_each_mem_cgroup() includes the root_mem_cgroup while the
> > > for_each_memcg_cache_index() excludes the root_mem_cgroup because the
> > > memcg_idx of it is -1.
> >
> > Except that the "root" memcg that for_each_mem_cgroup() is not the
> > "global root" memcg - it is whatever memcg that is passed down in
> > the shrink_control, whereever that sits in the cgroup tree heirarchy.
> > do_shrink_slab() only ever passes down the global root memcg to the
> > shrinkers when the global root memcg is passed to shrink_slab(), and
> > that does not iterate the memcg heirarchy - it just wants to
> > reclaim from global caches an non-memcg aware shrinkers.
> >
> > IOWs, there are multiple changes in behaviour here - memcg specific
> > reclaim won't do global reclaim, and global reclaim will now iterate
> > all memcgs instead of just the global root memcg.
> >
> > > So it can reclaim global memory even if the list is memcg aware.
> > > Is that right ?
> >
> > If the memcg passed to this fucntion is the root memcg, then yes,
> > it will behave as you suggest. But for the majority of memcg-context
> > reclaim, the memcg is not the root memcg and so they will not do
> > global reclaim anymore...
> >
> 
> Thanks for you reply.
> But I have to clairfy that this change is in list_lru_walk_node(), and
> the memcg is not passed to this funtion from shrink_control.
> In order to make it more clear, I paste the function here.
> 
> - The new function
> unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
>                                  list_lru_walk_cb isolate, void *cb_arg,
>                                  unsigned long *nr_to_walk)
> {
>         struct mem_cgroup *memcg;   <<<< A local variable
>         long isolated = 0;
> 
>         if (list_lru_memcg_aware(lru)) {
>                 for_each_mem_cgroup(memcg) {  <<<<  scan all MEMCGs,
> including root_mem_cgroup

Oh, wait, this is list_lru_walk_node(), which is not even called by
the shrinker code - this is the "iterate every object" code
path. So this isn't even part of the code path this patchset needs
the memcg for.

<does some code spelunking>

Gawd, what a mess this memcg stuff is when you look at it from the
outside. After digging, I find that the root memcg has a kmemcg_id =
-1. There is special case code everywhere that checks the id for >=
0, open codes memcg == root_mem_cgroup checks, calls
mem_cgroup_is_root(), etc. to avoid doing things like using the root
memcg kmemcg_id as an array index. And in the case of the list_lru,
this means it doesn't use the memcg lru indexes at all (i.e.
lru->node[nid].memcg_lrus[memcg_idx]), but quietly returns the global
list at lru->node[nid].lru when kmemcg_id < 0.
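
For reference, that lookup boils down to something like this
(paraphrased from mm/list_lru.c from memory, RCU annotations elided,
so treat it as a sketch rather than the exact source):

	static inline struct list_lru_one *
	list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
	{
		/*
		 * The root memcg (kmemcg_id == -1) quietly falls back to
		 * the global per-node list...
		 */
		if (idx < 0 || !nlru->memcg_lrus)
			return &nlru->lru;

		/* ...everything else gets its own per-memcg list. */
		return nlru->memcg_lrus->lru[idx];
	}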

So, after a couple of hours of wading through the code, I finally
remember all the details of the existing code and understand how
this new code works - it relies entirely on the special casing of
the root memcg that makes it behave like memcgs aren't configured at
all. i.e. we select the global root lists when the root memcg is
passed around rather than using a memcg specific LRU list.

The old code used for_each_memcg_cache_index(), which never
scanned the root memcg:

#define for_each_memcg_cache_index(_idx)        \
        for ((_idx) = 0; (_idx) < memcg_nr_cache_ids; (_idx)++)

because it starts at idx = 0, and the root memcg is -1. Hence the
old code always needed to scan the root memcg (the global lists)
manually itself, which made the code really nice and clear because
it was the same for both non-memcg and memcg aware lists. i.e. the
global scan was not hidden away inside the memcg scanning.
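
By contrast, the for_each_mem_cgroup() iterator the new code uses
walks the whole hierarchy starting from (and including) the root
memcg - IIRC it's just a wrapper around mem_cgroup_iter(), something
like this (paraphrased, check the real definition in mm/memcontrol.c):

	#define for_each_mem_cgroup(iter)			\
		for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
		     iter != NULL;				\
		     iter = mem_cgroup_iter(NULL, iter, NULL))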

IOWs, I find it much easier to see the difference between global and
per-memcg iteration in the existing code. IF we are keeping this
special "root memcg context is special" architecture (more on that
below), I'd prefer that the code keeps that distinction, because
having to understand how the root memcg is special cased just to
work out how this code iterates the global caches is a waste of time
and brain power. Not everyone really wants to know about memcg
internals....

We can clean up the code a lot by getting rid of the unnecessary
indenting by doing this:

	/* iterate the global lru first */
	isolated = list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
					nr_to_walk);
	if (!list_lru_memcg_aware(lru))
		return isolated;

	nlru = &lru->node[nid];
	for_each_mem_cgroup(memcg) {
		/* already scanned the root memcg above */
		if (is_root_memcg(memcg))
			continue;
		if (*nr_to_walk <= 0)
			break;
		spin_lock(&nlru->lock);
		isolated += __list_lru_walk_one(nlru, memcg,
						isolate, cb_arg,
						nr_to_walk);
		spin_unlock(&nlru->lock);
	}
	return isolated;


And so to architecture... This all makes me wonder why we still
special case the memcg LRU lists here. Ever since we went to
per-node memcg lru lists (~2015, iirc), there's been this special
hidden hack for the root memcg to use the global list, and one that
I have to read lots of code to remind myself it exists every time I
have to dig into this code again.

I mean, if the list is not memcg aware, then it only needs a single
list per node - the root memcg list. That could be easily stored in
the memcg_lrus array for the node rather than a separate global
list, and then the node iteration code all starts to look like this:

	nlru = &lru->node[nid];
	for_each_mem_cgroup(memcg) {
		spin_lock(&nlru->lock);
		isolated += __list_lru_walk_one(nlru, memcg,
						isolate, cb_arg,
						nr_to_walk);
		spin_unlock(&nlru->lock);
		if (*nr_to_walk <= 0)
			break;

		/* non-memcg aware lists only have a root memcg list */
		if (!list_lru_memcg_aware(lru))
			break;
	}

Hence for the walks in the !list_lru_memcg_aware(lru) case, the
list_lru_from_memcg() call in __list_lru_walk_one() always returns
just the root list. This makes everything use the same iteration
interface, and when you configure out memcg then we simply code
the iterator to run once and list_lru_from_memcg() always returns
the one list...

i.e. the whole reason for having the separate global and per-memcg
LRU lists went away back in 2015 when it was decided that we don't
care about the (potentially massive) memory usage of per-node memcg
LRU lists. However, the special casing of the root memcg was left in
place, but what your patch says to me is that we should really be
trying to make this special casing go away.

Moving everything in the list-lru to the memcg_lrus also means that
the global shrinker lists get the "objects in the lru to scan for
reclaim" shrinker bitmaps that optimise the high level shrinker
code. It means we could get rid of the separate global vs memcg
shrinker paths in vmscan.c - we just pass the root memcg into
shrink_slab_memcg() when we want to do global shrinker reclaim...
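
i.e. something along these lines - a hand-waving sketch of the
direction only, not a real patch, and it glosses over the
!CONFIG_MEMCG build and the non-memcg-aware shrinkers:

	/* global shrinker reclaim becomes reclaim against the root memcg */
	static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
					 struct mem_cgroup *memcg,
					 int priority)
	{
		if (!memcg)
			memcg = root_mem_cgroup;
		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
	}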

IMO, if memcgs are fundamental to the operation of systems these
days such that the root memcg is treated no differently to all
other memcgs, then let's architect the whole stack that way, not
just make the code hard to read by having little bits of code hide
the special casing of the root memcg layers deep where it's really
hard to find....

And, FWIW, I noticed that the list_lru memcg code assumes we only
ever put objects from memcg-associated slab pages in the list_lru.
It calls memcg_from_slab_page() which makes no attempt to verify the
page is actually a slab page. That's a landmine just waiting to go
boom - list-lru can store any type of object the user wants, not
just slab pages. e.g. the binder code stores pages in the list-lru,
not slab objects, and so the only reason it doesn't go boom is that
the lru-list is not configured to be memcg aware....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2020-01-06  0:17           ` Dave Chinner
@ 2020-01-06 14:41               ` Yafang Shao
  0 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2020-01-06 14:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Dave Chinner

On Mon, Jan 6, 2020 at 8:17 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Jan 05, 2020 at 09:43:11AM +0800, Yafang Shao wrote:
> > On Sun, Jan 5, 2020 at 5:23 AM Dave Chinner <david@fromorbit.com> wrote:
> > > On Sat, Jan 04, 2020 at 03:26:13PM +0800, Yafang Shao wrote:
> > > > On Sat, Jan 4, 2020 at 11:36 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > On Tue, Dec 24, 2019 at 02:53:25AM -0500, Yafang Shao wrote:
> > > > > > The lru walker isolation function may use this memcg to do something, e.g.
> > > > > > the inode isolatation function will use the memcg to do inode protection in
> > > > > > followup patch. So make memcg visible to the lru walker isolation function.
> > > > > >
> > > > > > Something should be emphasized in this patch is it replaces
> > > > > > for_each_memcg_cache_index() with for_each_mem_cgroup() in
> > > > > > list_lru_walk_node(). Because there's a gap between these two MACROs that
> > > > > > for_each_mem_cgroup() depends on CONFIG_MEMCG while the other one depends
> > > > > > on CONFIG_MEMCG_KMEM. But as list_lru_memcg_aware() returns false if
> > > > > > CONFIG_MEMCG_KMEM is not configured, it is safe to this replacement.
> > > > > >
> > > > > > Cc: Dave Chinner <dchinner@redhat.com>
> > > > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > >
> > > > > ....
> > > > >
> > > > > > @@ -299,17 +299,15 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > > > > >                                list_lru_walk_cb isolate, void *cb_arg,
> > > > > >                                unsigned long *nr_to_walk)
> > > > > >  {
> > > > > > +     struct mem_cgroup *memcg;
> > > > > >       long isolated = 0;
> > > > > > -     int memcg_idx;
> > > > > >
> > > > > > -     isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > > > > -                                   nr_to_walk);
> > > > > > -     if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
> > > > > > -             for_each_memcg_cache_index(memcg_idx) {
> > > > > > +     if (list_lru_memcg_aware(lru)) {
> > > > > > +             for_each_mem_cgroup(memcg) {
> > > > > >                       struct list_lru_node *nlru = &lru->node[nid];
> > > > > >
> > > > > >                       spin_lock(&nlru->lock);
> > > > > > -                     isolated += __list_lru_walk_one(nlru, memcg_idx,
> > > > > > +                     isolated += __list_lru_walk_one(nlru, memcg,
> > > > > >                                                       isolate, cb_arg,
> > > > > >                                                       nr_to_walk);
> > > > > >                       spin_unlock(&nlru->lock);
> > > > > > @@ -317,7 +315,11 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> > > > > >                       if (*nr_to_walk <= 0)
> > > > > >                               break;
> > > > > >               }
> > > > > > +     } else {
> > > > > > +             isolated += list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > > > > > +                                           nr_to_walk);
> > > > > >       }
> > > > > > +
> > > > >
> > > > > That's a change of behaviour. The old code always runs per-node
> > > > > reclaim, then if the LRU is memcg aware it also runs the memcg
> > > > > aware reclaim. The new code never runs global per-node reclaim
> > > > > if the list is memcg aware, so shrinkers that are initialised
> > > > > with the flags SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE seem
> > > > > likely to have reclaim problems with mixed memcg/global memory
> > > > > pressure scenarios.
> > > > >
> > > > > e.g. if all the memory is in the per-node lists, and the memcg needs
> > > > > to reclaim memory because of a global shortage, it is now unable to
> > > > > reclaim global memory.....
> > > > >
> > > >
> > > > Hi Dave,
> > > >
> > > > Thanks for your detailed explanation.
> > > > But I have different understanding.
> > > > The difference between for_each_mem_cgroup(memcg) and
> > > > for_each_memcg_cache_index(memcg_idx) is that the
> > > > for_each_mem_cgroup() includes the root_mem_cgroup while the
> > > > for_each_memcg_cache_index() excludes the root_mem_cgroup because the
> > > > memcg_idx of it is -1.
> > >
> > > Except that the "root" memcg that for_each_mem_cgroup() is not the
> > > "global root" memcg - it is whatever memcg that is passed down in
> > > the shrink_control, whereever that sits in the cgroup tree heirarchy.
> > > do_shrink_slab() only ever passes down the global root memcg to the
> > > shrinkers when the global root memcg is passed to shrink_slab(), and
> > > that does not iterate the memcg heirarchy - it just wants to
> > > reclaim from global caches an non-memcg aware shrinkers.
> > >
> > > IOWs, there are multiple changes in behaviour here - memcg specific
> > > reclaim won't do global reclaim, and global reclaim will now iterate
> > > all memcgs instead of just the global root memcg.
> > >
> > > > So it can reclaim global memory even if the list is memcg aware.
> > > > Is that right ?
> > >
> > > If the memcg passed to this fucntion is the root memcg, then yes,
> > > it will behave as you suggest. But for the majority of memcg-context
> > > reclaim, the memcg is not the root memcg and so they will not do
> > > global reclaim anymore...
> > >
> >
> > Thanks for you reply.
> > But I have to clairfy that this change is in list_lru_walk_node(), and
> > the memcg is not passed to this funtion from shrink_control.
> > In order to make it more clear, I paste the function here.
> >
> > - The new function
> > unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
> >                                  list_lru_walk_cb isolate, void *cb_arg,
> >                                  unsigned long *nr_to_walk)
> > {
> >         struct mem_cgroup *memcg;   <<<< A local variable
> >         long isolated = 0;
> >
> >         if (list_lru_memcg_aware(lru)) {
> >                 for_each_mem_cgroup(memcg) {  <<<<  scan all MEMCGs,
> > including root_mem_cgroup
>
> Oh, wait, this is list_lru_walk_node(), which is not even called by
> the shrinker code - this is the "iterate every object" code
> path.

Right.

> So this isn't even part of the code path this patchset needs
> the memcg for.
>
> <does some code spelunking>
>
> Gawd, what a mess this memcg stuff is from when you look from the
> outside. After digging, I find that the root memcg has a kmemcg_id =
> -1. There is special case code everywhere that checks the id for >=
> 0, open codes memcg == root_mem_cgroup checks, calls
> mem_cgroup_is_root(), etc to avoid doing things like using the root
> memcg kmemcg_id as an array index. An in the case of the list_lru,
> this means it doesn't use the memcg lru indexes at all (i.e.
> lru->node[nid].memcg_lrus[memcg_idx], but quietly returns the global
> list at lru->node[nid].lru when kmemcg_id < 0.
>
> So, after a couple of hours of wading through the code, I finally
> remember all the details of the existing code and understand how
> this new code works - it relies entirely on the special casing of
> the root memcg that makes it behave like memcgs aren't configured at
> all. i.e. we select the global root lists when the root memcg is
> passed around rather than using a memcg specific LRU list.
>
> The old code used for_each_memcg_cache_index(), which never
> scanned the root memcg:
>
> #define for_each_memcg_cache_index(_idx)        \
>         for ((_idx) = 0; (_idx) < memcg_nr_cache_ids; (_idx)++)
>
> because it starts at idx = 0, and the root memcg is -1. Hence the
> old code always needed to scan the root memcg (the global lists)
> manually itself, which made the code really nice and clear because
> it was the same for both non-memcg and memcg aware lists. i.e. the
> global scan was not hidden away inside the memcg scanning.
>
> IOWs, I find the existing code is much easier to see the difference
> between global and per-memcg iteration. IF we are keeping this
> special "root memcg context is special" architecture (more on that
> below), I'd prefer that the code keeps that distinction, because
> having to understand how the root memcg is special cased just to
> work out how this code iterates the global caches is a waste of time
> and brain power. not every one really wants to know about memcg
> internals....
>
> We can clean up the code a lot by getting rid of the unnecessary
> indenting by doing this:
>
>         /* iterate the global lru first */
>         isolated = list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
>                                         nr_to_walk);
>         if (!list_lru_memcg_aware(lru))
>                 return isolated;
>
>         nlru = &lru->node[nid];
>         for_each_mem_cgroup(memcg) {
>                 /* already scanned the root memcg above */
>                 if (is_root_memcg(memcg))
>                         continue;
>                 if (*nr_to_walk <= 0)
>                         break;
>                 spin_lock(&nlru->lock);
>                 isolated += __list_lru_walk_one(nlru, memcg,
>                                                 isolate, cb_arg,
>                                                 nr_to_walk);
>                 spin_unlock(&nlru->lock);
>         }
>         return isolated;
>

That's easier to understand.
I will change it like this in the next version.

>
> And so to architecture... This all makes me wonder why we still
> special case the memcg LRU lists here.

Couldn't agree more.
The first time I read this code, I wondered why we don't assign a
non-negative number to the kmemcg_id of the root_mem_cgroup and then use
memcg_lrus as well.


> Ever since we went to
> per-node memcg lru lists (~2015, iirc), there's been this special
> hidden hack for the root memcg to use the global list. and one that
> I have to read lots of code to remind myself it exists every time I
> have to did into this code again.
>
> I mean, if the list is not memcg aware, then it only needs a single
> list per node - the root memcg list. That could be easily stored in
> the memcg_lrus array for the node rather than a separate global
> list, and then the node iteration code all starts to look like this:
>
>         nlru = &lru->node[nid];
>         for_each_mem_cgroup(memcg) {
>                 spin_lock(&nlru->lock);
>                 isolated += __list_lru_walk_one(nlru, memcg,
>                                                 isolate, cb_arg,
>                                                 nr_to_walk);
>                 spin_unlock(&nlru->lock);
>                 if (*nr_to_walk <= 0)
>                         break;
>
>                 /* non-memcg aware lists only have a root memcg list */
>                 if (!list_lru_memcg_aware(lru))
>                         break;
>         }
>
> Hence for the walks in the !list_lru_memcg_aware(lru) case, the
> list_lru_from_memcg() call in __list_lru_walk_one() always returns
> just the root list. This makes everything use the same iteration
> interface, and when you configure out memcg then we simply code the
> the iterator to run once and list_lru_from_memcg() always returns
> the one list...
>

I agree with you that it is a better architecture, and I also want to
change it like this.
That would be clearer and easier to maintain.
But it requires lots of code changes; should we address it in a
separate patchset?


> i.e. the whole reason for having the separate global and per-memcg
> LRU lists went away back in 2015 when it was decided that we don't
> care about the (potentially massive) memory usage of per-node memcg
> LRU lists. However, the special casing of the root memcg was left in
> place, but what your patch says to me is that we should really be
> trying to make this special casing go away.
>
> Moving everything in the list-lru to the memcg_lrus also means that
> the global shrinker lists get the "objects in the lru to scan for
> reclaim" shrinker bitmaps that optimise the high level shrinker
> code. It means we could get rid of the separate global vs memcg
> shrinker paths in vmscan.c - we just pass the root memcg into
> shrink_slab_memcg() when we want to do global shrinker reclaim...
>
> IMO, if memcgs are fundamental to the operation of systems these
> days such that there the root memcg is treated no differently to all
> other memcgs, then let's architect the whole stack that way, not
> just make the code hard to read by having little bits of code hide
> the special casing of the root memcg layers deep where it's raelly
> hard to find....
>

Agreed. I will try to improve it and submit a new patchset for this
architectural improvement.

> And, FWIW, I noticed that the list_lru memcg code assumes we only
> ever put objects from memcg associated slab pages in the list_lru.
> It calls memcg_from_slab_page() which makes no attempt to verify the
> page is actually a slab page. That's a landmine just waiting to get
> boom - list-lru can store any type of object the user wants, not
> just slab pages. e.g. the binder code stores pages in the list-lru,
> not slab objects, and so the only reason it doesn't go boom is that
> the lru-list is not configured to be memcg aware....
>

Another new issue.
I will try to diagnose what is hidden in this landmine, and after I
understand it clearly I will submit a new patchset.

Thanks
Yafang

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2020-01-06 14:41               ` Yafang Shao
@ 2020-01-06 21:31               ` Dave Chinner
  2020-01-07 13:22                   ` Yafang Shao
  -1 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2020-01-06 21:31 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Dave Chinner

On Mon, Jan 06, 2020 at 10:41:22PM +0800, Yafang Shao wrote:
> On Mon, Jan 6, 2020 at 8:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > We can clean up the code a lot by getting rid of the unnecessary
> > indenting by doing this:
> >
> >         /* iterate the global lru first */
> >         isolated = list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> >                                         nr_to_walk);
> >         if (!list_lru_memcg_aware(lru))
> >                 return isolated;
> >
> >         nlru = &lru->node[nid];
> >         for_each_mem_cgroup(memcg) {
> >                 /* already scanned the root memcg above */
> >                 if (is_root_memcg(memcg))
> >                         continue;
> >                 if (*nr_to_walk <= 0)
> >                         break;
> >                 spin_lock(&nlru->lock);
> >                 isolated += __list_lru_walk_one(nlru, memcg,
> >                                                 isolate, cb_arg,
> >                                                 nr_to_walk);
> >                 spin_unlock(&nlru->lock);
> >         }
> >         return isolated;
> >
> 
> That's eaiser to understand.
> I wil change it like this in next version.

Thanks!

> >
> > And so to architecture... This all makes me wonder why we still
> > special case the memcg LRU lists here.
> 
> Can't agree more.
> The first time I read this code, I wondered why not assign an
> non-negtive number to kmemcg_id of the root_mem_cgroup and then use
> memcg_lrus as well.

Yeah, I've been wondering why the ID is -1 instead of 0 when we
have a global variable that stores the root memcg that we can
compare pointers directly against via is_root_memcg(). All it seems
to do is make things more complex by having to special case the root
memcg....
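
To be clear, the is_root_memcg() I used in the sketches above is just
a trivial pointer compare against that global - IIRC the existing
mem_cgroup_is_root() helper already does exactly this:

	static inline bool is_root_memcg(struct mem_cgroup *memcg)
	{
		/* root_mem_cgroup is the global pointer to the root memcg */
		return memcg == root_mem_cgroup;
	}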

From that perspective, I do like your change to use the memcg
iterator functions rather than a for loop over the range of indexes,
but I'd much prefer to see this method used consistently everywhere
rather than the way we've duplicated lots of code by tacking memcgs
onto the side of the non-memcg code paths.

> > Ever since we went to
> > per-node memcg lru lists (~2015, iirc), there's been this special
> > hidden hack for the root memcg to use the global list. and one that
> > I have to read lots of code to remind myself it exists every time I
> > have to did into this code again.
> >
> > I mean, if the list is not memcg aware, then it only needs a single
> > list per node - the root memcg list. That could be easily stored in
> > the memcg_lrus array for the node rather than a separate global
> > list, and then the node iteration code all starts to look like this:
> >
> >         nlru = &lru->node[nid];
> >         for_each_mem_cgroup(memcg) {
> >                 spin_lock(&nlru->lock);
> >                 isolated += __list_lru_walk_one(nlru, memcg,
> >                                                 isolate, cb_arg,
> >                                                 nr_to_walk);
> >                 spin_unlock(&nlru->lock);
> >                 if (*nr_to_walk <= 0)
> >                         break;
> >
> >                 /* non-memcg aware lists only have a root memcg list */
> >                 if (!list_lru_memcg_aware(lru))
> >                         break;
> >         }
> >
> > Hence for the walks in the !list_lru_memcg_aware(lru) case, the
> > list_lru_from_memcg() call in __list_lru_walk_one() always returns
> > just the root list. This makes everything use the same iteration
> > interface, and when you configure out memcg then we simply code the
> > the iterator to run once and list_lru_from_memcg() always returns
> > the one list...
> >
> 
> Agree with you that it is a better architecture, and I also want to
> change it like this.
> That would be more clear and easier for maintiance.
> But it requires lots of code changes, should we address it in another
> separate patchset ?

Yes. I think this is a separate piece of work as it spans much more
than just the list-lru infrastructure.

> > And, FWIW, I noticed that the list_lru memcg code assumes we only
> > ever put objects from memcg associated slab pages in the list_lru.
> > It calls memcg_from_slab_page() which makes no attempt to verify the
> > page is actually a slab page. That's a landmine just waiting to get
> > boom - list-lru can store any type of object the user wants, not
> > just slab pages. e.g. the binder code stores pages in the list-lru,
> > not slab objects, and so the only reason it doesn't go boom is that
> > the lru-list is not configured to be memcg aware....
> >
> 
> Another new issue.
> I will try to dignose what hiddened in this landmine is, and after I
> understand it clearly I will submit a new patchset.

The problem is memcg_from_slab_page() uses page->slab_cache directly
to retrieve the owner memcg without first checking the
PageSlab(page) flag. If it's not a slab page, we need to get the
memcg from page->mem_cgroup, not page->slab_cache->memcg_params.memcg.

See page_cgroup_ino() for how to safely get the owner memcg from a
random page of unknown type...
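
i.e. roughly this pattern - a rough, untested sketch, and
memcg_from_lru_page() is a made-up name, but it mirrors what
page_cgroup_ino() does (callers need RCU protection around it):

	static struct mem_cgroup *memcg_from_lru_page(struct page *page)
	{
		/* slab pages keep the memcg in their kmem_cache */
		if (PageSlab(page) && !PageTail(page))
			return memcg_from_slab_page(page);

		/* everything else carries it on the page itself (may be NULL) */
		return READ_ONCE(page->mem_cgroup);
	}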

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function
  2020-01-06 21:31               ` Dave Chinner
@ 2020-01-07 13:22                   ` Yafang Shao
  0 siblings, 0 replies; 36+ messages in thread
From: Yafang Shao @ 2020-01-07 13:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Dave Chinner

On Tue, Jan 7, 2020 at 5:31 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Jan 06, 2020 at 10:41:22PM +0800, Yafang Shao wrote:
> > On Mon, Jan 6, 2020 at 8:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > > We can clean up the code a lot by getting rid of the unnecessary
> > > indenting by doing this:
> > >
> > >         /* iterate the global lru first */
> > >         isolated = list_lru_walk_one(lru, nid, NULL, isolate, cb_arg,
> > >                                         nr_to_walk);
> > >         if (!list_lru_memcg_aware(lru))
> > >                 return isolated;
> > >
> > >         nlru = &lru->node[nid];
> > >         for_each_mem_cgroup(memcg) {
> > >                 /* already scanned the root memcg above */
> > >                 if (is_root_memcg(memcg))
> > >                         continue;
> > >                 if (*nr_to_walk <= 0)
> > >                         break;
> > >                 spin_lock(&nlru->lock);
> > >                 isolated += __list_lru_walk_one(nlru, memcg,
> > >                                                 isolate, cb_arg,
> > >                                                 nr_to_walk);
> > >                 spin_unlock(&nlru->lock);
> > >         }
> > >         return isolated;
> > >
> >
> > That's eaiser to understand.
> > I wil change it like this in next version.
>
> Thanks!
>
> > >
> > > And so to architecture... This all makes me wonder why we still
> > > special case the memcg LRU lists here.
> >
> > Can't agree more.
> > The first time I read this code, I wondered why not assign an
> > non-negtive number to kmemcg_id of the root_mem_cgroup and then use
> > memcg_lrus as well.
>
> Yeah, I've been wondering why the ID is -1 instead of 0 when we
> have a global variable that stores the root memcg that we can
> compare pointers directly against via is_root_memcg(). all it seems
> to do is make things more complex by having to special case the root
> memcg....
>
> From that perspective, I do like your change to use the memcg
> iterator functions rather than a for loop over the range of indexes,
> but I'd much prefer to see this method used consistently everywhere
> rather than the way we've duplicated lots of code by tacking memcgs
> onto the side of the non-memcg code paths.
>

Agreed, that would be better.
I will think about it.
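
For reference, the two helpers this discussion leans on are small; a
sketch of what they presumably look like (for_each_mem_cgroup() wrapping
the existing mem_cgroup_iter() walker, is_root_memcg() comparing against
the global root_mem_cgroup; both are assumptions about the series, not
final patch contents):

        /* Sketches only; the exact definitions belong to the patchset. */
        #define for_each_mem_cgroup(memcg)                              \
                for (memcg = mem_cgroup_iter(NULL, NULL, NULL);         \
                     memcg != NULL;                                     \
                     memcg = mem_cgroup_iter(NULL, memcg, NULL))

        static inline bool is_root_memcg(struct mem_cgroup *memcg)
        {
                return memcg == root_mem_cgroup;
        }

One detail presumably still to check when the patch is respun: breaking
out of such a loop early leaves a css reference held by the iterator, so
the walkers above would also want a mem_cgroup_iter_break() on the early
exit paths.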

> > > Ever since we went to
> > > per-node memcg lru lists (~2015, iirc), there's been this special
> > > hidden hack for the root memcg to use the global list, and one that
> > > I have to read lots of code to remind myself it exists every time I
> > > have to dig into this code again.
> > >
> > > I mean, if the list is not memcg aware, then it only needs a single
> > > list per node - the root memcg list. That could be easily stored in
> > > the memcg_lrus array for the node rather than a separate global
> > > list, and then the node iteration code all starts to look like this:
> > >
> > >         nlru = &lru->node[nid];
> > >         for_each_mem_cgroup(memcg) {
> > >                 spin_lock(&nlru->lock);
> > >                 isolated += __list_lru_walk_one(nlru, memcg,
> > >                                                 isolate, cb_arg,
> > >                                                 nr_to_walk);
> > >                 spin_unlock(&nlru->lock);
> > >                 if (*nr_to_walk <= 0)
> > >                         break;
> > >
> > >                 /* non-memcg aware lists only have a root memcg list */
> > >                 if (!list_lru_memcg_aware(lru))
> > >                         break;
> > >         }
> > >
> > > Hence for the walks in the !list_lru_memcg_aware(lru) case, the
> > > list_lru_from_memcg() call in __list_lru_walk_one() always returns
> > > just the root list. This makes everything use the same iteration
> > > interface, and when you configure out memcg then we simply code
> > > the iterator to run once and list_lru_from_memcg() always returns
> > > the one list...
> > >
> >
> > I agree that it is a better architecture, and I also want to
> > change it like this.
> > That would be clearer and easier to maintain.
> > But it requires lots of code changes; should we address it in a
> > separate patchset?
>
> Yes. I think this is a separate piece of work as it spans much more
> than just the list-lru infrastructure.
>

Got it.
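
To make that direction concrete, here is a sketch of how the per-node
lookup could collapse once the root memcg gets a real kmemcg_id and even
non-memcg-aware lrus keep their single list in the root slot of
memcg_lrus (purely hypothetical restructuring, not current kernel code;
is_root_memcg() and the slot-0 convention are assumptions):

        static struct list_lru_one *
        list_lru_from_memcg(struct list_lru_node *nlru, struct mem_cgroup *memcg)
        {
                struct list_lru_memcg *memcg_lrus;
                /* hypothetical: root memcg owns slot 0 instead of id -1 */
                int idx = is_root_memcg(memcg) ? 0 : memcg_cache_id(memcg);

                memcg_lrus = rcu_dereference_check(nlru->memcg_lrus,
                                                   lockdep_is_held(&nlru->lock));
                return memcg_lrus->lru[idx];
        }

With something like that in place, the !list_lru_memcg_aware() special
cases would mostly disappear from the walkers themselves.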

> > > And, FWIW, I noticed that the list_lru memcg code assumes we only
> > > ever put objects from memcg associated slab pages in the list_lru.
> > > It calls memcg_from_slab_page() which makes no attempt to verify the
> > > page is actually a slab page. That's a landmine just waiting to go
> > > boom - list-lru can store any type of object the user wants, not
> > > just slab pages. e.g. the binder code stores pages in the list-lru,
> > > not slab objects, and so the only reason it doesn't go boom is that
> > > the lru-list is not configured to be memcg aware....
> > >
> >
> > Another new issue.
> > I will try to diagnose what is hidden in this landmine, and after I
> > understand it clearly I will submit a new patchset.
>
> The problem is memcg_from_slab_page() uses page->slab_cache directly
> to retrieve the owner memcg without first checking the
> PageSlab(page) flag. If it's not a slab page, we need to get the
> memcg from page->memcg, not page->slab_cache->memcg_params.memcg.
>
> See page_cgroup_ino() for how to safely get the owner memcg from a
> random page of unknown type...
>

Understood, many thanks for your explanation.


Thanks
Yafang

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2020-01-07 13:22 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-24  7:53 [PATCH v2 0/5] protect page cache from freeing inode Yafang Shao
2019-12-24  7:53 ` [PATCH v2 1/5] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
2019-12-26 21:23   ` Roman Gushchin
2019-12-27  1:03     ` Yafang Shao
2019-12-27  1:03       ` Yafang Shao
2019-12-24  7:53 ` [PATCH v2 2/5] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case Yafang Shao
2019-12-26 21:36   ` Roman Gushchin
2019-12-27  1:09     ` Yafang Shao
2019-12-27  1:09       ` Yafang Shao
2019-12-24  7:53 ` [PATCH v2 3/5] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself Yafang Shao
2019-12-26 21:45   ` Roman Gushchin
2019-12-27  1:11     ` Yafang Shao
2019-12-27  1:11       ` Yafang Shao
2019-12-24  7:53 ` [PATCH v2 4/5] mm: make memcg visible to lru walker isolation function Yafang Shao
2020-01-04  3:35   ` Dave Chinner
2020-01-04  7:26     ` Yafang Shao
2020-01-04  7:26       ` Yafang Shao
2020-01-04 21:23       ` Dave Chinner
2020-01-05  1:43         ` Yafang Shao
2020-01-05  1:43           ` Yafang Shao
2020-01-06  0:17           ` Dave Chinner
2020-01-06 14:41             ` Yafang Shao
2020-01-06 14:41               ` Yafang Shao
2020-01-06 21:31               ` Dave Chinner
2020-01-07 13:22                 ` Yafang Shao
2020-01-07 13:22                   ` Yafang Shao
2019-12-24  7:53 ` [PATCH v2 5/5] memcg, inode: protect page cache from freeing inode Yafang Shao
2019-12-25 13:01   ` kbuild test robot
2019-12-25 13:01     ` kbuild test robot
2019-12-25 13:18   ` kbuild test robot
2019-12-25 13:18     ` kbuild test robot
2019-12-26  5:09     ` Yafang Shao
2019-12-26  5:09       ` Yafang Shao
2020-01-04  3:55   ` Dave Chinner
2020-01-04  7:42     ` Yafang Shao
2020-01-04  7:42       ` Yafang Shao
