[PATCH 0/4] memcg, inode: protect page cache from freeing inode

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/4] memcg, inode: protect page cache from freeing inode
@ 2019-12-17 11:29 Yafang Shao
  2019-12-17 11:29 ` [PATCH 1/4] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-17 11:29 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao

On my server there're some running MEMCGs protected by memory.{min, low},
but I found the usage of these MEMCGs abruptly became very small, which
were far less than the protect limit. It confused me and finally I
found that was because of inode stealing.
Once an inode is freed, all its belonging page caches will be dropped as
well, no matter how may page caches it has. So if we intend to protect the
page caches in a memcg, we must protect their host (the inode) first.
Otherwise the memcg protection can be easily bypassed with freeing inode,
especially if there're big files in this memcg.
The inherent mismatch between memcg and inode is a trouble. One inode can
be shared by different MEMCGs, but it is a very rare case. If an inode is
shared, its belonging page caches may be charged to different MEMCGs.
Currently there's no perfect solution to fix this kind of issue, but the
inode majority-writer ownership switching can help it more or less.

This patchset contains four patches, in which patches 1-3 are minor
optimization and also the preparation of patch 4, and patch 4 is the real
issue I want to fix.

Yafang Shao (4):
  mm, memcg: reduce size of struct mem_cgroup by using bit field
  mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case
  mm, memcg: reset memcg's memory.{min, low} for reclaiming itself
  memcg, inode: protect page cache from freeing inode

 fs/inode.c                 |  9 +++++++
 include/linux/memcontrol.h | 37 ++++++++++++++++++++++-------
 mm/memcontrol.c            | 59 ++++++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                | 10 ++++++++
 4 files changed, 104 insertions(+), 11 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/4] mm, memcg: reduce size of struct mem_cgroup by using bit field
  2019-12-17 11:29 [PATCH 0/4] memcg, inode: protect page cache from freeing inode Yafang Shao
@ 2019-12-17 11:29 ` Yafang Shao
  2019-12-17 11:29 ` [PATCH 2/4] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case Yafang Shao
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-17 11:29 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao, Aaron Lu

There are some members in struct mem_group can be either 0(false) or
1(true), so we can define them using bit field to reduce size. With this
patch, the size of struct mem_cgroup can be reduced by 64 bytes in theory,
but as there're some MEMCG_PADDING()s, the real number may be different,
which is relate with the cacheline size. Anyway, this patch could reduce
the size of struct mem_cgroup more or less.

Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7a0a1a5..612a457 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -229,20 +229,26 @@ struct mem_cgroup {
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
-	bool use_hierarchy;
+	unsigned int use_hierarchy : 1;
 
 	/*
 	 * Should the OOM killer kill all belonging tasks, had it kill one?
 	 */
-	bool oom_group;
+	unsigned int  oom_group : 1;
 
 	/* protected by memcg_oom_lock */
-	bool		oom_lock;
-	int		under_oom;
+	unsigned int oom_lock : 1;
 
-	int	swappiness;
 	/* OOM-Killer disable */
-	int		oom_kill_disable;
+	unsigned int oom_kill_disable : 1;
+
+	/* Legacy tcp memory accounting */
+	unsigned int tcpmem_active : 1;
+	unsigned int tcpmem_pressure : 1;
+
+	int under_oom;
+
+	int	swappiness;
 
 	/* memory.events and memory.events.local */
 	struct cgroup_file events_file;
@@ -297,9 +303,6 @@ struct mem_cgroup {
 
 	unsigned long		socket_pressure;
 
-	/* Legacy tcp memory accounting */
-	bool			tcpmem_active;
-	int			tcpmem_pressure;
 
 #ifdef CONFIG_MEMCG_KMEM
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/4] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case
  2019-12-17 11:29 [PATCH 0/4] memcg, inode: protect page cache from freeing inode Yafang Shao
  2019-12-17 11:29 ` [PATCH 1/4] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
@ 2019-12-17 11:29 ` Yafang Shao
  2019-12-17 11:29 ` [PATCH 3/4] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself Yafang Shao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-17 11:29 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao, Roman Gushchin

If the usage of a memcg is zero, we don't need to do useless work to scan
it. That is a minor optimization.

Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 1 +
 mm/memcontrol.c            | 2 +-
 mm/vmscan.c                | 6 ++++++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 612a457..1a315c7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -54,6 +54,7 @@ enum mem_cgroup_protection {
 	MEMCG_PROT_NONE,
 	MEMCG_PROT_LOW,
 	MEMCG_PROT_MIN,
+	MEMCG_PROT_SKIP,	/* For zero usage case */
 };
 
 struct mem_cgroup_reclaim_cookie {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74..f35fcca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6292,7 +6292,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 
 	usage = page_counter_read(&memcg->memory);
 	if (!usage)
-		return MEMCG_PROT_NONE;
+		return MEMCG_PROT_SKIP;
 
 	emin = memcg->memory.min;
 	elow = memcg->memory.low;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a6445e..3c4c2da 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2677,6 +2677,12 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * thresholds (see get_scan_count).
 			 */
 			break;
+		case MEMCG_PROT_SKIP:
+			/*
+			 * Skip scanning this memcg if the usage of it is
+			 * zero.
+			 */
+			continue;
 		}
 
 		reclaimed = sc->nr_reclaimed;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/4] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself
  2019-12-17 11:29 [PATCH 0/4] memcg, inode: protect page cache from freeing inode Yafang Shao
  2019-12-17 11:29 ` [PATCH 1/4] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
  2019-12-17 11:29 ` [PATCH 2/4] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case Yafang Shao
@ 2019-12-17 11:29 ` Yafang Shao
  2019-12-17 14:20   ` Chris Down
  2019-12-17 11:29 ` [PATCH 4/4] memcg, inode: protect page cache from freeing inode Yafang Shao
  2019-12-17 11:56 ` [PATCH 0/4] " Michal Hocko
  4 siblings, 1 reply; 23+ messages in thread
From: Yafang Shao @ 2019-12-17 11:29 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao, Chris Down

memory.{emin, elow} are set in mem_cgroup_protected(), and the values of
them won't be changed until next recalculation in this function. After
either or both of them are set, the next reclaimer to relcaim this memcg
may be a different reclaimer, e.g. this memcg is also the root memcg of
the new reclaimer, and then in mem_cgroup_protection() in get_scan_count()
the old values of them will be used to calculate scan count, that is not
proper. We should reset them to zero in this case.

Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/memcontrol.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f35fcca..234370c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6287,8 +6287,17 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 
 	if (!root)
 		root = root_mem_cgroup;
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * Reset memory.(emin, elow) for reclaiming the memcg
+		 * itself.
+		 */
+		if (memcg != root_mem_cgroup) {
+			memcg->memory.emin = 0;
+			memcg->memory.emin = 0;
+		}
 		return MEMCG_PROT_NONE;
+	}
 
 	usage = page_counter_read(&memcg->memory);
 	if (!usage)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself
  2019-12-17 11:29 ` [PATCH 3/4] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself Yafang Shao
@ 2019-12-17 14:20   ` Chris Down
  2019-12-18  1:13     ` Yafang Shao
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Down @ 2019-12-17 14:20 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, mhocko, vdavydov.dev, akpm, viro, linux-mm, linux-fsdevel

Hi Yafang,

Yafang Shao writes:
>memory.{emin, elow} are set in mem_cgroup_protected(), and the values of
>them won't be changed until next recalculation in this function. After
>either or both of them are set, the next reclaimer to relcaim this memcg
>may be a different reclaimer, e.g. this memcg is also the root memcg of
>the new reclaimer, and then in mem_cgroup_protection() in get_scan_count()
>the old values of them will be used to calculate scan count, that is not
>proper. We should reset them to zero in this case.

If the memcg in question is passed as "root" to mem_cgroup_protected with a 
child as the new "memcg" argument, then I still don't see what is wrong. 
mem_cgroup_protected must be called top-down from the root of the hierarchy in 
order to work already, which we already do in shrink_node_memcgs. This will 
already update the tree's cached effective protections properly, as far as I 
can see.

As such I'm not sure I understand what you mean in the changelog or in the 
patch. emin/elow as a mechanism is already intended to be racy/best-effort, 
since by the time we get to doing work it's always possible that reclaim 
eligibility state changed, and callers have to consider that.

Could you please explain further the situation you're trying to guard against? 
Thanks.

>Cc: Chris Down <chris@chrisdown.name>
>Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>---
> mm/memcontrol.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>index f35fcca..234370c 100644
>--- a/mm/memcontrol.c
>+++ b/mm/memcontrol.c
>@@ -6287,8 +6287,17 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>
> 	if (!root)
> 		root = root_mem_cgroup;
>-	if (memcg == root)
>+	if (memcg == root) {
>+		/*
>+		 * Reset memory.(emin, elow) for reclaiming the memcg
>+		 * itself.
>+		 */
>+		if (memcg != root_mem_cgroup) {
>+			memcg->memory.emin = 0;
>+			memcg->memory.emin = 0;
>+		}
> 		return MEMCG_PROT_NONE;
>+	}
>
> 	usage = page_counter_read(&memcg->memory);
> 	if (!usage)
>-- 
>1.8.3.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself
  2019-12-17 14:20   ` Chris Down
@ 2019-12-18  1:13     ` Yafang Shao
  0 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-18  1:13 UTC (permalink / raw)
  To: Chris Down
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel

On Tue, Dec 17, 2019 at 10:20 PM Chris Down <chris@chrisdown.name> wrote:
>
> Hi Yafang,
>
> Yafang Shao writes:
> >memory.{emin, elow} are set in mem_cgroup_protected(), and the values of
> >them won't be changed until next recalculation in this function. After
> >either or both of them are set, the next reclaimer to relcaim this memcg
> >may be a different reclaimer, e.g. this memcg is also the root memcg of
> >the new reclaimer, and then in mem_cgroup_protection() in get_scan_count()
> >the old values of them will be used to calculate scan count, that is not
> >proper. We should reset them to zero in this case.
>
> If the memcg in question is passed as "root" to mem_cgroup_protected with a
> child as the new "memcg" argument, then I still don't see what is wrong.
> mem_cgroup_protected must be called top-down from the root of the hierarchy in
> order to work already, which we already do in shrink_node_memcgs. This will
> already update the tree's cached effective protections properly, as far as I
> can see.
>

Right.

> As such I'm not sure I understand what you mean in the changelog or in the
> patch. emin/elow as a mechanism is already intended to be racy/best-effort,
> since by the time we get to doing work it's always possible that reclaim
> eligibility state changed, and callers have to consider that.
>
> Could you please explain further the situation you're trying to guard against?
> Thanks.
>

Considering bellow case,

         root_mem_cgroup
           /
        A   memory.max=1024M memory.min=512M memory.current=800M

Once kswapd is waked up, it will try to scan all MEMCGs, including
this A, and it will assign memory.emin of A to 512M.
After that, A may reach its hard limit(memory.max), and then it will
do memcg reclaim. Because A is the root of this reclaimer, so it will
not calculate its memory.emin. So the memory.emin is the old vaule
512M, and then this old value will be used to in
mem_cgroup_protection() in get_scan_count() to get the scan count.
That is not proper.

Right ?


Thanks
Yafang

> >Cc: Chris Down <chris@chrisdown.name>
> >Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >---
> > mm/memcontrol.c | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> >diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >index f35fcca..234370c 100644
> >--- a/mm/memcontrol.c
> >+++ b/mm/memcontrol.c
> >@@ -6287,8 +6287,17 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> >
> >       if (!root)
> >               root = root_mem_cgroup;
> >-      if (memcg == root)
> >+      if (memcg == root) {
> >+              /*
> >+               * Reset memory.(emin, elow) for reclaiming the memcg
> >+               * itself.
> >+               */
> >+              if (memcg != root_mem_cgroup) {
> >+                      memcg->memory.emin = 0;
> >+                      memcg->memory.emin = 0;
> >+              }
> >               return MEMCG_PROT_NONE;
> >+      }
> >
> >       usage = page_counter_read(&memcg->memory);
> >       if (!usage)
> >--
> >1.8.3.1
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 4/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 11:29 [PATCH 0/4] memcg, inode: protect page cache from freeing inode Yafang Shao
                   ` (2 preceding siblings ...)
  2019-12-17 11:29 ` [PATCH 3/4] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself Yafang Shao
@ 2019-12-17 11:29 ` Yafang Shao
  2019-12-18  2:21   ` Dave Chinner
  2019-12-18 17:53   ` Roman Gushchin
  2019-12-17 11:56 ` [PATCH 0/4] " Michal Hocko
  4 siblings, 2 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-17 11:29 UTC (permalink / raw)
  To: hannes, mhocko, vdavydov.dev, akpm, viro
  Cc: linux-mm, linux-fsdevel, Yafang Shao, Roman Gushchin, Chris Down,
	Dave Chinner

On my server there're some running MEMCGs protected by memory.{min, low},
but I found the usage of these MEMCGs abruptly became very small, which
were far less than the protect limit. It confused me and finally I
found that was because of inode stealing.
Once an inode is freed, all its belonging page caches will be dropped as
well, no matter how may page caches it has. So if we intend to protect the
page caches in a memcg, we must protect their host (the inode) first.
Otherwise the memcg protection can be easily bypassed with freeing inode,
especially if there're big files in this memcg.
The inherent mismatch between memcg and inode is a trouble. One inode can
be shared by different MEMCGs, but it is a very rare case. If an inode is
shared, its belonging page caches may be charged to different MEMCGs.
Currently there's no perfect solution to fix this kind of issue, but the
inode majority-writer ownership switching can help it more or less.

Cc: Roman Gushchin <guro@fb.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/inode.c                 |  9 +++++++++
 include/linux/memcontrol.h | 15 +++++++++++++++
 mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |  4 ++++
 4 files changed, 74 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index fef457a..b022447 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+
+	/* Page protection only works in reclaimer */
+	if (inode->i_data.nrpages && current->reclaim_state) {
+		if (mem_cgroup_inode_protected(inode)) {
+			spin_unlock(&inode->i_lock);
+			return LRU_ROTATE;
+		}
+	}
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a315c7..21338f0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -247,6 +247,9 @@ struct mem_cgroup {
 	unsigned int tcpmem_active : 1;
 	unsigned int tcpmem_pressure : 1;
 
+	/* Soft protection will be ignored if it's true */
+	unsigned int in_low_reclaim : 1;
+
 	int under_oom;
 
 	int	swappiness;
@@ -363,6 +366,7 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg);
+unsigned long mem_cgroup_inode_protected(struct inode *inode);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -850,6 +854,11 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(
 	return MEMCG_PROT_NONE;
 }
 
+static inline unsigned long mem_cgroup_inode_protected(struct inode *inode)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
 					struct mem_cgroup **memcgp,
@@ -926,6 +935,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
 	return NULL;
 }
 
+static inline struct mem_cgroup *
+mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return NULL;
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 234370c..efb53f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6355,6 +6355,52 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 }
 
 /**
+ * Once an inode is freed, all its belonging page caches will be dropped as
+ * well, even if there're lots of page caches. So if we intend to protect
+ * page caches in a memcg, we must protect their host first. Otherwise the
+ * memory usage can be dropped abruptly if there're big files in this
+ * memcg. IOW the memcy protection can be easily bypassed with freeing
+ * inode. We should prevent it.
+ * The inherent mismatch between memcg and inode is a trouble. One inode
+ * can be shared by different MEMCGs, but it is a very rare case. If
+ * an inode is shared, its belonging page caches may be charged to
+ * different MEMCGs. Currently there's no perfect solution to fix this
+ * kind of issue, but the inode majority-writer ownership switching can
+ * help it more or less.
+ */
+unsigned long mem_cgroup_inode_protected(struct inode *inode)
+{
+	unsigned long cgroup_size;
+	unsigned long protect = 0;
+	struct bdi_writeback *wb;
+	struct mem_cgroup *memcg;
+
+	wb = inode_to_wb(inode);
+	if (!wb)
+		goto out;
+
+	memcg = mem_cgroup_from_css(wb->memcg_css);
+	if (!memcg || memcg == root_mem_cgroup)
+		goto out;
+
+	protect = mem_cgroup_protection(memcg, memcg->in_low_reclaim);
+	if (!protect)
+		goto out;
+
+	cgroup_size = mem_cgroup_size(memcg);
+	/*
+	 * Don't need to protect this inode, if the usage is still above
+	 * the limit after reclaiming this inode and its belonging page
+	 * caches.
+	 */
+	if (inode->i_data.nrpages + protect < cgroup_size)
+		protect = 0;
+
+out:
+	return protect;
+}
+
+/**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
  * @mm: mm context of the victim
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3c4c2da..1cc7fc2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2666,6 +2666,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				sc->memcg_low_skipped = 1;
 				continue;
 			}
+			memcg->in_low_reclaim = 1;
 			memcg_memory_event(memcg, MEMCG_LOW);
 			break;
 		case MEMCG_PROT_NONE:
@@ -2693,6 +2694,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
 			    sc->priority);
 
+		if (memcg->in_low_reclaim)
+			memcg->in_low_reclaim = 0;
+
 		/* Record the group's reclaim efficiency */
 		vmpressure(sc->gfp_mask, memcg, false,
 			   sc->nr_scanned - scanned,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 11:29 ` [PATCH 4/4] memcg, inode: protect page cache from freeing inode Yafang Shao
@ 2019-12-18  2:21   ` Dave Chinner
  2019-12-18  2:33     ` Yafang Shao
  2019-12-18 17:53   ` Roman Gushchin
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2019-12-18  2:21 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, mhocko, vdavydov.dev, akpm, viro, linux-mm,
	linux-fsdevel, Roman Gushchin, Chris Down, Dave Chinner

On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> On my server there're some running MEMCGs protected by memory.{min, low},
> but I found the usage of these MEMCGs abruptly became very small, which
> were far less than the protect limit. It confused me and finally I
> found that was because of inode stealing.
> Once an inode is freed, all its belonging page caches will be dropped as
> well, no matter how may page caches it has. So if we intend to protect the
> page caches in a memcg, we must protect their host (the inode) first.
> Otherwise the memcg protection can be easily bypassed with freeing inode,
> especially if there're big files in this memcg.
> The inherent mismatch between memcg and inode is a trouble. One inode can
> be shared by different MEMCGs, but it is a very rare case. If an inode is
> shared, its belonging page caches may be charged to different MEMCGs.
> Currently there's no perfect solution to fix this kind of issue, but the
> inode majority-writer ownership switching can help it more or less.
> 
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Chris Down <chris@chrisdown.name>
> Cc: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  fs/inode.c                 |  9 +++++++++
>  include/linux/memcontrol.h | 15 +++++++++++++++
>  mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |  4 ++++
>  4 files changed, 74 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a..b022447 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  	if (!spin_trylock(&inode->i_lock))
>  		return LRU_SKIP;
>  
> +
> +	/* Page protection only works in reclaimer */
> +	if (inode->i_data.nrpages && current->reclaim_state) {
> +		if (mem_cgroup_inode_protected(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			return LRU_ROTATE;

Urk, so after having plumbed the memcg all the way down to the
list_lru walk code so that we only walk inodes in that memcg, we now
have to do a lookup from the inode back to the owner memcg to
determine if we should reclaim it? IOWs, I think the layering here
is all wrong - if memcg info is needed in the shrinker, it should
come from the shrink_control->memcg pointer, not be looked up from
the object being isolated...

i.e. this code should read something like this:

	if (memcg && inode->i_data.nrpages &&
	    (!memcg_can_reclaim_inode(memcg, inode)) {
		spin_unlock(&inode->i_lock);
		return LRU_ROTATE;
	}

This code does not need comments because it is obvious what it does,
and it provides a generic hook into inode reclaim for the memcg code
to decide whether the shrinker should reclaim the inode or not.

This is how the memcg code should interact with other shrinkers, too
(e.g. the dentry cache isolation function), so you need to look at
how to make the memcg visible to the lru walker isolation
functions....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] memcg, inode: protect page cache from freeing inode
  2019-12-18  2:21   ` Dave Chinner
@ 2019-12-18  2:33     ` Yafang Shao
  0 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-18  2:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel, Roman Gushchin, Chris Down,
	Dave Chinner

On Wed, Dec 18, 2019 at 10:21 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> > On my server there're some running MEMCGs protected by memory.{min, low},
> > but I found the usage of these MEMCGs abruptly became very small, which
> > were far less than the protect limit. It confused me and finally I
> > found that was because of inode stealing.
> > Once an inode is freed, all its belonging page caches will be dropped as
> > well, no matter how may page caches it has. So if we intend to protect the
> > page caches in a memcg, we must protect their host (the inode) first.
> > Otherwise the memcg protection can be easily bypassed with freeing inode,
> > especially if there're big files in this memcg.
> > The inherent mismatch between memcg and inode is a trouble. One inode can
> > be shared by different MEMCGs, but it is a very rare case. If an inode is
> > shared, its belonging page caches may be charged to different MEMCGs.
> > Currently there's no perfect solution to fix this kind of issue, but the
> > inode majority-writer ownership switching can help it more or less.
> >
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Chris Down <chris@chrisdown.name>
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  fs/inode.c                 |  9 +++++++++
> >  include/linux/memcontrol.h | 15 +++++++++++++++
> >  mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/vmscan.c                |  4 ++++
> >  4 files changed, 74 insertions(+)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index fef457a..b022447 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >       if (!spin_trylock(&inode->i_lock))
> >               return LRU_SKIP;
> >
> > +
> > +     /* Page protection only works in reclaimer */
> > +     if (inode->i_data.nrpages && current->reclaim_state) {
> > +             if (mem_cgroup_inode_protected(inode)) {
> > +                     spin_unlock(&inode->i_lock);
> > +                     return LRU_ROTATE;
>
> Urk, so after having plumbed the memcg all the way down to the
> list_lru walk code so that we only walk inodes in that memcg, we now
> have to do a lookup from the inode back to the owner memcg to
> determine if we should reclaim it? IOWs, I think the layering here
> is all wrong - if memcg info is needed in the shrinker, it should
> come from the shrink_control->memcg pointer, not be looked up from
> the object being isolated...
>

Agree with you that the layering here is not good.
I had tried to use shrink_control->memcg pointer as an argument or
something else,  but I found that will change lots of code.
I don't want to change too much code, so I implement it this way,
although it looks a litte strange.

> i.e. this code should read something like this:
>
>         if (memcg && inode->i_data.nrpages &&
>             (!memcg_can_reclaim_inode(memcg, inode)) {
>                 spin_unlock(&inode->i_lock);
>                 return LRU_ROTATE;
>         }
>
> This code does not need comments because it is obvious what it does,
> and it provides a generic hook into inode reclaim for the memcg code
> to decide whether the shrinker should reclaim the inode or not.
>
> This is how the memcg code should interact with other shrinkers, too
> (e.g. the dentry cache isolation function), so you need to look at
> how to make the memcg visible to the lru walker isolation
> functions....
>

Thanks for your suggestion.
I will rethink it torwards this way.

Thanks
Yafang

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 11:29 ` [PATCH 4/4] memcg, inode: protect page cache from freeing inode Yafang Shao
  2019-12-18  2:21   ` Dave Chinner
@ 2019-12-18 17:53   ` Roman Gushchin
  2019-12-19  1:45     ` Yafang Shao
  1 sibling, 1 reply; 23+ messages in thread
From: Roman Gushchin @ 2019-12-18 17:53 UTC (permalink / raw)
  To: Yafang Shao
  Cc: hannes, mhocko, vdavydov.dev, akpm, viro, linux-mm,
	linux-fsdevel, Chris Down, Dave Chinner

On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> On my server there're some running MEMCGs protected by memory.{min, low},
> but I found the usage of these MEMCGs abruptly became very small, which
> were far less than the protect limit. It confused me and finally I
> found that was because of inode stealing.
> Once an inode is freed, all its belonging page caches will be dropped as
> well, no matter how may page caches it has. So if we intend to protect the
> page caches in a memcg, we must protect their host (the inode) first.
> Otherwise the memcg protection can be easily bypassed with freeing inode,
> especially if there're big files in this memcg.
> The inherent mismatch between memcg and inode is a trouble. One inode can
> be shared by different MEMCGs, but it is a very rare case. If an inode is
> shared, its belonging page caches may be charged to different MEMCGs.
> Currently there's no perfect solution to fix this kind of issue, but the
> inode majority-writer ownership switching can help it more or less.
> 
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Chris Down <chris@chrisdown.name>
> Cc: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  fs/inode.c                 |  9 +++++++++
>  include/linux/memcontrol.h | 15 +++++++++++++++
>  mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  mm/vmscan.c                |  4 ++++
>  4 files changed, 74 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a..b022447 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  	if (!spin_trylock(&inode->i_lock))
>  		return LRU_SKIP;
>  
> +
> +	/* Page protection only works in reclaimer */
> +	if (inode->i_data.nrpages && current->reclaim_state) {
> +		if (mem_cgroup_inode_protected(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			return LRU_ROTATE;
> +		}
> +	}

Not directly related to this approach, but I wonder, if we should scale down
the size of shrinker lists depending on the memory protection (like we do with
LRU lists)? It won't fix the problem with huge inodes being reclaimed at once
without a need, but will help scale the memory pressure for protected cgroups.

Thanks!


> +
>  	/*
>  	 * Referenced or dirty inodes are still in use. Give them another pass
>  	 * through the LRU as we canot reclaim them now.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1a315c7..21338f0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -247,6 +247,9 @@ struct mem_cgroup {
>  	unsigned int tcpmem_active : 1;
>  	unsigned int tcpmem_pressure : 1;
>  
> +	/* Soft protection will be ignored if it's true */
> +	unsigned int in_low_reclaim : 1;
> +
>  	int under_oom;
>  
>  	int	swappiness;
> @@ -363,6 +366,7 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
>  
>  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  						struct mem_cgroup *memcg);
> +unsigned long mem_cgroup_inode_protected(struct inode *inode);
>  
>  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
> @@ -850,6 +854,11 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(
>  	return MEMCG_PROT_NONE;
>  }
>  
> +static inline unsigned long mem_cgroup_inode_protected(struct inode *inode)
> +{
> +	return 0;
> +}
> +
>  static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  					gfp_t gfp_mask,
>  					struct mem_cgroup **memcgp,
> @@ -926,6 +935,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
>  	return NULL;
>  }
>  
> +static inline struct mem_cgroup *
> +mem_cgroup_from_css(struct cgroup_subsys_state *css)
> +{
> +	return NULL;
> +}
> +
>  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
>  {
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 234370c..efb53f3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,52 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  }
>  
>  /**
> + * Once an inode is freed, all its belonging page caches will be dropped as
> + * well, even if there're lots of page caches. So if we intend to protect
> + * page caches in a memcg, we must protect their host first. Otherwise the
> + * memory usage can be dropped abruptly if there're big files in this
> + * memcg. IOW the memcy protection can be easily bypassed with freeing
> + * inode. We should prevent it.
> + * The inherent mismatch between memcg and inode is a trouble. One inode
> + * can be shared by different MEMCGs, but it is a very rare case. If
> + * an inode is shared, its belonging page caches may be charged to
> + * different MEMCGs. Currently there's no perfect solution to fix this
> + * kind of issue, but the inode majority-writer ownership switching can
> + * help it more or less.
> + */
> +unsigned long mem_cgroup_inode_protected(struct inode *inode)
> +{
> +	unsigned long cgroup_size;
> +	unsigned long protect = 0;
> +	struct bdi_writeback *wb;
> +	struct mem_cgroup *memcg;
> +
> +	wb = inode_to_wb(inode);
> +	if (!wb)
> +		goto out;
> +
> +	memcg = mem_cgroup_from_css(wb->memcg_css);
> +	if (!memcg || memcg == root_mem_cgroup)
> +		goto out;
> +
> +	protect = mem_cgroup_protection(memcg, memcg->in_low_reclaim);
> +	if (!protect)
> +		goto out;
> +
> +	cgroup_size = mem_cgroup_size(memcg);
> +	/*
> +	 * Don't need to protect this inode, if the usage is still above
> +	 * the limit after reclaiming this inode and its belonging page
> +	 * caches.
> +	 */
> +	if (inode->i_data.nrpages + protect < cgroup_size)
> +		protect = 0;
> +
> +out:
> +	return protect;
> +}
> +
> +/**
>   * mem_cgroup_try_charge - try charging a page
>   * @page: page to charge
>   * @mm: mm context of the victim
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3c4c2da..1cc7fc2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2666,6 +2666,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  				sc->memcg_low_skipped = 1;
>  				continue;
>  			}
> +			memcg->in_low_reclaim = 1;
>  			memcg_memory_event(memcg, MEMCG_LOW);
>  			break;
>  		case MEMCG_PROT_NONE:
> @@ -2693,6 +2694,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
>  			    sc->priority);
>  
> +		if (memcg->in_low_reclaim)
> +			memcg->in_low_reclaim = 0;
> +
>  		/* Record the group's reclaim efficiency */
>  		vmpressure(sc->gfp_mask, memcg, false,
>  			   sc->nr_scanned - scanned,
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] memcg, inode: protect page cache from freeing inode
  2019-12-18 17:53   ` Roman Gushchin
@ 2019-12-19  1:45     ` Yafang Shao
  0 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-19  1:45 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: hannes, mhocko, vdavydov.dev, akpm, viro, linux-mm,
	linux-fsdevel, Chris Down, Dave Chinner

On Thu, Dec 19, 2019 at 1:53 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> > On my server there're some running MEMCGs protected by memory.{min, low},
> > but I found the usage of these MEMCGs abruptly became very small, which
> > were far less than the protect limit. It confused me and finally I
> > found that was because of inode stealing.
> > Once an inode is freed, all its belonging page caches will be dropped as
> > well, no matter how may page caches it has. So if we intend to protect the
> > page caches in a memcg, we must protect their host (the inode) first.
> > Otherwise the memcg protection can be easily bypassed with freeing inode,
> > especially if there're big files in this memcg.
> > The inherent mismatch between memcg and inode is a trouble. One inode can
> > be shared by different MEMCGs, but it is a very rare case. If an inode is
> > shared, its belonging page caches may be charged to different MEMCGs.
> > Currently there's no perfect solution to fix this kind of issue, but the
> > inode majority-writer ownership switching can help it more or less.
> >
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: Chris Down <chris@chrisdown.name>
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  fs/inode.c                 |  9 +++++++++
> >  include/linux/memcontrol.h | 15 +++++++++++++++
> >  mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> >  mm/vmscan.c                |  4 ++++
> >  4 files changed, 74 insertions(+)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index fef457a..b022447 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >       if (!spin_trylock(&inode->i_lock))
> >               return LRU_SKIP;
> >
> > +
> > +     /* Page protection only works in reclaimer */
> > +     if (inode->i_data.nrpages && current->reclaim_state) {
> > +             if (mem_cgroup_inode_protected(inode)) {
> > +                     spin_unlock(&inode->i_lock);
> > +                     return LRU_ROTATE;
> > +             }
> > +     }
>
> Not directly related to this approach, but I wonder, if we should scale down
> the size of shrinker lists depending on the memory protection (like we do with
> LRU lists)? It won't fix the problem with huge inodes being reclaimed at once
> without a need, but will help scale the memory pressure for protected cgroups.
>

Same with what we are doing in get_scan_count() to calculate how many
pages we should scan ?
I guess we should.

>
>
> > +
> >       /*
> >        * Referenced or dirty inodes are still in use. Give them another pass
> >        * through the LRU as we canot reclaim them now.
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 1a315c7..21338f0 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -247,6 +247,9 @@ struct mem_cgroup {
> >       unsigned int tcpmem_active : 1;
> >       unsigned int tcpmem_pressure : 1;
> >
> > +     /* Soft protection will be ignored if it's true */
> > +     unsigned int in_low_reclaim : 1;
> > +
> >       int under_oom;
> >
> >       int     swappiness;
> > @@ -363,6 +366,7 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> >
> >  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> >                                               struct mem_cgroup *memcg);
> > +unsigned long mem_cgroup_inode_protected(struct inode *inode);
> >
> >  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> >                         gfp_t gfp_mask, struct mem_cgroup **memcgp,
> > @@ -850,6 +854,11 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(
> >       return MEMCG_PROT_NONE;
> >  }
> >
> > +static inline unsigned long mem_cgroup_inode_protected(struct inode *inode)
> > +{
> > +     return 0;
> > +}
> > +
> >  static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> >                                       gfp_t gfp_mask,
> >                                       struct mem_cgroup **memcgp,
> > @@ -926,6 +935,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
> >       return NULL;
> >  }
> >
> > +static inline struct mem_cgroup *
> > +mem_cgroup_from_css(struct cgroup_subsys_state *css)
> > +{
> > +     return NULL;
> > +}
> > +
> >  static inline void mem_cgroup_put(struct mem_cgroup *memcg)
> >  {
> >  }
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 234370c..efb53f3 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,52 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> >  }
> >
> >  /**
> > + * Once an inode is freed, all its belonging page caches will be dropped as
> > + * well, even if there're lots of page caches. So if we intend to protect
> > + * page caches in a memcg, we must protect their host first. Otherwise the
> > + * memory usage can be dropped abruptly if there're big files in this
> > + * memcg. IOW the memcy protection can be easily bypassed with freeing
> > + * inode. We should prevent it.
> > + * The inherent mismatch between memcg and inode is a trouble. One inode
> > + * can be shared by different MEMCGs, but it is a very rare case. If
> > + * an inode is shared, its belonging page caches may be charged to
> > + * different MEMCGs. Currently there's no perfect solution to fix this
> > + * kind of issue, but the inode majority-writer ownership switching can
> > + * help it more or less.
> > + */
> > +unsigned long mem_cgroup_inode_protected(struct inode *inode)
> > +{
> > +     unsigned long cgroup_size;
> > +     unsigned long protect = 0;
> > +     struct bdi_writeback *wb;
> > +     struct mem_cgroup *memcg;
> > +
> > +     wb = inode_to_wb(inode);
> > +     if (!wb)
> > +             goto out;
> > +
> > +     memcg = mem_cgroup_from_css(wb->memcg_css);
> > +     if (!memcg || memcg == root_mem_cgroup)
> > +             goto out;
> > +
> > +     protect = mem_cgroup_protection(memcg, memcg->in_low_reclaim);
> > +     if (!protect)
> > +             goto out;
> > +
> > +     cgroup_size = mem_cgroup_size(memcg);
> > +     /*
> > +      * Don't need to protect this inode, if the usage is still above
> > +      * the limit after reclaiming this inode and its belonging page
> > +      * caches.
> > +      */
> > +     if (inode->i_data.nrpages + protect < cgroup_size)
> > +             protect = 0;
> > +
> > +out:
> > +     return protect;
> > +}
> > +
> > +/**
> >   * mem_cgroup_try_charge - try charging a page
> >   * @page: page to charge
> >   * @mm: mm context of the victim
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3c4c2da..1cc7fc2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2666,6 +2666,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >                               sc->memcg_low_skipped = 1;
> >                               continue;
> >                       }
> > +                     memcg->in_low_reclaim = 1;
> >                       memcg_memory_event(memcg, MEMCG_LOW);
> >                       break;
> >               case MEMCG_PROT_NONE:
> > @@ -2693,6 +2694,9 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >               shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> >                           sc->priority);
> >
> > +             if (memcg->in_low_reclaim)
> > +                     memcg->in_low_reclaim = 0;
> > +
> >               /* Record the group's reclaim efficiency */
> >               vmpressure(sc->gfp_mask, memcg, false,
> >                          sc->nr_scanned - scanned,
> > --
> > 1.8.3.1
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 11:29 [PATCH 0/4] memcg, inode: protect page cache from freeing inode Yafang Shao
                   ` (3 preceding siblings ...)
  2019-12-17 11:29 ` [PATCH 4/4] memcg, inode: protect page cache from freeing inode Yafang Shao
@ 2019-12-17 11:56 ` Michal Hocko
  2019-12-17 12:19   ` Yafang Shao
  4 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-12-17 11:56 UTC (permalink / raw)
  To: Yafang Shao; +Cc: hannes, vdavydov.dev, akpm, viro, linux-mm, linux-fsdevel

On Tue 17-12-19 06:29:15, Yafang Shao wrote:
> On my server there're some running MEMCGs protected by memory.{min, low},
> but I found the usage of these MEMCGs abruptly became very small, which
> were far less than the protect limit. It confused me and finally I
> found that was because of inode stealing.

What do you mean by this exactly. Are those inodes reclaimed by the
regular memory reclaim or by other means? Because shrink_node does
exclude shrinking slab for protected memcgs.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 11:56 ` [PATCH 0/4] " Michal Hocko
@ 2019-12-17 12:19   ` Yafang Shao
  2019-12-17 16:54     ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Yafang Shao @ 2019-12-17 12:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Al Viro,
	Linux MM, linux-fsdevel

On Tue, Dec 17, 2019 at 7:56 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 17-12-19 06:29:15, Yafang Shao wrote:
> > On my server there're some running MEMCGs protected by memory.{min, low},
> > but I found the usage of these MEMCGs abruptly became very small, which
> > were far less than the protect limit. It confused me and finally I
> > found that was because of inode stealing.
>
> What do you mean by this exactly. Are those inodes reclaimed by the
> regular memory reclaim or by other means? Because shrink_node does
> exclude shrinking slab for protected memcgs.

By the regular memory reclaim, kswapd, direct reclaimer or memcg reclaimer.
IOW, the current->reclaim_state it set.

Take an example for you.

kswapd
    balance_pgdat
        shrink_node_memcgs
            switch (mem_cgroup_protected)  <<<< memory.current= 1024M
memory.min = 512M a file has 800M page caches
                case MEMCG_PROT_NONE:  <<<< hard limit is not reached.
                      beak;
            shrink_lruvec
            shrink_slab <<< it may free the inode and the free all its
page caches (800M)


Hope it could clarify.

Thanks
Yafang

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 12:19   ` Yafang Shao
@ 2019-12-17 16:54     ` Johannes Weiner
  2019-12-18  1:17       ` Yafang Shao
                         ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Johannes Weiner @ 2019-12-17 16:54 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Dave Chinner,
	Al Viro, Linux MM, linux-fsdevel

CCing Dave

On Tue, Dec 17, 2019 at 08:19:08PM +0800, Yafang Shao wrote:
> On Tue, Dec 17, 2019 at 7:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > What do you mean by this exactly. Are those inodes reclaimed by the
> > regular memory reclaim or by other means? Because shrink_node does
> > exclude shrinking slab for protected memcgs.
> 
> By the regular memory reclaim, kswapd, direct reclaimer or memcg reclaimer.
> IOW, the current->reclaim_state it set.
> 
> Take an example for you.
> 
> kswapd
>     balance_pgdat
>         shrink_node_memcgs
>             switch (mem_cgroup_protected)  <<<< memory.current= 1024M
> memory.min = 512M a file has 800M page caches
>                 case MEMCG_PROT_NONE:  <<<< hard limit is not reached.
>                       beak;
>             shrink_lruvec
>             shrink_slab <<< it may free the inode and the free all its
> page caches (800M)

This problem exists independent of cgroup protection.

The inode shrinker may take down an inode that's still holding a ton
of (potentially active) page cache pages when the inode hasn't been
referenced recently.

IMO we shouldn't be dropping data that the VM still considers hot
compared to other data, just because the inode object hasn't been used
as recently as other inode objects (e.g. drowned in a stream of
one-off inode accesses).

I've carried the below patch in my private tree for testing cache
aging decisions that the shrinker interfered with. (It would be nicer
if page cache pages could pin the inode of course, but reclaim cannot
easily participate in the inode refcounting scheme.)

Thoughts?

diff --git a/fs/inode.c b/fs/inode.c
index fef457a42882..bfcaaaf6314f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -753,7 +753,13 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 		return LRU_ROTATE;
 	}
 
-	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+	/* Leave the pages to page reclaim */
+	if (inode->i_data.nrpages) {
+		spin_unlock(&inode->i_lock);
+		return LRU_ROTATE;
+	}
+
+	if (inode_has_buffers(inode)) {
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(lru_lock);

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 16:54     ` Johannes Weiner
@ 2019-12-18  1:17       ` Yafang Shao
  2019-12-18  1:37       ` Andrew Morton
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-18  1:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Andrew Morton, Dave Chinner,
	Al Viro, Linux MM, linux-fsdevel

On Wed, Dec 18, 2019 at 12:54 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> CCing Dave
>
> On Tue, Dec 17, 2019 at 08:19:08PM +0800, Yafang Shao wrote:
> > On Tue, Dec 17, 2019 at 7:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > What do you mean by this exactly. Are those inodes reclaimed by the
> > > regular memory reclaim or by other means? Because shrink_node does
> > > exclude shrinking slab for protected memcgs.
> >
> > By the regular memory reclaim, kswapd, direct reclaimer or memcg reclaimer.
> > IOW, the current->reclaim_state it set.
> >
> > Take an example for you.
> >
> > kswapd
> >     balance_pgdat
> >         shrink_node_memcgs
> >             switch (mem_cgroup_protected)  <<<< memory.current= 1024M
> > memory.min = 512M a file has 800M page caches
> >                 case MEMCG_PROT_NONE:  <<<< hard limit is not reached.
> >                       beak;
> >             shrink_lruvec
> >             shrink_slab <<< it may free the inode and the free all its
> > page caches (800M)
>
> This problem exists independent of cgroup protection.
>
> The inode shrinker may take down an inode that's still holding a ton
> of (potentially active) page cache pages when the inode hasn't been
> referenced recently.
>
> IMO we shouldn't be dropping data that the VM still considers hot
> compared to other data, just because the inode object hasn't been used
> as recently as other inode objects (e.g. drowned in a stream of
> one-off inode accesses).
>
> I've carried the below patch in my private tree for testing cache
> aging decisions that the shrinker interfered with. (It would be nicer
> if page cache pages could pin the inode of course, but reclaim cannot
> easily participate in the inode refcounting scheme.)
>
> Thoughts?
>

I have already though about this solution.
But I found there is a similar revert by Dave - see 69056ee6a8a3
("Revert "mm: don't reclaim inodes with many attached pages"").
That's why I CCed Dave in patch-4.
So I only fix it for memcg protection because that will not impact too much.

> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a42882..bfcaaaf6314f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -753,7 +753,13 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>                 return LRU_ROTATE;
>         }
>
> -       if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> +       /* Leave the pages to page reclaim */
> +       if (inode->i_data.nrpages) {
> +               spin_unlock(&inode->i_lock);
> +               return LRU_ROTATE;
> +       }
> +
> +       if (inode_has_buffers(inode)) {
>                 __iget(inode);
>                 spin_unlock(&inode->i_lock);
>                 spin_unlock(lru_lock);

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 16:54     ` Johannes Weiner
  2019-12-18  1:17       ` Yafang Shao
@ 2019-12-18  1:37       ` Andrew Morton
  2019-12-18  1:51       ` Dave Chinner
  2019-12-18 17:27       ` Roman Gushchin
  3 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2019-12-18  1:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yafang Shao, Michal Hocko, Vladimir Davydov, Dave Chinner,
	Al Viro, Linux MM, linux-fsdevel

On Tue, 17 Dec 2019 11:54:22 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> I've carried the below patch in my private tree for testing cache
> aging decisions that the shrinker interfered with. (It would be nicer
> if page cache pages could pin the inode of course, but reclaim cannot
> easily participate in the inode refcounting scheme.)
> 
> ...
> 
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -753,7 +753,13 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  		return LRU_ROTATE;
>  	}
>  
> -	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> +	/* Leave the pages to page reclaim */
> +	if (inode->i_data.nrpages) {
> +		spin_unlock(&inode->i_lock);
> +		return LRU_ROTATE;
> +	}

I guess that code should have been commented...

This code was originally added because on large highmem machines we
were seeing lowmem full of inodes which had one or more highmem pages
attached to them.  Highmem was not under memory pressure so those
pagecache pages remained unreclaimed "for ever", thus pinning their
lowmem inode.  The net result was exhaustion of lowmem.

I guess a #ifdef CONFIG_HIGHMEM would help, to preserve the old
behavior in that case.  Although given the paucity of testing on large
highmem machines, the risk of divergent behavior over time is a
concern.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 16:54     ` Johannes Weiner
  2019-12-18  1:17       ` Yafang Shao
  2019-12-18  1:37       ` Andrew Morton
@ 2019-12-18  1:51       ` Dave Chinner
  2019-12-18  4:37         ` Johannes Weiner
  2019-12-18 17:27       ` Roman Gushchin
  3 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2019-12-18  1:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yafang Shao, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel

On Tue, Dec 17, 2019 at 11:54:22AM -0500, Johannes Weiner wrote:
> CCing Dave
> 
> On Tue, Dec 17, 2019 at 08:19:08PM +0800, Yafang Shao wrote:
> > On Tue, Dec 17, 2019 at 7:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > What do you mean by this exactly. Are those inodes reclaimed by the
> > > regular memory reclaim or by other means? Because shrink_node does
> > > exclude shrinking slab for protected memcgs.
> > 
> > By the regular memory reclaim, kswapd, direct reclaimer or memcg reclaimer.
> > IOW, the current->reclaim_state it set.
> > 
> > Take an example for you.
> > 
> > kswapd
> >     balance_pgdat
> >         shrink_node_memcgs
> >             switch (mem_cgroup_protected)  <<<< memory.current= 1024M
> > memory.min = 512M a file has 800M page caches
> >                 case MEMCG_PROT_NONE:  <<<< hard limit is not reached.
> >                       beak;
> >             shrink_lruvec
> >             shrink_slab <<< it may free the inode and the free all its
> > page caches (800M)

<looks at patch>

Oh, great, yet another special heuristic reclaim hack for some
whacky memcg reclaim corner case.

> This problem exists independent of cgroup protection.
> 
> The inode shrinker may take down an inode that's still holding a ton
> of (potentially active) page cache pages when the inode hasn't been
> referenced recently.

Ok, please explain to me how are those pages getting repeated
referenced and kept active without referencing the inode in some
way?

e.g. active mmap pins a struct file which pins the inode.
e.g. open fd pins a struct file which pins the inode.
e.g. open/read/write/close keeps a dentry active in cache which pins
the inode when not actively referenced by the open fd.

AFAIA, all of the cases where -file pages- are being actively
referenced require also actively referencing the inode in some way.
So why is the inode being reclaimed as an unreferenced inode at the
end of the LRU if these are actively referenced file pages?

> IMO we shouldn't be dropping data that the VM still considers hot
> compared to other data, just because the inode object hasn't been used
> as recently as other inode objects (e.g. drowned in a stream of
> one-off inode accesses).

It should not be drowned by one-off inode accesses because if
the file data is being actively referenced then there should be
frequent active references to the inode that contains the data and
that should be keeping it away from the tail of the inode LRU.

If the inode is not being frequently referenced, then it
isn't really part of the current working set of inodes, is it?

> I've carried the below patch in my private tree for testing cache
> aging decisions that the shrinker interfered with. (It would be nicer
> if page cache pages could pin the inode of course, but reclaim cannot
> easily participate in the inode refcounting scheme.)
> 
> Thoughts?
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a42882..bfcaaaf6314f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -753,7 +753,13 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  		return LRU_ROTATE;
>  	}
>  
> -	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> +	/* Leave the pages to page reclaim */
> +	if (inode->i_data.nrpages) {
> +		spin_unlock(&inode->i_lock);
> +		return LRU_ROTATE;
> +	}

<sigh>

Remember this?

commit 69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Feb 12 15:35:51 2019 -0800

    Revert "mm: don't reclaim inodes with many attached pages"
    
    This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
    attached pages").
    
    This change causes serious changes to page cache and inode cache
    behaviour and balance, resulting in major performance regressions when
    combining worklaods such as large file copies and kernel compiles.
    
      https://bugzilla.kernel.org/show_bug.cgi?id=202441
    
    This change is a hack to work around the problems introduced by changing
    how agressive shrinkers are on small caches in commit 172b06c32b94 ("mm:
    slowly shrink slabs with a relatively small number of objects").  It
    creates more problems than it solves, wasn't adequately reviewed or
    tested, so it needs to be reverted.
    
    Link: http://lkml.kernel.org/r/20190130041707.27750-2-david@fromorbit.com
    Fixes: a76cf1a474d7d ("mm: don't reclaim inodes with many attached pages")
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Cc: Wolfgang Walter <linux@stwm.de>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Spock <dairinin@gmail.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/fs/inode.c b/fs/inode.c
index 0cd47fe0dbe5..73432e64f874 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -730,11 +730,8 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
                return LRU_REMOVED;
        }
 
-       /*
-        * Recently referenced inodes and inodes with many attached pages
-        * get one more pass.
-        */
-       if (inode->i_state & I_REFERENCED || inode->i_data.nrpages > 1) {
+       /* recently referenced inodes get one more pass */
+       if (inode->i_state & I_REFERENCED) {
                inode->i_state &= ~I_REFERENCED;
                spin_unlock(&inode->i_lock);
                return LRU_ROTATE;


-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-18  1:51       ` Dave Chinner
@ 2019-12-18  4:37         ` Johannes Weiner
  2019-12-18 10:16           ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2019-12-18  4:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Yafang Shao, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel

On Wed, Dec 18, 2019 at 12:51:24PM +1100, Dave Chinner wrote:
> On Tue, Dec 17, 2019 at 11:54:22AM -0500, Johannes Weiner wrote:
> > CCing Dave
> > 
> > On Tue, Dec 17, 2019 at 08:19:08PM +0800, Yafang Shao wrote:
> > > On Tue, Dec 17, 2019 at 7:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > What do you mean by this exactly. Are those inodes reclaimed by the
> > > > regular memory reclaim or by other means? Because shrink_node does
> > > > exclude shrinking slab for protected memcgs.
> > > 
> > > By the regular memory reclaim, kswapd, direct reclaimer or memcg reclaimer.
> > > IOW, the current->reclaim_state it set.
> > > 
> > > Take an example for you.
> > > 
> > > kswapd
> > >     balance_pgdat
> > >         shrink_node_memcgs
> > >             switch (mem_cgroup_protected)  <<<< memory.current= 1024M
> > > memory.min = 512M a file has 800M page caches
> > >                 case MEMCG_PROT_NONE:  <<<< hard limit is not reached.
> > >                       beak;
> > >             shrink_lruvec
> > >             shrink_slab <<< it may free the inode and the free all its
> > > page caches (800M)
> 
> <looks at patch>
> 
> Oh, great, yet another special heuristic reclaim hack for some
> whacky memcg reclaim corner case.
> 
> > This problem exists independent of cgroup protection.
> > 
> > The inode shrinker may take down an inode that's still holding a ton
> > of (potentially active) page cache pages when the inode hasn't been
> > referenced recently.
> 
> Ok, please explain to me how are those pages getting repeated
> referenced and kept active without referencing the inode in some
> way?
> 
> e.g. active mmap pins a struct file which pins the inode.
> e.g. open fd pins a struct file which pins the inode.
> e.g. open/read/write/close keeps a dentry active in cache which pins
> the inode when not actively referenced by the open fd.
> 
> AFAIA, all of the cases where -file pages- are being actively
> referenced require also actively referencing the inode in some way.
> So why is the inode being reclaimed as an unreferenced inode at the
> end of the LRU if these are actively referenced file pages?
> 
> > IMO we shouldn't be dropping data that the VM still considers hot
> > compared to other data, just because the inode object hasn't been used
> > as recently as other inode objects (e.g. drowned in a stream of
> > one-off inode accesses).
> 
> It should not be drowned by one-off inode accesses because if
> the file data is being actively referenced then there should be
> frequent active references to the inode that contains the data and
> that should be keeping it away from the tail of the inode LRU.
> 
> If the inode is not being frequently referenced, then it
> isn't really part of the current working set of inodes, is it?

The inode doesn't have to be currently open for its data to be used
frequently and recently.

Executables that run periodically come to mind.

An sqlite file database that is periodically opened and queried, then
closed again.

A git repository.

I don't want a find or an updatedb, which doesn't produce active
pages, and could be funneled through the cache with otherwise no side
effects, kick out all my linux tree git objects via the inode shrinker
just because I haven't run a git command in a few minutes.

> > I've carried the below patch in my private tree for testing cache
> > aging decisions that the shrinker interfered with. (It would be nicer
> > if page cache pages could pin the inode of course, but reclaim cannot
> > easily participate in the inode refcounting scheme.)
> > 
> > Thoughts?
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index fef457a42882..bfcaaaf6314f 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -753,7 +753,13 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >  		return LRU_ROTATE;
> >  	}
> >  
> > -	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> > +	/* Leave the pages to page reclaim */
> > +	if (inode->i_data.nrpages) {
> > +		spin_unlock(&inode->i_lock);
> > +		return LRU_ROTATE;
> > +	}
> 
> <sigh>
> 
> Remember this?
> 
> commit 69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Tue Feb 12 15:35:51 2019 -0800
> 
>     Revert "mm: don't reclaim inodes with many attached pages"
>     
>     This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
>     attached pages").
>     
>     This change causes serious changes to page cache and inode cache
>     behaviour and balance, resulting in major performance regressions when
>     combining worklaods such as large file copies and kernel compiles.
>     
>       https://bugzilla.kernel.org/show_bug.cgi?id=202441

I don't remember this, but reading this bugzilla thread is immensely
frustrating.

We've been carrying this patch here in our tree for over half a decade
now to work around this exact stalling in the xfs shrinker:

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index d53a316162d6..45b3a4d07813 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1344,7 +1344,7 @@ xfs_reclaim_inodes_nr(
        xfs_reclaim_work_queue(mp);
        xfs_ail_push_all(mp->m_ail);

-       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
 }

Because if we don't, our warmstorage machines lock up within minutes,
long before Roman's patch.

The fact that xfs stalls on individual inodes while there might be a
ton of clean cache on the LRUs is an xfs problem, not a VM problem.

The right thing to do to avoid stalls in the inode shrinker is to skip
over the dirty inodes and yield back to LRU reclaim; not circumvent
page aging and drop clean inodes on the floor when those may or may
not hold gigabytes of cache data that the inode shrinker knows
*absolutely nothing* about.

This entire approach is backwards.

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-18  4:37         ` Johannes Weiner
@ 2019-12-18 10:16           ` Dave Chinner
  2019-12-18 21:38             ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2019-12-18 10:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yafang Shao, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel

On Tue, Dec 17, 2019 at 11:37:27PM -0500, Johannes Weiner wrote:
> On Wed, Dec 18, 2019 at 12:51:24PM +1100, Dave Chinner wrote:
> > On Tue, Dec 17, 2019 at 11:54:22AM -0500, Johannes Weiner wrote:
> > > This problem exists independent of cgroup protection.
> > > 
> > > The inode shrinker may take down an inode that's still holding a ton
> > > of (potentially active) page cache pages when the inode hasn't been
> > > referenced recently.
> > 
> > Ok, please explain to me how are those pages getting repeated
> > referenced and kept active without referencing the inode in some
> > way?
> > 
> > e.g. active mmap pins a struct file which pins the inode.
> > e.g. open fd pins a struct file which pins the inode.
> > e.g. open/read/write/close keeps a dentry active in cache which pins
> > the inode when not actively referenced by the open fd.
> > 
> > AFAIA, all of the cases where -file pages- are being actively
> > referenced require also actively referencing the inode in some way.
> > So why is the inode being reclaimed as an unreferenced inode at the
> > end of the LRU if these are actively referenced file pages?
> > 
> > > IMO we shouldn't be dropping data that the VM still considers hot
> > > compared to other data, just because the inode object hasn't been used
> > > as recently as other inode objects (e.g. drowned in a stream of
> > > one-off inode accesses).
> > 
> > It should not be drowned by one-off inode accesses because if
> > the file data is being actively referenced then there should be
> > frequent active references to the inode that contains the data and
> > that should be keeping it away from the tail of the inode LRU.
> > 
> > If the inode is not being frequently referenced, then it
> > isn't really part of the current working set of inodes, is it?
> 
> The inode doesn't have to be currently open for its data to be used
> frequently and recently.

No, it doesn't have to be "open", but it has to be referenced if
pages are being added to or accessed from it's mapping tree.

e.g. you can do open/mmap/close, and the vma backing the mmap region
holds a reference to the inode via vma->vm_file until munmap is
called and the vma is torn down.

So:

> Executables that run periodically come to mind.

this requires mmap, hence an active inode reference, and so when the
vma is torn down, the inode is moved to the head of the inode cache
LRU. IF we keep running that same executable, the inode will be
repeatedly relocated to the head of the LRU every time the process
running the executable exits.

> An sqlite file database that is periodically opened and queried, then
> closed again.

dentry pins inode on open, struct file pins inpde until close,
dentry reference pins inode until shrinker reclaims dentry. Inode
goes on head of LRU when dentry is reclaimed. Repeated cycles will
hit either the dentry cache or if that's been reclaimed the inode
cache will get hit.

> A git repository.

same as sqlite case, just with many files.

IOWs, all of these data references take an active reference to the
inode and reset it's position in the inode cache LRU when the last
reference is dropped. If it's a dentry, it may not get dropped until
memory presure relaims the dentry. Hence inode cache LRU order does
not reflect the file data page LRU order in any way.

But my question still stands: how do you get page LRU references
without inode references? And if you can't, why should having cached
pages on the oldest unused, unreferenced inode in the LRU prevent
it's reclaim?

> I don't want a find or an updatedb, which doesn't produce active
> pages, and could be funneled through the cache with otherwise no side
> effects, kick out all my linux tree git objects via the inode shrinker
> just because I haven't run a git command in a few minutes.

That has nothing to do with this patch. updatedb and any file
traversal that touches data are going to be treated identically to
you precious working set because they all have nr_pages > 0.

IOWs, this patch does nothing to avoid the problem of single use
inodes streaming through the inode cache causing the reclaim of all
inodes. It just changes the reclaim behaviour and how quickly single
use inodes can be reclaimed. i.e. we now can't reclaim single use
inodes when they reach the end of the LRU, we have to wait for page
cache reclaim to free it's pages before the inode can be reclaimed.

Further, because inode LRU order is going to be different to page LRU
order, there's going to be a lot more useless scanning trying to
find inodes that can be reclaimed. Hence this changes cache balance,
reduces reclaim efficiency, increases reclaim priority windup as
less gets freed per scan, and this all ends up causing performance
and behavioural regressions in unexpected places.

i.e. this makes the page cache pin the inode in memory and that's a
major change in bheaviour. that's what caused all the performance
regressions with workloads that traverse a large single-use file set
such as a kernel compile - most files and their data are accessed
just once, and when they get to the end of the inode LRU we really
want to reclaim them immediately as they'll never get accessed
again.

To put it simply, if your goal is to avoid single use inodes from
trashing a long term working set of cached inodes, then this
patch does not provide the reliable or predictable object
management algorithm you are looking for.

If you want to address use-once cache trashing, how about working
towards a *smarter LRU algorithm* for the list_lru infrastructure?
Don't hack naive heuristics that "work for me" into the code, go
back to the algorithm and select something that is provent to
be resilient against use-once object storms.

i.e. The requirement is we retain quasi-LRU behaviour, but
allow use-once objects to cycle through the LRU without disturbing
frequently/recently referenced/active objects.  The
per-object reference bit we currently use isn't resilient against
large-scale use-once object cycling, so we have to improve on that.

Experience tells me we've solved this problem before, and it's right
in your area or expertise, too. We could modify the list-lru to use
a different LRU algorithm that is resilient against the sort of
flooding you are worried about. We could simply use a double clock
list like the page LRU uses - we promote frequently referenced
inodes to the active list when instead of setting a reference bit
when a reference is dropped and the indoe is on the inactive list.
And a small part of each shrinker scan count can be used to demote
the tail of the active list to keep it slowly cycling. This way
single use inodes will only ever pass through the inactive list
without perturbing the active list, and we've solved the problem of
single use inode streams trashing the working cache for all use
cases, not just one special case....

> > <sigh>
> > 
> > Remember this?
> > 
> > commit 69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Tue Feb 12 15:35:51 2019 -0800
> > 
> >     Revert "mm: don't reclaim inodes with many attached pages"
> >     
> >     This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
> >     attached pages").
> >     
> >     This change causes serious changes to page cache and inode cache
> >     behaviour and balance, resulting in major performance regressions when
> >     combining worklaods such as large file copies and kernel compiles.
> >     
> >       https://bugzilla.kernel.org/show_bug.cgi?id=202441
> 
> I don't remember this, but reading this bugzilla thread is immensely
> frustrating.

So you're shooting the messenger as well, eh?

We went through this whole "blame XFS" circus sideshow when I found
the commits that caused the regression. It went on right up until
people using ext4 started reporting similar problems.

Yes, XFS users were the first to notice the issue, but that does
not make it an XFS problem!

> We've been carrying this patch here in our tree for over half a decade
> now to work around this exact stalling in the xfs shrinker:
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index d53a316162d6..45b3a4d07813 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1344,7 +1344,7 @@ xfs_reclaim_inodes_nr(
>         xfs_reclaim_work_queue(mp);
>         xfs_ail_push_all(mp->m_ail);
> 
> -       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> +       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
>  }
> 
> Because if we don't, our warmstorage machines lock up within minutes,
> long before Roman's patch.

Oh, go cry me a river. Poor little FB, has to carry an out-of-tree
hack that "works for them" because they don't care enough about
fixing it to help upstream address the underlying memory reclaim 
problems that SYNC_WAIT flag avoids.

Indeed, we (XFS devs) have repeatedly provided evidence that this
patch makes it relatively trivial for users to DOS systems via
OOM-killer rampages. It does not survive my trivial "fill memory
with inodes" test without the oom-killer killing the machine, and
any workload that empties the page cache before the inode cache is
prone to oom kill because nothing throttles reclaim anymore and
there are no pages left to reclaim or swap.

It is manifestly worse than what we have now, and that means it is
not a candidate for merging. We've told FB engineers this
*repeatedly*, and yet this horrible, broken, nasty, expedient hack
gets raised every time "shrinker" and "XFS" are mentioned in the
same neighbourhood.  Just stop it, please.

> The fact that xfs stalls on individual inodes while there might be a
> ton of clean cache on the LRUs is an xfs problem, not a VM problem.

No, at it's core it is a VM problem, because if we don't stall on
inode reclaim in XFS then memory reclaim does far worse things to
your machine than incur an occasional long tail latency.

You're free to use some other filesystem if you can't wait for
upstream XFS developers to fix it properly or you can't be bothered
to review the patches that actually attempt to fix the problem
properly...

> The right thing to do to avoid stalls in the inode shrinker is to skip
> over the dirty inodes and yield back to LRU reclaim; not circumvent
> page aging and drop clean inodes on the floor when those may or may
> not hold gigabytes of cache data that the inode shrinker knows
> *absolutely nothing* about.

*cough* [*]

https://lore.kernel.org/linux-mm/20191031234618.15403-1-david@fromorbit.com/

This implements exactly what you suggest - shrinkers that can
communicate the need for backoffs to the core infrastructure and
work deferral to kswapd rather than doing it themselves. And it uses
that capability to implement non-blocking inode reclaim for XFS.

So, how about doing something useful like reviewing the code that
tries to solve the problem you are whining about in the way you
say you want it solved?

I'd appreciate feedback on the shrinker algorithm factoring changes,
the scan algorithm changes,how I'm deferring work to kswapd, how I'm
triggering backoffs in the main vmscan loops differently for direct
reclaim vs kswapd, etc.

I'd also appreciate it if mm developers started working on fixing
the borken IO-based congestion back-off infrastructure (broken by
blk-mq) that makes it just about impossible to make core reclaim
backoffs work reliably or scale sufficiently to prevent
excessive/unbalanced reclaim occurring.

We also need better page vs shrinker reclaim balancing mechanisms to
allow the reclaim code throttle harder when there's a major page vs
slab cache imbalance. Right now it ends up swap-storming trying do
page reclaim when there's no page cache left and millions of clean
inodes to reclaim. (That's something that blocking on dirty inode
writeback avoided).

We also need mechanisms for detecting and preventing premature
priority windup (oom kill vector) that occurs when lots of direct
reclaim is run under GFP_NOFS context and shrinkers cannot make
reclaim progress and there is no page cache left to reclaim and
kswapd is blocked on swap IO...

There was also a bunch of broken swap vs block layer throttle
interactions as well that I've mentioned in the cover letters of
the initial patch sets that haven't been addressed, either...

All this requires core memory reclaim expertise, and that's
somethign I don't have. I can make the XFS inode cache shrinker
behave correctly, but I don't have the knowledge, expertise or
patience to fix the broken, horrible, spagetti-heuristic vmscan.c
code. So if you want this problem fixed, there's some work for you
to do....

-Dave.

[*] I've been saying this for *years*, ever since I first started
working on the shrinker scalability problem (~v3.0 timeframe, long
before FB ever tripped over it). But back then, nobody on the mm
side beleived that shrinkers needed to be as tightly integrated into
the memory reclaim scan loops as pages. There was a major disconnect
and lack of understanding that shrinkers, like page reclaim, need to
deal with throttling/back-off, IO, working set management, dirty
objects, numa scalabilty, etc.  Part of the problem was attitude -
"Oh, it's an XFS shrinker problem, ext4 doesn't need this, so we
don't need to care about that in core code...".

I'm glad that, after all these years, mm developers are finally
starting to realise that shrinker reclaim requirements are no
different to page reclaim requirements. Maybe it's time for me to
suggest, once again, that page LRU reclaim should just be another
set of shrinker instances and that all memory reclaim should be run
by a self-balancing shrinker instance scan loop, not just caches for
subsystems outside mm/.... 
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-18 10:16           ` Dave Chinner
@ 2019-12-18 21:38             ` Johannes Weiner
  2019-12-19  2:04               ` Yafang Shao
  2020-01-10  2:08               ` Dave Chinner
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Weiner @ 2019-12-18 21:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Yafang Shao, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel

On Wed, Dec 18, 2019 at 09:16:26PM +1100, Dave Chinner wrote:
> On Tue, Dec 17, 2019 at 11:37:27PM -0500, Johannes Weiner wrote:
> > On Wed, Dec 18, 2019 at 12:51:24PM +1100, Dave Chinner wrote:
> > > On Tue, Dec 17, 2019 at 11:54:22AM -0500, Johannes Weiner wrote:
> > > > This problem exists independent of cgroup protection.
> > > > 
> > > > The inode shrinker may take down an inode that's still holding a ton
> > > > of (potentially active) page cache pages when the inode hasn't been
> > > > referenced recently.
> > > 
> > > Ok, please explain to me how are those pages getting repeated
> > > referenced and kept active without referencing the inode in some
> > > way?
> > > 
> > > e.g. active mmap pins a struct file which pins the inode.
> > > e.g. open fd pins a struct file which pins the inode.
> > > e.g. open/read/write/close keeps a dentry active in cache which pins
> > > the inode when not actively referenced by the open fd.
> > > 
> > > AFAIA, all of the cases where -file pages- are being actively
> > > referenced require also actively referencing the inode in some way.
> > > So why is the inode being reclaimed as an unreferenced inode at the
> > > end of the LRU if these are actively referenced file pages?
> > > 
> > > > IMO we shouldn't be dropping data that the VM still considers hot
> > > > compared to other data, just because the inode object hasn't been used
> > > > as recently as other inode objects (e.g. drowned in a stream of
> > > > one-off inode accesses).
> > > 
> > > It should not be drowned by one-off inode accesses because if
> > > the file data is being actively referenced then there should be
> > > frequent active references to the inode that contains the data and
> > > that should be keeping it away from the tail of the inode LRU.
> > > 
> > > If the inode is not being frequently referenced, then it
> > > isn't really part of the current working set of inodes, is it?
> > 
> > The inode doesn't have to be currently open for its data to be used
> > frequently and recently.
> 
> No, it doesn't have to be "open", but it has to be referenced if
> pages are being added to or accessed from it's mapping tree.
> 
> e.g. you can do open/mmap/close, and the vma backing the mmap region
> holds a reference to the inode via vma->vm_file until munmap is
> called and the vma is torn down.
> 
> So:
> 
> > Executables that run periodically come to mind.
> 
> this requires mmap, hence an active inode reference, and so when the
> vma is torn down, the inode is moved to the head of the inode cache
> LRU. IF we keep running that same executable, the inode will be
> repeatedly relocated to the head of the LRU every time the process
> running the executable exits.
> 
> > An sqlite file database that is periodically opened and queried, then
> > closed again.
> 
> dentry pins inode on open, struct file pins inpde until close,
> dentry reference pins inode until shrinker reclaims dentry. Inode
> goes on head of LRU when dentry is reclaimed. Repeated cycles will
> hit either the dentry cache or if that's been reclaimed the inode
> cache will get hit.
> 
> > A git repository.
> 
> same as sqlite case, just with many files.
> 
> IOWs, all of these data references take an active reference to the
> inode and reset it's position in the inode cache LRU when the last
> reference is dropped. If it's a dentry, it may not get dropped until
> memory presure relaims the dentry. Hence inode cache LRU order does
> not reflect the file data page LRU order in any way.
> 
> But my question still stands: how do you get page LRU references
> without inode references? And if you can't, why should having cached
> pages on the oldest unused, unreferenced inode in the LRU prevent
> it's reclaim?

One of us is missing something really obvious here.

Let's say I'm routinely working with a git tree and the objects are
cached by active pages. I'm using a modified mincore() that reports
page active state, so the output here is active/present/filesize:

[hannes@computer linux]$ ~/src/mincore .git/objects/pack/*
17/17/17 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.idx
97/97/1168 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.pack
21/21/21 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.idx
69/69/1487 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.pack
223/223/243 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.idx
261/261/25012 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.pack
48/48/66 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.idx
0/0/8306 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.pack
40/40/40 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.idx
16/16/5020 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.pack
28/28/29 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.idx
4/4/3755 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.pack
46/46/46 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.idx
166/166/2689 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.pack
12/12/12 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.idx
42/42/1083 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.pack
38/38/38 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.idx
129/129/2652 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.pack
8/8/8 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.idx
4/4/743 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.pack
62/62/63 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.idx
96/96/7023 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.pack
129/129/130 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.idx
333/333/5060 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.pack
6487/6487/7557 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.idx
12260/12260/285090 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.pack
603/603/683 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.idx
1450/1450/23000 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.pack
657/657/757 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.idx
1658/1658/21055 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.pack
46037/46037/53690 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.idx
105772/105772/367275 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.pack

Now something like updatedb, a find or comparable goes off and in a
short amount of time creates a ton of one-off dentries, inodes, and
file cache:

$ find /usr -type f -exec grep -q dave {} \;

LRU reclaim recognizes that the file cache produced by this operation
is not used repeatedly and lets an infinite amount of it pass through
the inactive list without disturbing my git tree workingset.

The inode/dentry reclaim doesn't do the same thing. It looks at the
references and delays the inevitable for a few more items coming
through the LRU, but eventually it lets a bunch of objects that are
only used once push out data that has been used over and over right
before this burst of metadata objects came along.

The VM goes through a ridiculous effort to implement scan resistance:
we split the LRUs into inactive/active lists, we track non-resident
cache information to tell stable states from transitions and carefully
balance the lists agains each other. All in an effort to protect
established workingsets that have proven to benefit from caching from
bursts of one-off entries that do not.

Thousands of lines of complexity, years of labor, to make this work.

And then the inode shrinker just goes and drops it all on the floor:

[hannes@computer linux]$ ~/src/mincore .git/objects/pack/*
0/0/17 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.idx
0/0/1168 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.pack
0/0/21 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.idx
0/0/1487 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.pack
0/0/243 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.idx
0/0/25012 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.pack
0/0/66 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.idx
0/0/8306 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.pack
0/0/40 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.idx
0/0/5020 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.pack
0/0/29 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.idx
0/0/3755 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.pack
0/0/46 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.idx
0/0/2689 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.pack
0/0/12 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.idx
0/0/1083 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.pack
0/0/38 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.idx
0/0/2652 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.pack
0/0/8 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.idx
0/0/743 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.pack
0/0/63 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.idx
0/0/7023 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.pack
0/0/130 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.idx
0/0/5060 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.pack
0/0/7557 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.idx
0/0/285090 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.pack
0/0/683 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.idx
0/0/23000 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.pack
0/0/757 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.idx
0/0/21055 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.pack
0/0/53690 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.idx
0/0/367275 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.pack

This isn't a theoretical issue. The reason people keep coming up with
the same patch is because they hit exactly this problem in real life.

> > I don't want a find or an updatedb, which doesn't produce active
> > pages, and could be funneled through the cache with otherwise no side
> > effects, kick out all my linux tree git objects via the inode shrinker
> > just because I haven't run a git command in a few minutes.
> 
> That has nothing to do with this patch. updatedb and any file
> traversal that touches data are going to be treated identically to
> you precious working set because they all have nr_pages > 0.
>
> IOWs, this patch does nothing to avoid the problem of single use
> inodes streaming through the inode cache causing the reclaim of all
> inodes. It just changes the reclaim behaviour and how quickly single
> use inodes can be reclaimed. i.e. we now can't reclaim single use
> inodes when they reach the end of the LRU, we have to wait for page
> cache reclaim to free it's pages before the inode can be reclaimed.

Of course it does. LRU reclaim will clean out the single-use pages,
after which those inodes will have !nr_pages and can be reclaimed.

> Further, because inode LRU order is going to be different to page LRU
> order, there's going to be a lot more useless scanning trying to
> find inodes that can be reclaimed. Hence this changes cache balance,
> reduces reclaim efficiency, increases reclaim priority windup as
> less gets freed per scan, and this all ends up causing performance
> and behavioural regressions in unexpected places.

It would be better to keep the inodes off the LRU entirely as long as
they are not considered for reclaim. That would save some CPU churn.

> i.e. this makes the page cache pin the inode in memory and that's a
> major change in bheaviour. that's what caused all the performance
> regressions with workloads that traverse a large single-use file set
> such as a kernel compile - most files and their data are accessed
> just once, and when they get to the end of the inode LRU we really
> want to reclaim them immediately as they'll never get accessed
> again.
> 
> To put it simply, if your goal is to avoid single use inodes from
> trashing a long term working set of cached inodes, then this
> patch does not provide the reliable or predictable object
> management algorithm you are looking for.
>
> If you want to address use-once cache trashing, how about working
> towards a *smarter LRU algorithm* for the list_lru infrastructure?
> Don't hack naive heuristics that "work for me" into the code, go
> back to the algorithm and select something that is provent to
> be resilient against use-once object storms.
> 
> i.e. The requirement is we retain quasi-LRU behaviour, but
> allow use-once objects to cycle through the LRU without disturbing
> frequently/recently referenced/active objects.  The
> per-object reference bit we currently use isn't resilient against
> large-scale use-once object cycling, so we have to improve on that.
> 
> Experience tells me we've solved this problem before, and it's right
> in your area or expertise, too. We could modify the list-lru to use
> a different LRU algorithm that is resilient against the sort of
> flooding you are worried about. We could simply use a double clock
> list like the page LRU uses - we promote frequently referenced
> inodes to the active list when instead of setting a reference bit
> when a reference is dropped and the indoe is on the inactive list.
> And a small part of each shrinker scan count can be used to demote
> the tail of the active list to keep it slowly cycling. This way
> single use inodes will only ever pass through the inactive list
> without perturbing the active list, and we've solved the problem of
> single use inode streams trashing the working cache for all use
> cases, not just one special case....

I'm not opposed to any of this work, but I don't see how it would be a
prerequisite to fixing the aging inversion we're talking about here -
throwing out "unused" containers without looking at what's inside.

On the contrary, the inode scanner would already make better decisions
by simply not discarding all the usage information painstakingly
gathered by the VM.

We can talk about the implementation, of course. Repeatedly skipping
over inodes rather than physically taking them off the list can be a
scalability problem; pushing the shrinker into dirty inodes can be a
problem for certain filesystems. I didn't submit a patch for
upstreaming, I sent a diff hunk to propose an aging hierarchy.

If you agree with my concern about aging decisions here, but think
it's the best we can do given our constraints, we can talk about this
too - but we should at least document the hack currently in place.

If you disagree that the reclaim layering here is fundamentally
problematic, I'm not sure I need to move on with this discussion.

> > > commit 69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
> > > Author: Dave Chinner <dchinner@redhat.com>
> > > Date:   Tue Feb 12 15:35:51 2019 -0800
> > > 
> > >     Revert "mm: don't reclaim inodes with many attached pages"
> > >     
> > >     This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
> > >     attached pages").
> > >     
> > >     This change causes serious changes to page cache and inode cache
> > >     behaviour and balance, resulting in major performance regressions when
> > >     combining worklaods such as large file copies and kernel compiles.
> > >     
> > >       https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > 
> > I don't remember this, but reading this bugzilla thread is immensely
> > frustrating.
> 
> So you're shooting the messenger as well, eh?
> 
> We went through this whole "blame XFS" circus sideshow when I found
> the commits that caused the regression. It went on right up until
> people using ext4 started reporting similar problems.
> 
> Yes, XFS users were the first to notice the issue, but that does
> not make it an XFS problem!

I cannot find details on the other filesystems in the bug report or
the changelog. Where was the time going? Was it the CPU churn of
skipping over the inodes?

> > We've been carrying this patch here in our tree for over half a decade
> > now to work around this exact stalling in the xfs shrinker:
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index d53a316162d6..45b3a4d07813 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -1344,7 +1344,7 @@ xfs_reclaim_inodes_nr(
> >         xfs_reclaim_work_queue(mp);
> >         xfs_ail_push_all(mp->m_ail);
> > 
> > -       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> > +       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> >  }
> > 
> > Because if we don't, our warmstorage machines lock up within minutes,
> > long before Roman's patch.
> 
> Oh, go cry me a river. Poor little FB, has to carry an out-of-tree
> hack that "works for them" because they don't care enough about
> fixing it to help upstream address the underlying memory reclaim 
> problems that SYNC_WAIT flag avoids.
> 
> Indeed, we (XFS devs) have repeatedly provided evidence that this
> patch makes it relatively trivial for users to DOS systems via
> OOM-killer rampages. It does not survive my trivial "fill memory
> with inodes" test without the oom-killer killing the machine, and
> any workload that empties the page cache before the inode cache is
> prone to oom kill because nothing throttles reclaim anymore and
> there are no pages left to reclaim or swap.
>
> It is manifestly worse than what we have now, and that means it is
> not a candidate for merging. We've told FB engineers this
> *repeatedly*, and yet this horrible, broken, nasty, expedient hack
> gets raised every time "shrinker" and "XFS" are mentioned in the
> same neighbourhood.  Just stop it, please.

You don't need to be privileged to cause OOM kills in a myriad of ways
if you tried to.

You don't need to run a malicious workload to have the xfs shrinker
stall out reclaimers in the presence of gigabytes of clean, easy to
reclaim cache.

We fundamentally disagree on what the horrible, broken, nasty,
expedient hack is.

> > The fact that xfs stalls on individual inodes while there might be a
> > ton of clean cache on the LRUs is an xfs problem, not a VM problem.
> 
> No, at it's core it is a VM problem, because if we don't stall on
> inode reclaim in XFS then memory reclaim does far worse things to
> your machine than incur an occasional long tail latency.
> 
> You're free to use some other filesystem if you can't wait for
> upstream XFS developers to fix it properly or you can't be bothered
> to review the patches that actually attempt to fix the problem
> properly...

I'm not worried about xfs. I'm worried about these design decisions
bleeding into other parts of reclaim.

> > The right thing to do to avoid stalls in the inode shrinker is to skip
> > over the dirty inodes and yield back to LRU reclaim; not circumvent
> > page aging and drop clean inodes on the floor when those may or may
> > not hold gigabytes of cache data that the inode shrinker knows
> > *absolutely nothing* about.
> 
> *cough* [*]
> 
> https://lore.kernel.org/linux-mm/20191031234618.15403-1-david@fromorbit.com/
> 
> This implements exactly what you suggest - shrinkers that can
> communicate the need for backoffs to the core infrastructure and
> work deferral to kswapd rather than doing it themselves. And it uses
> that capability to implement non-blocking inode reclaim for XFS.

Does that series end in the shrinkers leaving page reclaim to the page
LRU order?

I'm asking facetiously. Don't get me wrong, I'm interested in what
your patchset promises to implement. However, I'm extremely reluctant
to dive into a series of 28 patches if this is how the discussions go.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-18 21:38             ` Johannes Weiner
@ 2019-12-19  2:04               ` Yafang Shao
  2020-01-10  2:08               ` Dave Chinner
  1 sibling, 0 replies; 23+ messages in thread
From: Yafang Shao @ 2019-12-19  2:04 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel

On Thu, Dec 19, 2019 at 5:38 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Dec 18, 2019 at 09:16:26PM +1100, Dave Chinner wrote:
> > On Tue, Dec 17, 2019 at 11:37:27PM -0500, Johannes Weiner wrote:
> > > On Wed, Dec 18, 2019 at 12:51:24PM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 17, 2019 at 11:54:22AM -0500, Johannes Weiner wrote:
> > > > > This problem exists independent of cgroup protection.
> > > > >
> > > > > The inode shrinker may take down an inode that's still holding a ton
> > > > > of (potentially active) page cache pages when the inode hasn't been
> > > > > referenced recently.
> > > >
> > > > Ok, please explain to me how are those pages getting repeated
> > > > referenced and kept active without referencing the inode in some
> > > > way?
> > > >
> > > > e.g. active mmap pins a struct file which pins the inode.
> > > > e.g. open fd pins a struct file which pins the inode.
> > > > e.g. open/read/write/close keeps a dentry active in cache which pins
> > > > the inode when not actively referenced by the open fd.
> > > >
> > > > AFAIA, all of the cases where -file pages- are being actively
> > > > referenced require also actively referencing the inode in some way.
> > > > So why is the inode being reclaimed as an unreferenced inode at the
> > > > end of the LRU if these are actively referenced file pages?
> > > >
> > > > > IMO we shouldn't be dropping data that the VM still considers hot
> > > > > compared to other data, just because the inode object hasn't been used
> > > > > as recently as other inode objects (e.g. drowned in a stream of
> > > > > one-off inode accesses).
> > > >
> > > > It should not be drowned by one-off inode accesses because if
> > > > the file data is being actively referenced then there should be
> > > > frequent active references to the inode that contains the data and
> > > > that should be keeping it away from the tail of the inode LRU.
> > > >
> > > > If the inode is not being frequently referenced, then it
> > > > isn't really part of the current working set of inodes, is it?
> > >
> > > The inode doesn't have to be currently open for its data to be used
> > > frequently and recently.
> >
> > No, it doesn't have to be "open", but it has to be referenced if
> > pages are being added to or accessed from it's mapping tree.
> >
> > e.g. you can do open/mmap/close, and the vma backing the mmap region
> > holds a reference to the inode via vma->vm_file until munmap is
> > called and the vma is torn down.
> >
> > So:
> >
> > > Executables that run periodically come to mind.
> >
> > this requires mmap, hence an active inode reference, and so when the
> > vma is torn down, the inode is moved to the head of the inode cache
> > LRU. IF we keep running that same executable, the inode will be
> > repeatedly relocated to the head of the LRU every time the process
> > running the executable exits.
> >
> > > An sqlite file database that is periodically opened and queried, then
> > > closed again.
> >
> > dentry pins inode on open, struct file pins inpde until close,
> > dentry reference pins inode until shrinker reclaims dentry. Inode
> > goes on head of LRU when dentry is reclaimed. Repeated cycles will
> > hit either the dentry cache or if that's been reclaimed the inode
> > cache will get hit.
> >
> > > A git repository.
> >
> > same as sqlite case, just with many files.
> >
> > IOWs, all of these data references take an active reference to the
> > inode and reset it's position in the inode cache LRU when the last
> > reference is dropped. If it's a dentry, it may not get dropped until
> > memory presure relaims the dentry. Hence inode cache LRU order does
> > not reflect the file data page LRU order in any way.
> >
> > But my question still stands: how do you get page LRU references
> > without inode references? And if you can't, why should having cached
> > pages on the oldest unused, unreferenced inode in the LRU prevent
> > it's reclaim?
>
> One of us is missing something really obvious here.
>
> Let's say I'm routinely working with a git tree and the objects are
> cached by active pages. I'm using a modified mincore() that reports
> page active state, so the output here is active/present/filesize:
>
> [hannes@computer linux]$ ~/src/mincore .git/objects/pack/*
> 17/17/17 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.idx
> 97/97/1168 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.pack
> 21/21/21 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.idx
> 69/69/1487 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.pack
> 223/223/243 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.idx
> 261/261/25012 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.pack
> 48/48/66 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.idx
> 0/0/8306 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.pack
> 40/40/40 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.idx
> 16/16/5020 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.pack
> 28/28/29 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.idx
> 4/4/3755 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.pack
> 46/46/46 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.idx
> 166/166/2689 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.pack
> 12/12/12 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.idx
> 42/42/1083 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.pack
> 38/38/38 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.idx
> 129/129/2652 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.pack
> 8/8/8 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.idx
> 4/4/743 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.pack
> 62/62/63 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.idx
> 96/96/7023 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.pack
> 129/129/130 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.idx
> 333/333/5060 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.pack
> 6487/6487/7557 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.idx
> 12260/12260/285090 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.pack
> 603/603/683 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.idx
> 1450/1450/23000 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.pack
> 657/657/757 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.idx
> 1658/1658/21055 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.pack
> 46037/46037/53690 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.idx
> 105772/105772/367275 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.pack
>
> Now something like updatedb, a find or comparable goes off and in a
> short amount of time creates a ton of one-off dentries, inodes, and
> file cache:
>
> $ find /usr -type f -exec grep -q dave {} \;
>
> LRU reclaim recognizes that the file cache produced by this operation
> is not used repeatedly and lets an infinite amount of it pass through
> the inactive list without disturbing my git tree workingset.
>
> The inode/dentry reclaim doesn't do the same thing. It looks at the
> references and delays the inevitable for a few more items coming
> through the LRU, but eventually it lets a bunch of objects that are
> only used once push out data that has been used over and over right
> before this burst of metadata objects came along.
>
> The VM goes through a ridiculous effort to implement scan resistance:
> we split the LRUs into inactive/active lists, we track non-resident
> cache information to tell stable states from transitions and carefully
> balance the lists agains each other. All in an effort to protect
> established workingsets that have proven to benefit from caching from
> bursts of one-off entries that do not.
>
> Thousands of lines of complexity, years of labor, to make this work.
>
> And then the inode shrinker just goes and drops it all on the floor:
>
> [hannes@computer linux]$ ~/src/mincore .git/objects/pack/*
> 0/0/17 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.idx
> 0/0/1168 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.pack
> 0/0/21 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.idx
> 0/0/1487 .git/objects/pack/pack-1d4bf264156bee8558b290123af0755292452520.pack
> 0/0/243 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.idx
> 0/0/25012 .git/objects/pack/pack-1f7fde0cd5444aca2bad22d9f1f782f7b5fc5b7c.pack
> 0/0/66 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.idx
> 0/0/8306 .git/objects/pack/pack-2d05108aa7d7542c3faff7b456bfa4c33aa49ddb.pack
> 0/0/40 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.idx
> 0/0/5020 .git/objects/pack/pack-4430a9ced8123449669b25879f7d4cd3f23c2df7.pack
> 0/0/29 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.idx
> 0/0/3755 .git/objects/pack/pack-4d783e29b97258d679490f899be09d0a7fc73cf4.pack
> 0/0/46 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.idx
> 0/0/2689 .git/objects/pack/pack-5d66c70e90371495b5f1a35770e3c092347a2362.pack
> 0/0/12 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.idx
> 0/0/1083 .git/objects/pack/pack-5e2d63c26589c42286cd7f15d3b076f1a0a2e895.pack
> 0/0/38 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.idx
> 0/0/2652 .git/objects/pack/pack-6f7a49bdbcfd2ea4b64d57458a4f04df518a55eb.pack
> 0/0/8 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.idx
> 0/0/743 .git/objects/pack/pack-7053184528af47c7edacccbdbc2de25e627ea8e3.pack
> 0/0/63 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.idx
> 0/0/7023 .git/objects/pack/pack-7463fe2f036d011a79a31bacb9da58455982ee4b.pack
> 0/0/130 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.idx
> 0/0/5060 .git/objects/pack/pack-7644e9848940f15642b4efebb8e4ccdcb9b2024e.pack
> 0/0/7557 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.idx
> 0/0/285090 .git/objects/pack/pack-8347268f4d6fa0f763c7d1690dcee8f933be253f.pack
> 0/0/683 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.idx
> 0/0/23000 .git/objects/pack/pack-c51831234bf615a2b47a49c31f10ae480fa482dd.pack
> 0/0/757 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.idx
> 0/0/21055 .git/objects/pack/pack-d793ea6b319c4d19eb281f5ca2e368c24e10d91a.pack
> 0/0/53690 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.idx
> 0/0/367275 .git/objects/pack/pack-ee31400e588e715113b665d7313d570553133d71.pack
>
> This isn't a theoretical issue. The reason people keep coming up with
> the same patch is because they hit exactly this problem in real life.
>

BTW, we can protect these page caches with memory.min after the issues
found by me is fixed
(and I'm working on it :-) ).

> > > I don't want a find or an updatedb, which doesn't produce active
> > > pages, and could be funneled through the cache with otherwise no side
> > > effects, kick out all my linux tree git objects via the inode shrinker
> > > just because I haven't run a git command in a few minutes.
> >
> > That has nothing to do with this patch. updatedb and any file
> > traversal that touches data are going to be treated identically to
> > you precious working set because they all have nr_pages > 0.
> >
> > IOWs, this patch does nothing to avoid the problem of single use
> > inodes streaming through the inode cache causing the reclaim of all
> > inodes. It just changes the reclaim behaviour and how quickly single
> > use inodes can be reclaimed. i.e. we now can't reclaim single use
> > inodes when they reach the end of the LRU, we have to wait for page
> > cache reclaim to free it's pages before the inode can be reclaimed.
>
> Of course it does. LRU reclaim will clean out the single-use pages,
> after which those inodes will have !nr_pages and can be reclaimed.
>
> > Further, because inode LRU order is going to be different to page LRU
> > order, there's going to be a lot more useless scanning trying to
> > find inodes that can be reclaimed. Hence this changes cache balance,
> > reduces reclaim efficiency, increases reclaim priority windup as
> > less gets freed per scan, and this all ends up causing performance
> > and behavioural regressions in unexpected places.
>
> It would be better to keep the inodes off the LRU entirely as long as
> they are not considered for reclaim. That would save some CPU churn.
>
> > i.e. this makes the page cache pin the inode in memory and that's a
> > major change in bheaviour. that's what caused all the performance
> > regressions with workloads that traverse a large single-use file set
> > such as a kernel compile - most files and their data are accessed
> > just once, and when they get to the end of the inode LRU we really
> > want to reclaim them immediately as they'll never get accessed
> > again.
> >
> > To put it simply, if your goal is to avoid single use inodes from
> > trashing a long term working set of cached inodes, then this
> > patch does not provide the reliable or predictable object
> > management algorithm you are looking for.
> >
> > If you want to address use-once cache trashing, how about working
> > towards a *smarter LRU algorithm* for the list_lru infrastructure?
> > Don't hack naive heuristics that "work for me" into the code, go
> > back to the algorithm and select something that is provent to
> > be resilient against use-once object storms.
> >
> > i.e. The requirement is we retain quasi-LRU behaviour, but
> > allow use-once objects to cycle through the LRU without disturbing
> > frequently/recently referenced/active objects.  The
> > per-object reference bit we currently use isn't resilient against
> > large-scale use-once object cycling, so we have to improve on that.
> >
> > Experience tells me we've solved this problem before, and it's right
> > in your area or expertise, too. We could modify the list-lru to use
> > a different LRU algorithm that is resilient against the sort of
> > flooding you are worried about. We could simply use a double clock
> > list like the page LRU uses - we promote frequently referenced
> > inodes to the active list when instead of setting a reference bit
> > when a reference is dropped and the indoe is on the inactive list.
> > And a small part of each shrinker scan count can be used to demote
> > the tail of the active list to keep it slowly cycling. This way
> > single use inodes will only ever pass through the inactive list
> > without perturbing the active list, and we've solved the problem of
> > single use inode streams trashing the working cache for all use
> > cases, not just one special case....
>
> I'm not opposed to any of this work, but I don't see how it would be a
> prerequisite to fixing the aging inversion we're talking about here -
> throwing out "unused" containers without looking at what's inside.
>
> On the contrary, the inode scanner would already make better decisions
> by simply not discarding all the usage information painstakingly
> gathered by the VM.
>
> We can talk about the implementation, of course. Repeatedly skipping
> over inodes rather than physically taking them off the list can be a
> scalability problem; pushing the shrinker into dirty inodes can be a
> problem for certain filesystems. I didn't submit a patch for
> upstreaming, I sent a diff hunk to propose an aging hierarchy.
>
> If you agree with my concern about aging decisions here, but think
> it's the best we can do given our constraints, we can talk about this
> too - but we should at least document the hack currently in place.
>
> If you disagree that the reclaim layering here is fundamentally
> problematic, I'm not sure I need to move on with this discussion.
>
> > > > commit 69056ee6a8a3d576ed31e38b3b14c70d6c74edcc
> > > > Author: Dave Chinner <dchinner@redhat.com>
> > > > Date:   Tue Feb 12 15:35:51 2019 -0800
> > > >
> > > >     Revert "mm: don't reclaim inodes with many attached pages"
> > > >
> > > >     This reverts commit a76cf1a474d7d ("mm: don't reclaim inodes with many
> > > >     attached pages").
> > > >
> > > >     This change causes serious changes to page cache and inode cache
> > > >     behaviour and balance, resulting in major performance regressions when
> > > >     combining worklaods such as large file copies and kernel compiles.
> > > >
> > > >       https://bugzilla.kernel.org/show_bug.cgi?id=202441
> > >
> > > I don't remember this, but reading this bugzilla thread is immensely
> > > frustrating.
> >
> > So you're shooting the messenger as well, eh?
> >
> > We went through this whole "blame XFS" circus sideshow when I found
> > the commits that caused the regression. It went on right up until
> > people using ext4 started reporting similar problems.
> >
> > Yes, XFS users were the first to notice the issue, but that does
> > not make it an XFS problem!
>
> I cannot find details on the other filesystems in the bug report or
> the changelog. Where was the time going? Was it the CPU churn of
> skipping over the inodes?
>
> > > We've been carrying this patch here in our tree for over half a decade
> > > now to work around this exact stalling in the xfs shrinker:
> > >
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index d53a316162d6..45b3a4d07813 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -1344,7 +1344,7 @@ xfs_reclaim_inodes_nr(
> > >         xfs_reclaim_work_queue(mp);
> > >         xfs_ail_push_all(mp->m_ail);
> > >
> > > -       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
> > > +       return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
> > >  }
> > >
> > > Because if we don't, our warmstorage machines lock up within minutes,
> > > long before Roman's patch.
> >
> > Oh, go cry me a river. Poor little FB, has to carry an out-of-tree
> > hack that "works for them" because they don't care enough about
> > fixing it to help upstream address the underlying memory reclaim
> > problems that SYNC_WAIT flag avoids.
> >
> > Indeed, we (XFS devs) have repeatedly provided evidence that this
> > patch makes it relatively trivial for users to DOS systems via
> > OOM-killer rampages. It does not survive my trivial "fill memory
> > with inodes" test without the oom-killer killing the machine, and
> > any workload that empties the page cache before the inode cache is
> > prone to oom kill because nothing throttles reclaim anymore and
> > there are no pages left to reclaim or swap.
> >
> > It is manifestly worse than what we have now, and that means it is
> > not a candidate for merging. We've told FB engineers this
> > *repeatedly*, and yet this horrible, broken, nasty, expedient hack
> > gets raised every time "shrinker" and "XFS" are mentioned in the
> > same neighbourhood.  Just stop it, please.
>
> You don't need to be privileged to cause OOM kills in a myriad of ways
> if you tried to.
>
> You don't need to run a malicious workload to have the xfs shrinker
> stall out reclaimers in the presence of gigabytes of clean, easy to
> reclaim cache.
>
> We fundamentally disagree on what the horrible, broken, nasty,
> expedient hack is.
>
> > > The fact that xfs stalls on individual inodes while there might be a
> > > ton of clean cache on the LRUs is an xfs problem, not a VM problem.
> >
> > No, at it's core it is a VM problem, because if we don't stall on
> > inode reclaim in XFS then memory reclaim does far worse things to
> > your machine than incur an occasional long tail latency.
> >
> > You're free to use some other filesystem if you can't wait for
> > upstream XFS developers to fix it properly or you can't be bothered
> > to review the patches that actually attempt to fix the problem
> > properly...
>
> I'm not worried about xfs. I'm worried about these design decisions
> bleeding into other parts of reclaim.
>
> > > The right thing to do to avoid stalls in the inode shrinker is to skip
> > > over the dirty inodes and yield back to LRU reclaim; not circumvent
> > > page aging and drop clean inodes on the floor when those may or may
> > > not hold gigabytes of cache data that the inode shrinker knows
> > > *absolutely nothing* about.
> >
> > *cough* [*]
> >
> > https://lore.kernel.org/linux-mm/20191031234618.15403-1-david@fromorbit.com/
> >
> > This implements exactly what you suggest - shrinkers that can
> > communicate the need for backoffs to the core infrastructure and
> > work deferral to kswapd rather than doing it themselves. And it uses
> > that capability to implement non-blocking inode reclaim for XFS.
>
> Does that series end in the shrinkers leaving page reclaim to the page
> LRU order?
>
> I'm asking facetiously. Don't get me wrong, I'm interested in what
> your patchset promises to implement. However, I'm extremely reluctant
> to dive into a series of 28 patches if this is how the discussions go.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-18 21:38             ` Johannes Weiner
  2019-12-19  2:04               ` Yafang Shao
@ 2020-01-10  2:08               ` Dave Chinner
  1 sibling, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2020-01-10  2:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yafang Shao, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Al Viro, Linux MM, linux-fsdevel

On Wed, Dec 18, 2019 at 04:38:32PM -0500, Johannes Weiner wrote:
> On Wed, Dec 18, 2019 at 09:16:26PM +1100, Dave Chinner wrote:
> > On Tue, Dec 17, 2019 at 11:37:27PM -0500, Johannes Weiner wrote:
> > > On Wed, Dec 18, 2019 at 12:51:24PM +1100, Dave Chinner wrote:
> > > > On Tue, Dec 17, 2019 at 11:54:22AM -0500, Johannes Weiner wrote:
> > > > > This problem exists independent of cgroup protection.
> > > > > 
> > > > > The inode shrinker may take down an inode that's still holding a ton
> > > > > of (potentially active) page cache pages when the inode hasn't been
> > > > > referenced recently.
> > > > 
> > > > Ok, please explain to me how are those pages getting repeated
> > > > referenced and kept active without referencing the inode in some
> > > > way?
> > > > 
> > > > e.g. active mmap pins a struct file which pins the inode.
> > > > e.g. open fd pins a struct file which pins the inode.
> > > > e.g. open/read/write/close keeps a dentry active in cache which pins
> > > > the inode when not actively referenced by the open fd.
> > > > 
> > > > AFAIA, all of the cases where -file pages- are being actively
> > > > referenced require also actively referencing the inode in some way.
> > > > So why is the inode being reclaimed as an unreferenced inode at the
> > > > end of the LRU if these are actively referenced file pages?
> > > > 
> > > > > IMO we shouldn't be dropping data that the VM still considers hot
> > > > > compared to other data, just because the inode object hasn't been used
> > > > > as recently as other inode objects (e.g. drowned in a stream of
> > > > > one-off inode accesses).
> > > > 
> > > > It should not be drowned by one-off inode accesses because if
> > > > the file data is being actively referenced then there should be
> > > > frequent active references to the inode that contains the data and
> > > > that should be keeping it away from the tail of the inode LRU.
> > > > 
> > > > If the inode is not being frequently referenced, then it
> > > > isn't really part of the current working set of inodes, is it?
> > > 
> > > The inode doesn't have to be currently open for its data to be used
> > > frequently and recently.
> > 
> > No, it doesn't have to be "open", but it has to be referenced if
> > pages are being added to or accessed from it's mapping tree.
> > 
> > e.g. you can do open/mmap/close, and the vma backing the mmap region
> > holds a reference to the inode via vma->vm_file until munmap is
> > called and the vma is torn down.
> > 
> > So:
> > 
> > > Executables that run periodically come to mind.
> > 
> > this requires mmap, hence an active inode reference, and so when the
> > vma is torn down, the inode is moved to the head of the inode cache
> > LRU. IF we keep running that same executable, the inode will be
> > repeatedly relocated to the head of the LRU every time the process
> > running the executable exits.
> > 
> > > An sqlite file database that is periodically opened and queried, then
> > > closed again.
> > 
> > dentry pins inode on open, struct file pins inpde until close,
> > dentry reference pins inode until shrinker reclaims dentry. Inode
> > goes on head of LRU when dentry is reclaimed. Repeated cycles will
> > hit either the dentry cache or if that's been reclaimed the inode
> > cache will get hit.
> > 
> > > A git repository.
> > 
> > same as sqlite case, just with many files.
> > 
> > IOWs, all of these data references take an active reference to the
> > inode and reset it's position in the inode cache LRU when the last
> > reference is dropped. If it's a dentry, it may not get dropped until
> > memory presure relaims the dentry. Hence inode cache LRU order does
> > not reflect the file data page LRU order in any way.
> > 
> > But my question still stands: how do you get page LRU references
> > without inode references? And if you can't, why should having cached
> > pages on the oldest unused, unreferenced inode in the LRU prevent
> > it's reclaim?
> 
> One of us is missing something really obvious here.
> 
> Let's say I'm routinely working with a git tree and the objects are
> cached by active pages. I'm using a modified mincore() that reports
> page active state, so the output here is active/present/filesize:
> 
> [hannes@computer linux]$ ~/src/mincore .git/objects/pack/*
> 17/17/17 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.idx
> 97/97/1168 .git/objects/pack/pack-1993efac574359d041b010c84d04eb0f05275bfd.pack

....

> Now something like updatedb, a find or comparable goes off and in a
> short amount of time creates a ton of one-off dentries, inodes, and
> file cache:
> 
> $ find /usr -type f -exec grep -q dave {} \;
> 
> LRU reclaim recognizes that the file cache produced by this operation
> is not used repeatedly and lets an infinite amount of it pass through
> the inactive list without disturbing my git tree workingset.

Yes, but even streaming single use pages through the cache slowly
turns over the active list. i.e. If you aren't referencing the git
tree working set, and that find takes long enough, it will still
turn over the active list and reclaim the git tree workingset.
"working set" only references cache that is being actively used -
you cannot expect unreferenced cached objects to be retained forever
under ongoing memory demand....

> The inode/dentry reclaim doesn't do the same thing. It looks at the
> references and delays the inevitable for a few more items coming
> through the LRU, but eventually it lets a bunch of objects that are
> only used once push out data that has been used over and over right
> before this burst of metadata objects came along.

Sure. The point I'm making is that this is the right behaviour for
at least as many workloads as it is the wrong behaviour, especially
on machines where that "single use" workload needs a lot of memory.
e.g. kernel compiles on machines with less memory than the build
requires greatly benefits from accelerated page cache pruning via
inode cache eviction.

> This isn't a theoretical issue. The reason people keep coming up with
> the same patch is because they hit exactly this problem in real life.

And they keep finding out that it causes performance regressions.

I'm not saying the current inode cache code is perfect - what I'm
saying is that just removing this heuristic causes unacceptible
performance regressions in common workloads, and so we *need a
different solution*.

> > Further, because inode LRU order is going to be different to page LRU
> > order, there's going to be a lot more useless scanning trying to
> > find inodes that can be reclaimed. Hence this changes cache balance,
> > reduces reclaim efficiency, increases reclaim priority windup as
> > less gets freed per scan, and this all ends up causing performance
> > and behavioural regressions in unexpected places.
> 
> It would be better to keep the inodes off the LRU entirely as long as
> they are not considered for reclaim. That would save some CPU churn.

I don't think there's any sane way to have the page cache pages on
an unreferenced inode keep it off the LRU. The whole point of the
LRU is that it tracks unreferenced inodes....

> > Experience tells me we've solved this problem before, and it's right
> > in your area or expertise, too. We could modify the list-lru to use
> > a different LRU algorithm that is resilient against the sort of
> > flooding you are worried about. We could simply use a double clock
> > list like the page LRU uses - we promote frequently referenced
> > inodes to the active list when instead of setting a reference bit
> > when a reference is dropped and the indoe is on the inactive list.
> > And a small part of each shrinker scan count can be used to demote
> > the tail of the active list to keep it slowly cycling. This way
> > single use inodes will only ever pass through the inactive list
> > without perturbing the active list, and we've solved the problem of
> > single use inode streams trashing the working cache for all use
> > cases, not just one special case....
> 
> I'm not opposed to any of this work, but I don't see how it would be a
> prerequisite to fixing the aging inversion we're talking about here -
> throwing out "unused" containers without looking at what's inside.
>
> On the contrary, the inode scanner would already make better decisions
> by simply not discarding all the usage information painstakingly
> gathered by the VM.

You appear to be starting from the position that "the VM is always
right", but the regressions that have been reported by changing this
code indicate otherwise. i.e. we know that page reclaim is not
perfect, and that despite the sophistication of the the workingset
retention code it is not onmipotent and so page reclaim still makes
wrong decisions. Similarly, inode reclaim can make wrong decisions.

From that perspective, simply replacing code that makes the wrong
decision in one common circumstance with code that makes a different
wrong decision in a different common cicrumstance is not an
improvement.  The solution needs to improve the situation without
regressions in common workloads, and if possible address the
fundamental issue that causes the problem.

> We can talk about the implementation, of course. Repeatedly skipping
> over inodes rather than physically taking them off the list can be a
> scalability problem; pushing the shrinker into dirty inodes can be a
> problem for certain filesystems. I didn't submit a patch for
> upstreaming, I sent a diff hunk to propose an aging hierarchy.

Sure, But history tells us that the proposed "aging hierarchy"
doesn't work for everyone.  It's not a viable solution, so can we
please move on from this specific idea?

AFAICT, all your problematic workloads are single use inodes causing
inode cache working set eviction. Making the inode cache LRU
resistant to single use cache eviction will solve these problems. It
will keep the existing "expediate page cache reclaim because the
current working set is larger than memory" behaviour of the inode
shrinker, but it will prevent this specific shrinker behaviour from
being invoked under workloads that stream single use inodes through
the cache.

Yes, it is more work, but architecturally it is the right way to
solve this problem because it will also likely improve the inode
cache working set retention (and hence performance) for a host of
other workloads, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 0/4] memcg, inode: protect page cache from freeing inode
  2019-12-17 16:54     ` Johannes Weiner
                         ` (2 preceding siblings ...)
  2019-12-18  1:51       ` Dave Chinner
@ 2019-12-18 17:27       ` Roman Gushchin
  3 siblings, 0 replies; 23+ messages in thread
From: Roman Gushchin @ 2019-12-18 17:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yafang Shao, Michal Hocko, Vladimir Davydov, Andrew Morton,
	Dave Chinner, Al Viro, Linux MM, linux-fsdevel

On Tue, Dec 17, 2019 at 11:54:22AM -0500, Johannes Weiner wrote:
> CCing Dave
> 
> On Tue, Dec 17, 2019 at 08:19:08PM +0800, Yafang Shao wrote:
> > On Tue, Dec 17, 2019 at 7:56 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > What do you mean by this exactly. Are those inodes reclaimed by the
> > > regular memory reclaim or by other means? Because shrink_node does
> > > exclude shrinking slab for protected memcgs.
> > 
> > By the regular memory reclaim, kswapd, direct reclaimer or memcg reclaimer.
> > IOW, the current->reclaim_state it set.
> > 
> > Take an example for you.
> > 
> > kswapd
> >     balance_pgdat
> >         shrink_node_memcgs
> >             switch (mem_cgroup_protected)  <<<< memory.current= 1024M
> > memory.min = 512M a file has 800M page caches
> >                 case MEMCG_PROT_NONE:  <<<< hard limit is not reached.
> >                       beak;
> >             shrink_lruvec
> >             shrink_slab <<< it may free the inode and the free all its
> > page caches (800M)
> 
> This problem exists independent of cgroup protection.
> 
> The inode shrinker may take down an inode that's still holding a ton
> of (potentially active) page cache pages when the inode hasn't been
> referenced recently.
> 
> IMO we shouldn't be dropping data that the VM still considers hot
> compared to other data, just because the inode object hasn't been used
> as recently as other inode objects (e.g. drowned in a stream of
> one-off inode accesses).
> 
> I've carried the below patch in my private tree for testing cache
> aging decisions that the shrinker interfered with. (It would be nicer
> if page cache pages could pin the inode of course, but reclaim cannot
> easily participate in the inode refcounting scheme.)
> 
> Thoughts?
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a42882..bfcaaaf6314f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -753,7 +753,13 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  		return LRU_ROTATE;
>  	}
>  
> -	if (inode_has_buffers(inode) || inode->i_data.nrpages) {
> +	/* Leave the pages to page reclaim */
> +	if (inode->i_data.nrpages) {
> +		spin_unlock(&inode->i_lock);
> +		return LRU_ROTATE;
> +	}
> +
> +	if (inode_has_buffers(inode)) {

JFYI: there was a very similar commit a76cf1a474d7 ("mm: don't reclaim inodes
with many attached pages"), which has been reverted because it created some
serious xfs regressions.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2020-01-10  2:08 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-17 11:29 [PATCH 0/4] memcg, inode: protect page cache from freeing inode Yafang Shao
2019-12-17 11:29 ` [PATCH 1/4] mm, memcg: reduce size of struct mem_cgroup by using bit field Yafang Shao
2019-12-17 11:29 ` [PATCH 2/4] mm, memcg: introduce MEMCG_PROT_SKIP for memcg zero usage case Yafang Shao
2019-12-17 11:29 ` [PATCH 3/4] mm, memcg: reset memcg's memory.{min, low} for reclaiming itself Yafang Shao
2019-12-17 14:20   ` Chris Down
2019-12-18  1:13     ` Yafang Shao
2019-12-17 11:29 ` [PATCH 4/4] memcg, inode: protect page cache from freeing inode Yafang Shao
2019-12-18  2:21   ` Dave Chinner
2019-12-18  2:33     ` Yafang Shao
2019-12-18 17:53   ` Roman Gushchin
2019-12-19  1:45     ` Yafang Shao
2019-12-17 11:56 ` [PATCH 0/4] " Michal Hocko
2019-12-17 12:19   ` Yafang Shao
2019-12-17 16:54     ` Johannes Weiner
2019-12-18  1:17       ` Yafang Shao
2019-12-18  1:37       ` Andrew Morton
2019-12-18  1:51       ` Dave Chinner
2019-12-18  4:37         ` Johannes Weiner
2019-12-18 10:16           ` Dave Chinner
2019-12-18 21:38             ` Johannes Weiner
2019-12-19  2:04               ` Yafang Shao
2020-01-10  2:08               ` Dave Chinner
2019-12-18 17:27       ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).