linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v6 0/3] Ignore non-LRU-based reclaim in memcg reclaim
@ 2023-04-13 10:40 Yosry Ahmed
  2023-04-13 10:40 ` [PATCH v6 1/3] mm: vmscan: ignore " Yosry Ahmed
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 10:40 UTC (permalink / raw)
  To: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Michal Hocko, Yu Zhao, Dave Chinner,
	Tim Chen
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm, Yosry Ahmed

Upon running some proactive reclaim tests using memory.reclaim, we
noticed some tests flaking where writing to memory.reclaim would be
successful even though we did not reclaim the requested amount fully.
Looking further into it, I discovered that *sometimes* we overestimate
the number of reclaimed pages in memcg reclaim.

Pages reclaimed through means other than LRU-based reclaim are tracked
through reclaim_state in struct scan_control, which is stashed in
current task_struct. These pages are added to the number of reclaimed
pages through LRUs. For memcg reclaim, these pages generally cannot be
linked to the memcg under reclaim and can cause an overestimated count
of reclaimed pages. This short series tries to address that.
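
To make the overestimate concrete, here is a minimal userspace sketch of
the arithmetic. It is not kernel code: the struct, the function names and
the 4096-byte page size are assumptions made for illustration only. It
compares what gets reported when a freed non-LRU page is counted in full
with what the target memcg was actually uncharged.

```c
#include <assert.h>

/* Assumed page size, for the illustration only. */
#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical sample of one memcg reclaim pass; not a kernel structure. */
struct memcg_reclaim_sample {
	unsigned long lru_pages;		/* pages reclaimed from the memcg's LRUs */
	unsigned long non_lru_pages;		/* slab/inode/buffer pages freed on the side */
	unsigned long non_lru_bytes_charged;	/* bytes of those pages charged to the target */
};

/* What is reported when non-LRU pages are added to the LRU count. */
static unsigned long reported_bytes(const struct memcg_reclaim_sample *s)
{
	return (s->lru_pages + s->non_lru_pages) * MODEL_PAGE_SIZE;
}

/* What the target memcg was actually uncharged. */
static unsigned long uncharged_bytes(const struct memcg_reclaim_sample *s)
{
	/* The charged share can never exceed the freed pages themselves. */
	assert(s->non_lru_bytes_charged <= s->non_lru_pages * MODEL_PAGE_SIZE);
	return s->lru_pages * MODEL_PAGE_SIZE + s->non_lru_bytes_charged;
}

/* Gap between the reported and the actually uncharged amount. */
static unsigned long overestimate_bytes(const struct memcg_reclaim_sample *s)
{
	return reported_bytes(s) - uncharged_bytes(s);
}
```

With 10 LRU pages and one freed slab page of which only 256 bytes were
charged to the target memcg, this model reports 45056 bytes while only
41216 bytes were uncharged, an overestimate of 3840 bytes.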

Patch 1 ignores pages reclaimed outside of LRU reclaim in memcg reclaim.
The pages are uncharged anyway, so even if we end up under-reporting
reclaimed pages we will still succeed in making progress during
charging.

Patches 2-3 are just refactoring. Patch 2 moves the
set_task_reclaim_state() helper next to flush_reclaim_state(). Patch 3
adds a helper that wraps updating current->reclaim_state, and renames
reclaim_state->reclaimed_slab to reclaim_state->reclaimed.

v5 -> v6:
- Re-arranged the patches:
  - Pulled flush_reclaim_state() helper with the clarifying comment to
    the first patch so that the patch is clear on its own (David
    Hildenbrand).
  - Separated moving set_reclaim_state() to a separate patch so that we
    can easily drop it if deemed unnecessary (Questioned by Peter Xu).
- Added a fixes tag (David Hildenbrand).
- Reworded comment in flush_reclaim_state() (David Hildenbrand and Tim
  Chen).
- Dropped reclaim_state argument to flush_reclaim_state() and use
  current->reclaim_state directly instead (Peter Xu).

v5: https://lore.kernel.org/linux-mm/20230405185427.1246289-1-yosryahmed@google.com/

Yosry Ahmed (3):
  mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
  mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state()
  mm: vmscan: refactor updating current->reclaim_state

 fs/inode.c           |  3 +-
 fs/xfs/xfs_buf.c     |  3 +-
 include/linux/swap.h | 17 ++++++++++-
 mm/slab.c            |  3 +-
 mm/slob.c            |  6 ++--
 mm/slub.c            |  5 ++-
 mm/vmscan.c          | 72 ++++++++++++++++++++++++++++++++------------
 7 files changed, 76 insertions(+), 33 deletions(-)

-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
  2023-04-13 10:40 [PATCH v6 0/3] Ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
@ 2023-04-13 10:40 ` Yosry Ahmed
  2023-04-13 10:45   ` Yosry Ahmed
                     ` (3 more replies)
  2023-04-13 10:40 ` [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state() Yosry Ahmed
  2023-04-13 10:40 ` [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state Yosry Ahmed
  2 siblings, 4 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 10:40 UTC (permalink / raw)
  To: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Michal Hocko, Yu Zhao, Dave Chinner,
	Tim Chen
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm, Yosry Ahmed

We keep track of different types of reclaimed pages through
reclaim_state->reclaimed_slab, and we add them to the reported number
of reclaimed pages.  For non-memcg reclaim, this makes sense. For memcg
reclaim, we have no clue if those pages are charged to the memcg under
reclaim.

Slab pages are shared by different memcgs, so a freed slab page may have
only been partially charged to the memcg under reclaim.  The same goes for
clean file pages from pruned inodes (on highmem systems) or xfs buffer
pages; there is currently no simple way to link them to the memcg under
reclaim.

Stop reporting those freed pages as reclaimed pages during memcg reclaim.
This should make the return value of writing to memory.reclaim more
accurate, and may help reduce unnecessary reclaim retries during memcg
charging.  Writing to
memory.reclaim on the root memcg is considered as cgroup_reclaim(), but
for this case we want to include any freed pages, so use the
global_reclaim() check instead of !cgroup_reclaim().
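
The difference between the two checks can be sketched in userspace as
follows. This is a simplified model based on my reading of the
description above, not a copy of mm/vmscan.c: the struct layouts and the
model_ prefixed names are hypothetical, and the real predicates should be
checked against the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup. */
struct memcg_model {
	bool is_root;
};

/* Hypothetical stand-in for the relevant part of struct scan_control. */
struct scan_control_model {
	const struct memcg_model *target_mem_cgroup;	/* NULL for plain global reclaim */
};

/* True whenever reclaim was started on behalf of some memcg, root included. */
static bool model_cgroup_reclaim(const struct scan_control_model *sc)
{
	assert(sc != NULL);
	return sc->target_mem_cgroup != NULL;
}

/* True for plain global reclaim and for reclaim that targets the root memcg. */
static bool model_global_reclaim(const struct scan_control_model *sc)
{
	assert(sc != NULL);
	return sc->target_mem_cgroup == NULL || sc->target_mem_cgroup->is_root;
}

/* Under this patch, non-LRU pages are only counted when this returns true. */
static bool model_counts_non_lru_pages(const struct scan_control_model *sc)
{
	return model_global_reclaim(sc);
}
```

In this model, a write to memory.reclaim on the root memcg satisfies both
predicates, which is why !cgroup_reclaim() alone would wrongly exclude it.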

Generally, this should make the return value of
try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
freed a slab page that was mostly charged to the memcg under reclaim),
the return value of try_to_free_mem_cgroup_pages() can be underestimated,
but this should be fine. The freed pages will be uncharged anyway, and we
can charge the memcg the next time around as we usually do memcg reclaim
in a retry loop.

Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages")

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..be657832be48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -511,6 +511,46 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 }
 #endif
 
+/*
+ * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
+ * scan_control->nr_reclaimed.
+ */
+static void flush_reclaim_state(struct scan_control *sc)
+{
+	/*
+	 * Currently, reclaim_state->reclaimed includes three types of pages
+	 * freed outside of vmscan:
+	 * (1) Slab pages.
+	 * (2) Clean file pages from pruned inodes (on highmem systems).
+	 * (3) XFS freed buffer pages.
+	 *
+	 * For all of these cases, we cannot universally link the pages to a
+	 * single memcg. For example, a memcg-aware shrinker can free one object
+	 * charged to the target memcg, causing an entire page to be freed.
+	 * If we count the entire page as reclaimed from the memcg, we end up
+	 * overestimating the reclaimed amount (potentially under-reclaiming).
+	 *
+	 * Only count such pages for global reclaim to prevent under-reclaiming
+	 * from the target memcg; preventing unnecessary retries during memcg
+	 * charging and false positives from proactive reclaim.
+	 *
+	 * For uncommon cases where the freed pages were actually mostly
+	 * charged to the target memcg, we end up underestimating the reclaimed
+	 * amount. This should be fine. The freed pages will be uncharged
+	 * anyway, even if they are not counted here properly, and we will be
+	 * able to make forward progress in charging (which is usually in a
+	 * retry loop).
+	 *
+	 * We can go one step further, and report the uncharged objcg pages in
+	 * memcg reclaim, to make reporting more accurate and reduce
+	 * underestimation, but it's probably not worth the complexity for now.
+	 */
+	if (current->reclaim_state && global_reclaim(sc)) {
+		sc->nr_reclaimed += current->reclaim_state->reclaimed;
+		current->reclaim_state->reclaimed = 0;
+	}
+}
+
 static long xchg_nr_deferred(struct shrinker *shrinker,
 			     struct shrink_control *sc)
 {
@@ -5346,8 +5386,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 		vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
 			   sc->nr_reclaimed - reclaimed);
 
-	sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
-	current->reclaim_state->reclaimed_slab = 0;
+	flush_reclaim_state(sc);
 
 	return success ? MEMCG_LRU_YOUNG : 0;
 }
@@ -6450,7 +6489,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
-	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
@@ -6472,10 +6510,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 	shrink_node_memcgs(pgdat, sc);
 
-	if (reclaim_state) {
-		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-		reclaim_state->reclaimed_slab = 0;
-	}
+	flush_reclaim_state(sc);
 
 	/* Record the subtree's reclaim efficiency */
 	if (!sc->proactive)
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state()
  2023-04-13 10:40 [PATCH v6 0/3] Ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
  2023-04-13 10:40 ` [PATCH v6 1/3] mm: vmscan: ignore " Yosry Ahmed
@ 2023-04-13 10:40 ` Yosry Ahmed
  2023-04-13 11:19   ` David Hildenbrand
  2023-04-14  8:16   ` Michal Hocko
  2023-04-13 10:40 ` [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state Yosry Ahmed
  2 siblings, 2 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 10:40 UTC (permalink / raw)
  To: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Michal Hocko, Yu Zhao, Dave Chinner,
	Tim Chen
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm, Yosry Ahmed

Move set_task_reclaim_state() near flush_reclaim_state() so that all
helpers manipulating reclaim_state are in close proximity.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/vmscan.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index be657832be48..cb7d5a17c2b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -188,18 +188,6 @@ struct scan_control {
  */
 int vm_swappiness = 60;
 
-static void set_task_reclaim_state(struct task_struct *task,
-				   struct reclaim_state *rs)
-{
-	/* Check for an overwrite */
-	WARN_ON_ONCE(rs && task->reclaim_state);
-
-	/* Check for the nulling of an already-nulled member */
-	WARN_ON_ONCE(!rs && !task->reclaim_state);
-
-	task->reclaim_state = rs;
-}
-
 LIST_HEAD(shrinker_list);
 DECLARE_RWSEM(shrinker_rwsem);
 
@@ -511,6 +499,18 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 }
 #endif
 
+static void set_task_reclaim_state(struct task_struct *task,
+				   struct reclaim_state *rs)
+{
+	/* Check for an overwrite */
+	WARN_ON_ONCE(rs && task->reclaim_state);
+
+	/* Check for the nulling of an already-nulled member */
+	WARN_ON_ONCE(!rs && !task->reclaim_state);
+
+	task->reclaim_state = rs;
+}
+
 /*
  * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
  * scan_control->nr_reclaimed.
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-13 10:40 [PATCH v6 0/3] Ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
  2023-04-13 10:40 ` [PATCH v6 1/3] mm: vmscan: ignore " Yosry Ahmed
  2023-04-13 10:40 ` [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state() Yosry Ahmed
@ 2023-04-13 10:40 ` Yosry Ahmed
  2023-04-13 11:20   ` David Hildenbrand
  2023-04-14  8:18   ` Michal Hocko
  2 siblings, 2 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 10:40 UTC (permalink / raw)
  To: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Michal Hocko, Yu Zhao, Dave Chinner,
	Tim Chen
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm, Yosry Ahmed

During reclaim, we keep track of pages reclaimed through means other than
LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
which we stash a pointer to in current task_struct.

However, we keep track of more than just reclaimed slab pages through
this. We also use it for clean file pages dropped through pruned inodes,
and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add
a helper function that wraps updating it through current, so that future
changes to this logic are contained within include/linux/swap.h.
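
As a rough illustration of why a single wrapper helps, here is a
userspace model of the calling pattern. It is a sketch only: the global
pointer stands in for current->reclaim_state, and every model_ prefixed
name is hypothetical rather than part of the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct reclaim_state. */
struct reclaim_state_model {
	unsigned long reclaimed;	/* pages reclaimed outside LRU-based reclaim */
};

/* Stands in for current->reclaim_state; NULL when not in reclaim. */
static struct reclaim_state_model *model_current_reclaim_state;

/* Mark the start (rs != NULL) or the end (rs == NULL) of a reclaim pass. */
static void model_set_reclaim_state(struct reclaim_state_model *rs)
{
	/* Neither overwrite an active state nor clear an empty one. */
	assert(!(rs && model_current_reclaim_state));
	assert(rs || model_current_reclaim_state);
	model_current_reclaim_state = rs;
}

/* Single entry point that callers use instead of open-coding the check. */
static void model_account_reclaimed_pages(unsigned long pages)
{
	if (model_current_reclaim_state)
		model_current_reclaim_state->reclaimed += pages;
}
```

Callers such as a slab or buffer freeing path would then reduce to one
line, for example model_account_reclaimed_pages(1UL << order), and the
call is a no-op outside of reclaim.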

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 fs/inode.c           |  3 +--
 fs/xfs/xfs_buf.c     |  3 +--
 include/linux/swap.h | 17 ++++++++++++++++-
 mm/slab.c            |  3 +--
 mm/slob.c            |  6 ++----
 mm/slub.c            |  5 ++---
 6 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 4558dc2f1355..e60fcc41faf1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -864,8 +864,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
 				__count_vm_events(KSWAPD_INODESTEAL, reap);
 			else
 				__count_vm_events(PGINODESTEAL, reap);
-			if (current->reclaim_state)
-				current->reclaim_state->reclaimed_slab += reap;
+			mm_account_reclaimed_pages(reap);
 		}
 		iput(inode);
 		spin_lock(lru_lock);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 54c774af6e1c..15d1e5a7c2d3 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -286,8 +286,7 @@ xfs_buf_free_pages(
 		if (bp->b_pages[i])
 			__free_page(bp->b_pages[i]);
 	}
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += bp->b_page_count;
+	mm_account_reclaimed_pages(bp->b_page_count);
 
 	if (bp->b_pages != bp->b_page_array)
 		kmem_free(bp->b_pages);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 209a425739a9..e131ac155fb9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -153,13 +153,28 @@ union swap_header {
  * memory reclaim
  */
 struct reclaim_state {
-	unsigned long reclaimed_slab;
+	/* pages reclaimed outside of LRU-based reclaim */
+	unsigned long reclaimed;
 #ifdef CONFIG_LRU_GEN
 	/* per-thread mm walk data */
 	struct lru_gen_mm_walk *mm_walk;
 #endif
 };
 
+/*
+ * mm_account_reclaimed_pages(): account reclaimed pages outside of LRU-based
+ * reclaim
+ * @pages: number of pages reclaimed
+ *
+ * If the current process is undergoing a reclaim operation, increment the
+ * number of reclaimed pages by @pages.
+ */
+static inline void mm_account_reclaimed_pages(unsigned long pages)
+{
+	if (current->reclaim_state)
+		current->reclaim_state->reclaimed += pages;
+}
+
 #ifdef __KERNEL__
 
 struct address_space;
diff --git a/mm/slab.c b/mm/slab.c
index dabc2a671fc6..64bf1de817b2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1392,8 +1392,7 @@ static void kmem_freepages(struct kmem_cache *cachep, struct slab *slab)
 	smp_wmb();
 	__folio_clear_slab(folio);
 
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += 1 << order;
+	mm_account_reclaimed_pages(1 << order);
 	unaccount_slab(slab, order, cachep);
 	__free_pages(&folio->page, order);
 }
diff --git a/mm/slob.c b/mm/slob.c
index fe567fcfa3a3..79cc8680c973 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -61,7 +61,7 @@
 #include <linux/slab.h>
 
 #include <linux/mm.h>
-#include <linux/swap.h> /* struct reclaim_state */
+#include <linux/swap.h> /* mm_account_reclaimed_pages() */
 #include <linux/cache.h>
 #include <linux/init.h>
 #include <linux/export.h>
@@ -211,9 +211,7 @@ static void slob_free_pages(void *b, int order)
 {
 	struct page *sp = virt_to_page(b);
 
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += 1 << order;
-
+	mm_account_reclaimed_pages(1 << order);
 	mod_node_page_state(page_pgdat(sp), NR_SLAB_UNRECLAIMABLE_B,
 			    -(PAGE_SIZE << order));
 	__free_pages(sp, order);
diff --git a/mm/slub.c b/mm/slub.c
index 39327e98fce3..7aa30eef8235 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -11,7 +11,7 @@
  */
 
 #include <linux/mm.h>
-#include <linux/swap.h> /* struct reclaim_state */
+#include <linux/swap.h> /* mm_account_reclaimed_pages() */
 #include <linux/module.h>
 #include <linux/bit_spinlock.h>
 #include <linux/interrupt.h>
@@ -2063,8 +2063,7 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
 	/* Make the mapping reset visible before clearing the flag */
 	smp_wmb();
 	__folio_clear_slab(folio);
-	if (current->reclaim_state)
-		current->reclaim_state->reclaimed_slab += pages;
+	mm_account_reclaimed_pages(pages);
 	unaccount_slab(slab, order, s);
 	__free_pages(&folio->page, order);
 }
-- 
2.40.0.577.gac1e443424-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
  2023-04-13 10:40 ` [PATCH v6 1/3] mm: vmscan: ignore " Yosry Ahmed
@ 2023-04-13 10:45   ` Yosry Ahmed
  2023-04-13 11:16   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Darrick J. Wong, Christoph Lameter,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Roman Gushchin,
	Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Michal Hocko, Yu Zhao, Dave Chinner,
	Tim Chen, linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On Thu, Apr 13, 2023 at 3:40 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> We keep track of different types of reclaimed pages through
> reclaim_state->reclaimed_slab, and we add them to the reported number
> of reclaimed pages.  For non-memcg reclaim, this makes sense. For memcg
> reclaim, we have no clue if those pages are charged to the memcg under
> reclaim.
>
> Slab pages are shared by different memcgs, so a freed slab page may have
> only been partially charged to the memcg under reclaim.  The same goes for
> clean file pages from pruned inodes (on highmem systems) or xfs buffer
> pages; there is currently no simple way to link them to the memcg under
> reclaim.
>
> Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> This should make the return value of writing to memory.reclaim more
> accurate, and may help reduce unnecessary reclaim retries during memcg
> charging.  Writing to
> memory.reclaim on the root memcg is considered as cgroup_reclaim(), but
> for this case we want to include any freed pages, so use the
> global_reclaim() check instead of !cgroup_reclaim().
>
> Generally, this should make the return value of
> try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
> freed a slab page that was mostly charged to the memcg under reclaim),
> the return value of try_to_free_mem_cgroup_pages() can be underestimated,
> but this should be fine. The freed pages will be uncharged anyway, and we
> can charge the memcg the next time around as we usually do memcg reclaim
> in a retry loop.
>
> Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
> instead of pages")


Andrew, I removed the CC: stable as you were sceptical about the need
for a backport, but left the Fixes tag so that it's easy to identify
where to backport it if you and/or stable maintainers decide
otherwise.

>
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..be657832be48 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -511,6 +511,46 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
> + * scan_control->nr_reclaimed.
> + */
> +static void flush_reclaim_state(struct scan_control *sc)
> +{
> +       /*
> +        * Currently, reclaim_state->reclaimed includes three types of pages
> +        * freed outside of vmscan:
> +        * (1) Slab pages.
> +        * (2) Clean file pages from pruned inodes (on highmem systems).
> +        * (3) XFS freed buffer pages.
> +        *
> +        * For all of these cases, we cannot universally link the pages to a
> +        * single memcg. For example, a memcg-aware shrinker can free one object
> +        * charged to the target memcg, causing an entire page to be freed.
> +        * If we count the entire page as reclaimed from the memcg, we end up
> +        * overestimating the reclaimed amount (potentially under-reclaiming).
> +        *
> +        * Only count such pages for global reclaim to prevent under-reclaiming
> +        * from the target memcg; preventing unnecessary retries during memcg
> +        * charging and false positives from proactive reclaim.
> +        *
> +        * For uncommon cases where the freed pages were actually mostly
> +        * charged to the target memcg, we end up underestimating the reclaimed
> +        * amount. This should be fine. The freed pages will be uncharged
> +        * anyway, even if they are not counted here properly, and we will be
> +        * able to make forward progress in charging (which is usually in a
> +        * retry loop).
> +        *
> +        * We can go one step further, and report the uncharged objcg pages in
> +        * memcg reclaim, to make reporting more accurate and reduce
> +        * underestimation, but it's probably not worth the complexity for now.
> +        */
> +       if (current->reclaim_state && global_reclaim(sc)) {
> +               sc->nr_reclaimed += current->reclaim_state->reclaimed;
> +               current->reclaim_state->reclaimed = 0;
> +       }
> +}
> +
>  static long xchg_nr_deferred(struct shrinker *shrinker,
>                              struct shrink_control *sc)
>  {
> @@ -5346,8 +5386,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>                 vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
>                            sc->nr_reclaimed - reclaimed);
>
> -       sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
> -       current->reclaim_state->reclaimed_slab = 0;
> +       flush_reclaim_state(sc);
>
>         return success ? MEMCG_LRU_YOUNG : 0;
>  }
> @@ -6450,7 +6489,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>
>  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  {
> -       struct reclaim_state *reclaim_state = current->reclaim_state;
>         unsigned long nr_reclaimed, nr_scanned;
>         struct lruvec *target_lruvec;
>         bool reclaimable = false;
> @@ -6472,10 +6510,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
>         shrink_node_memcgs(pgdat, sc);
>
> -       if (reclaim_state) {
> -               sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -               reclaim_state->reclaimed_slab = 0;
> -       }
> +       flush_reclaim_state(sc);
>
>         /* Record the subtree's reclaim efficiency */
>         if (!sc->proactive)
> --
> 2.40.0.577.gac1e443424-goog
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
  2023-04-13 10:40 ` [PATCH v6 1/3] mm: vmscan: ignore " Yosry Ahmed
  2023-04-13 10:45   ` Yosry Ahmed
@ 2023-04-13 11:16   ` David Hildenbrand
  2023-04-13 11:25     ` Yosry Ahmed
  2023-04-14  8:15   ` Michal Hocko
  2023-05-01 10:12   ` Yosry Ahmed
  3 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2023-04-13 11:16 UTC (permalink / raw)
  To: Yosry Ahmed, Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Dave Chinner, Tim Chen
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On 13.04.23 12:40, Yosry Ahmed wrote:
> We keep track of different types of reclaimed pages through
> reclaim_state->reclaimed_slab, and we add them to the reported number
> of reclaimed pages.  For non-memcg reclaim, this makes sense. For memcg
> reclaim, we have no clue if those pages are charged to the memcg under
> reclaim.
> 
> Slab pages are shared by different memcgs, so a freed slab page may have
> only been partially charged to the memcg under reclaim.  The same goes for
> clean file pages from pruned inodes (on highmem systems) or xfs buffer
> pages; there is currently no simple way to link them to the memcg under
> reclaim.
> 
> Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> This should make the return value of writing to memory.reclaim more
> accurate, and may help reduce unnecessary reclaim retries during memcg
> charging.  Writing to
> memory.reclaim on the root memcg is considered as cgroup_reclaim(), but
> for this case we want to include any freed pages, so use the
> global_reclaim() check instead of !cgroup_reclaim().
> 
> Generally, this should make the return value of
> try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
> freed a slab page that was mostly charged to the memcg under reclaim),
> the return value of try_to_free_mem_cgroup_pages() can be underestimated,
> but this should be fine. The freed pages will be uncharged anyway, and we
> can charge the memcg the next time around as we usually do memcg reclaim
> in a retry loop.
> 
> Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
> instead of pages")
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---

LGTM, hopefully the underestimation won't result in a real issue.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state()
  2023-04-13 10:40 ` [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state() Yosry Ahmed
@ 2023-04-13 11:19   ` David Hildenbrand
  2023-04-13 11:26     ` Yosry Ahmed
  2023-04-14  8:16   ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2023-04-13 11:19 UTC (permalink / raw)
  To: Yosry Ahmed, Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Dave Chinner, Tim Chen
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On 13.04.23 12:40, Yosry Ahmed wrote:
> Move set_task_reclaim_state() near flush_reclaim_state() so that all
> helpers manipulating reclaim_state are in close proximity.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---

Hm, it's rather a simple helper to set the reclaim_state for a task, not 
to modify it.

No strong opinion, but I'd just leave it as is.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-13 10:40 ` [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state Yosry Ahmed
@ 2023-04-13 11:20   ` David Hildenbrand
  2023-04-13 11:29     ` Yosry Ahmed
  2023-04-14  8:18   ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2023-04-13 11:20 UTC (permalink / raw)
  To: Yosry Ahmed, Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Dave Chinner, Tim Chen
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On 13.04.23 12:40, Yosry Ahmed wrote:
> During reclaim, we keep track of pages reclaimed from other means than
> LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
> which we stash a pointer to in current task_struct.
> 
> However, we keep track of more than just reclaimed slab pages through
> this. We also use it for clean file pages dropped through pruned inodes,
> and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add

Would "reclaimed_non_lru" be more expressive? Then,

mm_account_reclaimed_pages() -> mm_account_non_lru_reclaimed_pages()


Apart from that LGTM.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
  2023-04-13 11:16   ` David Hildenbrand
@ 2023-04-13 11:25     ` Yosry Ahmed
  0 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 11:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Dave Chinner, Tim Chen, linux-fsdevel,
	linux-kernel, linux-xfs, linux-mm

On Thu, Apr 13, 2023 at 4:16 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.04.23 12:40, Yosry Ahmed wrote:
> > We keep track of different types of reclaimed pages through
> > reclaim_state->reclaimed_slab, and we add them to the reported number
> > of reclaimed pages.  For non-memcg reclaim, this makes sense. For memcg
> > reclaim, we have no clue if those pages are charged to the memcg under
> > reclaim.
> >
> > Slab pages are shared by different memcgs, so a freed slab page may have
> > only been partially charged to the memcg under reclaim.  The same goes for
> > clean file pages from pruned inodes (on highmem systems) or xfs buffer
> > pages; there is currently no simple way to link them to the memcg under
> > reclaim.
> >
> > Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> > This should make the return value of writing to memory.reclaim more
> > accurate, and may help reduce unnecessary reclaim retries during memcg
> > charging.  Writing to
> > memory.reclaim on the root memcg is considered as cgroup_reclaim(), but
> > for this case we want to include any freed pages, so use the
> > global_reclaim() check instead of !cgroup_reclaim().
> >
> > Generally, this should make the return value of
> > try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
> > freed a slab page that was mostly charged to the memcg under reclaim),
> > the return value of try_to_free_mem_cgroup_pages() can be underestimated,
> > but this should be fine. The freed pages will be uncharged anyway, and we
> > can charge the memcg the next time around as we usually do memcg reclaim
> > in a retry loop.
> >
> > Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
> > instead of pages")
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
>
> LGTM, hopefully the underestimation won't result in a real issue.
>
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state()
  2023-04-13 11:19   ` David Hildenbrand
@ 2023-04-13 11:26     ` Yosry Ahmed
  0 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 11:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Dave Chinner, Tim Chen, linux-fsdevel,
	linux-kernel, linux-xfs, linux-mm

On Thu, Apr 13, 2023 at 4:19 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.04.23 12:40, Yosry Ahmed wrote:
> > Move set_task_reclaim_state() near flush_reclaim_state() so that all
> > helpers manipulating reclaim_state are in close proximity.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
>
> Hm, it's rather a simple helper to set the reclaim_state for a task, not
> to modify it.
>
> No strong opinion, but I'd just leave it as is.

It's just personal taste to have helpers acting on the same data
structure next to one another. I don't feel strongly about it either,
I left it as a separate patch so that we can simply drop it. Peter
also thought the same, so maybe I should just drop it.

>
> --
> Thanks,
>
> David / dhildenb
>

* Re: [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-13 11:20   ` David Hildenbrand
@ 2023-04-13 11:29     ` Yosry Ahmed
  2023-04-13 11:31       ` David Hildenbrand
  2023-04-13 21:00       ` Dave Chinner
  0 siblings, 2 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 11:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Dave Chinner, Tim Chen, linux-fsdevel,
	linux-kernel, linux-xfs, linux-mm

On Thu, Apr 13, 2023 at 4:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.04.23 12:40, Yosry Ahmed wrote:
> > During reclaim, we keep track of pages reclaimed from other means than
> > LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
> > which we stash a pointer to in current task_struct.
> >
> > However, we keep track of more than just reclaimed slab pages through
> > this. We also use it for clean file pages dropped through pruned inodes,
> > and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add
>
> Would "reclaimed_non_lru" be more expressive? Then,
>
> mm_account_reclaimed_pages() -> mm_account_non_lru_reclaimed_pages()
>
>
> Apart from that LGTM.

Thanks!

I suck at naming things. If you think "reclaimed_non_lru" is better,
then we can do that. FWIW mm_account_reclaimed_pages() was taken from
a suggestion from Dave Chinner. My initial version had a terrible
name: report_freed_pages(), so I am happy with whatever you see fit.

Should I re-spin for this or can we change it in place?

>
> --
> Thanks,
>
> David / dhildenb
>

* Re: [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-13 11:29     ` Yosry Ahmed
@ 2023-04-13 11:31       ` David Hildenbrand
  2023-04-13 21:00       ` Dave Chinner
  1 sibling, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2023-04-13 11:31 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Dave Chinner, Tim Chen, linux-fsdevel,
	linux-kernel, linux-xfs, linux-mm

On 13.04.23 13:29, Yosry Ahmed wrote:
> On Thu, Apr 13, 2023 at 4:21 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 13.04.23 12:40, Yosry Ahmed wrote:
>>> During reclaim, we keep track of pages reclaimed from other means than
>>> LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
>>> which we stash a pointer to in current task_struct.
>>>
>>> However, we keep track of more than just reclaimed slab pages through
>>> this. We also use it for clean file pages dropped through pruned inodes,
>>> and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add
>>
>> Would "reclaimed_non_lru" be more expressive? Then,
>>
>> mm_account_reclaimed_pages() -> mm_account_non_lru_reclaimed_pages()
>>
>>
>> Apart from that LGTM.
> 
> Thanks!
> 
> I suck at naming things. If you think "reclaimed_non_lru" is better,
> then we can do that. FWIW mm_account_reclaimed_pages() was taken from
> a suggestion from Dave Chinner. My initial version had a terrible
> name: report_freed_pages(), so I am happy with whatever you see fit.
> 
> Should I re-spin for this or can we change it in place?

Respin would be good, but maybe wait a bit more on other comments. I'm 
bad at naming things as well :)

-- 
Thanks,

David / dhildenb


* Re: [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-13 11:29     ` Yosry Ahmed
  2023-04-13 11:31       ` David Hildenbrand
@ 2023-04-13 21:00       ` Dave Chinner
  2023-04-13 21:38         ` Yosry Ahmed
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2023-04-13 21:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: David Hildenbrand, Andrew Morton, Alexander Viro,
	Darrick J. Wong, Christoph Lameter, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
	Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Tim Chen, linux-fsdevel, linux-kernel,
	linux-xfs, linux-mm

On Thu, Apr 13, 2023 at 04:29:43AM -0700, Yosry Ahmed wrote:
> On Thu, Apr 13, 2023 at 4:21 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 13.04.23 12:40, Yosry Ahmed wrote:
> > > During reclaim, we keep track of pages reclaimed from other means than
> > > LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
> > > which we stash a pointer to in current task_struct.
> > >
> > > However, we keep track of more than just reclaimed slab pages through
> > > this. We also use it for clean file pages dropped through pruned inodes,
> > > and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add
> >
> > Would "reclaimed_non_lru" be more expressive? Then,
> >
> > mm_account_reclaimed_pages() -> mm_account_non_lru_reclaimed_pages()
> >
> >
> > Apart from that LGTM.
> 
> Thanks!
> 
> I suck at naming things. If you think "reclaimed_non_lru" is better,
> then we can do that. FWIW mm_account_reclaimed_pages() was taken from
> a suggestion from Dave Chinner. My initial version had a terrible
> name: report_freed_pages(), so I am happy with whatever you see fit.
> 
> Should I re-spin for this or can we change it in place?

I don't care for the noise all the bikeshed painting has generated
for a simple change like this.  If it's a fix for a bug, and the
naming is good enough, just merge it already, ok?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-13 21:00       ` Dave Chinner
@ 2023-04-13 21:38         ` Yosry Ahmed
  2023-04-14 21:47           ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-13 21:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: David Hildenbrand, Andrew Morton, Alexander Viro,
	Darrick J. Wong, Christoph Lameter, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Roman Gushchin, Hyeonggon Yoo,
	Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Tim Chen, linux-fsdevel, linux-kernel,
	linux-xfs, linux-mm

On Thu, Apr 13, 2023 at 2:01 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Apr 13, 2023 at 04:29:43AM -0700, Yosry Ahmed wrote:
> > On Thu, Apr 13, 2023 at 4:21 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 13.04.23 12:40, Yosry Ahmed wrote:
> > > > During reclaim, we keep track of pages reclaimed from other means than
> > > > LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
> > > > which we stash a pointer to in current task_struct.
> > > >
> > > > However, we keep track of more than just reclaimed slab pages through
> > > > this. We also use it for clean file pages dropped through pruned inodes,
> > > > and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add
> > >
> > > Would "reclaimed_non_lru" be more expressive? Then,
> > >
> > > mm_account_reclaimed_pages() -> mm_account_non_lru_reclaimed_pages()
> > >
> > >
> > > Apart from that LGTM.
> >
> > Thanks!
> >
> > I suck at naming things. If you think "reclaimed_non_lru" is better,
> > then we can do that. FWIW mm_account_reclaimed_pages() was taken from
> > a suggestion from Dave Chinner. My initial version had a terrible
> > name: report_freed_pages(), so I am happy with whatever you see fit.
> >
> > Should I re-spin for this or can we change it in place?
>
> I don't care for the noise all the bikeshed painting has generated
> for a simple change like this.  If it's a fix for a bug, and the
> naming is good enough, just merge it already, ok?

Sorry for all the noise. I think this version is in good enough shape.

Andrew, could you please replace v4 with this v6 without patch 2 as
multiple people pointed out that it is unneeded? Sorry for the hassle.

>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com

* Re: [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
  2023-04-13 10:40 ` [PATCH v6 1/3] mm: vmscan: ignore " Yosry Ahmed
  2023-04-13 10:45   ` Yosry Ahmed
  2023-04-13 11:16   ` David Hildenbrand
@ 2023-04-14  8:15   ` Michal Hocko
  2023-05-01 10:12   ` Yosry Ahmed
  3 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2023-04-14  8:15 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Yu Zhao, Dave Chinner, Tim Chen,
	linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On Thu 13-04-23 10:40:32, Yosry Ahmed wrote:
> We keep track of different types of reclaimed pages through
> reclaim_state->reclaimed_slab, and we add them to the reported number
> of reclaimed pages.  For non-memcg reclaim, this makes sense. For memcg
> reclaim, we have no clue if those pages are charged to the memcg under
> reclaim.
> 
> Slab pages are shared by different memcgs, so a freed slab page may have
> only been partially charged to the memcg under reclaim.  The same goes for
> clean file pages from pruned inodes (on highmem systems) or xfs buffer
> pages; there is currently no simple way to link them to the memcg under
> reclaim.
> 
> Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> This should make the return value of writing to memory.reclaim more accurate, and may
> help reduce unnecessary reclaim retries during memcg charging.  Writing to
> memory.reclaim on the root memcg is considered as cgroup_reclaim(), but
> for this case we want to include any freed pages, so use the
> global_reclaim() check instead of !cgroup_reclaim().
> 
> Generally, this should make the return value of
> try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
> freed a slab page that was mostly charged to the memcg under reclaim),
> the return value of try_to_free_mem_cgroup_pages() can be underestimated,
> but this should be fine. The freed pages will be uncharged anyway, and we
> can charge the memcg the next time around as we usually do memcg reclaim
> in a retry loop.
> 
> Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
> instead of pages")
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state()
  2023-04-13 10:40 ` [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state() Yosry Ahmed
  2023-04-13 11:19   ` David Hildenbrand
@ 2023-04-14  8:16   ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2023-04-14  8:16 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Yu Zhao, Dave Chinner, Tim Chen,
	linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On Thu 13-04-23 10:40:33, Yosry Ahmed wrote:
> Move set_task_reclaim_state() near flush_reclaim_state() so that all
> helpers manipulating reclaim_state are in close proximity.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-13 10:40 ` [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state Yosry Ahmed
  2023-04-13 11:20   ` David Hildenbrand
@ 2023-04-14  8:18   ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2023-04-14  8:18 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Yu Zhao, Dave Chinner, Tim Chen,
	linux-fsdevel, linux-kernel, linux-xfs, linux-mm

On Thu 13-04-23 10:40:34, Yosry Ahmed wrote:
> During reclaim, we keep track of pages reclaimed from other means than
> LRU-based reclaim through scan_control->reclaim_state->reclaimed_slab,
> which we stash a pointer to in current task_struct.
> 
> However, we keep track of more than just reclaimed slab pages through
> this. We also use it for clean file pages dropped through pruned inodes,
> and xfs buffer pages freed. Rename reclaimed_slab to reclaimed, and add
> a helper function that wraps updating it through current, so that future
> changes to this logic are contained within include/linux/swap.h.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs
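
For readers following the thread, the helper described in the quoted commit
message above can be modelled in userspace roughly as follows. This is only a
sketch under stated assumptions: the *_model names are hypothetical stand-ins
for struct reclaim_state, current and mm_account_reclaimed_pages(), and the
NULL check is an assumption about how a caller outside of reclaim would be
handled, not something this thread spells out.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct reclaim_state after the rename. */
struct reclaim_state_model {
	unsigned long reclaimed;	/* pages freed outside of LRU reclaim */
};

/* Hypothetical stand-in for the part of task_struct that matters here. */
struct task_model {
	struct reclaim_state_model *reclaim_state;
};

/* Stand-in for "current"; reclaim_state is NULL when not in reclaim. */
static struct task_model current_task_model;

/*
 * Model of a helper that wraps updating current->reclaim_state, so that
 * callers (slab, inode pruning, xfs buffers) never touch the field directly.
 */
static void account_reclaimed_pages_model(unsigned long pages)
{
	if (current_task_model.reclaim_state)
		current_task_model.reclaim_state->reclaimed += pages;
}
```

With a wrapper of this shape, a later rename of the field only touches the
helper rather than every caller, which is the motivation the commit message
gives for the refactor.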

* Re: [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-13 21:38         ` Yosry Ahmed
@ 2023-04-14 21:47           ` Andrew Morton
  2023-04-14 23:11             ` Yosry Ahmed
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2023-04-14 21:47 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Dave Chinner, David Hildenbrand, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Tim Chen, linux-fsdevel, linux-kernel,
	linux-xfs, linux-mm

On Thu, 13 Apr 2023 14:38:03 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:

> > > I suck at naming things. If you think "reclaimed_non_lru" is better,
> > > then we can do that. FWIW mm_account_reclaimed_pages() was taken from
> > > a suggestion from Dave Chinner. My initial version had a terrible
> > > name: report_freed_pages(), so I am happy with whatever you see fit.
> > >
> > > Should I re-spin for this or can we change it in place?
> >
> > I don't care for the noise all the bikeshed painting has generated
> > for a simple change like this.  If it's a fix for a bug, and the
> > naming is good enough, just merge it already, ok?
> 
> Sorry for all the noise. I think this version is in good enough shape.
> 
> Andrew, could you please replace v4 with this v6 without patch 2 as
> multiple people pointed out that it is unneeded? Sorry for the hassle.

I like patch 2!

mm.git presently has the v6 series.  All of it ;)

* Re: [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state
  2023-04-14 21:47           ` Andrew Morton
@ 2023-04-14 23:11             ` Yosry Ahmed
  0 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-14 23:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, David Hildenbrand, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, Johannes Weiner, Peter Xu, NeilBrown, Shakeel Butt,
	Michal Hocko, Yu Zhao, Tim Chen, linux-fsdevel, linux-kernel,
	linux-xfs, linux-mm

On Fri, Apr 14, 2023 at 2:47 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 13 Apr 2023 14:38:03 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > > > I suck at naming things. If you think "reclaimed_non_lru" is better,
> > > > then we can do that. FWIW mm_account_reclaimed_pages() was taken from
> > > > a suggestion from Dave Chinner. My initial version had a terrible
> > > > name: report_freed_pages(), so I am happy with whatever you see fit.
> > > >
> > > > Should I re-spin for this or can we change it in place?
> > >
> > > I don't care for the noise all the bikeshed painting has generated
> > > for a simple change like this.  If it's a fix for a bug, and the
> > > naming is good enough, just merge it already, ok?
> >
> > Sorry for all the noise. I think this version is in good enough shape.
> >
> > Andrew, could you please replace v4 with this v6 without patch 2 as
> > multiple people pointed out that it is unneeded? Sorry for the hassle.
>
> I like patch 2!
>
> mm.git presently has the v6 series.  All of it ;)

Thanks Andrew :)

* Re: [PATCH v6 1/3] mm: vmscan: ignore non-LRU-based reclaim in memcg reclaim
  2023-04-13 10:40 ` [PATCH v6 1/3] mm: vmscan: ignore " Yosry Ahmed
                     ` (2 preceding siblings ...)
  2023-04-14  8:15   ` Michal Hocko
@ 2023-05-01 10:12   ` Yosry Ahmed
  3 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-05-01 10:12 UTC (permalink / raw)
  To: Andrew Morton, Alexander Viro, Darrick J. Wong,
	Christoph Lameter, David Rientjes, Joonsoo Kim, Vlastimil Babka,
	Roman Gushchin, Hyeonggon Yoo, Matthew Wilcox (Oracle),
	Miaohe Lin, David Hildenbrand, Johannes Weiner, Peter Xu,
	NeilBrown, Shakeel Butt, Michal Hocko, Yu Zhao, Dave Chinner,
	Tim Chen
  Cc: linux-fsdevel, linux-kernel, linux-xfs, linux-mm, Sergey Senozhatsky

On Thu, Apr 13, 2023 at 3:40 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> We keep track of different types of reclaimed pages through
> reclaim_state->reclaimed_slab, and we add them to the reported number
> of reclaimed pages.  For non-memcg reclaim, this makes sense. For memcg
> reclaim, we have no clue if those pages are charged to the memcg under
> reclaim.
>
> Slab pages are shared by different memcgs, so a freed slab page may have
> only been partially charged to the memcg under reclaim.  The same goes for
> clean file pages from pruned inodes (on highmem systems) or xfs buffer
> pages; there is currently no simple way to link them to the memcg under
> reclaim.
>
> Stop reporting those freed pages as reclaimed pages during memcg reclaim.
> This should make the return value of writing to memory.reclaim more accurate, and may
> help reduce unnecessary reclaim retries during memcg charging.  Writing to
> memory.reclaim on the root memcg is considered as cgroup_reclaim(), but
> for this case we want to include any freed pages, so use the
> global_reclaim() check instead of !cgroup_reclaim().
>
> Generally, this should make the return value of
> try_to_free_mem_cgroup_pages() more accurate. In some limited cases (e.g.
> freed a slab page that was mostly charged to the memcg under reclaim),
> the return value of try_to_free_mem_cgroup_pages() can be underestimated,
> but this should be fine. The freed pages will be uncharged anyway, and we
> can charge the memcg the next time around as we usually do memcg reclaim
> in a retry loop.
>
> Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
> instead of pages")
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..be657832be48 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -511,6 +511,46 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  }
>  #endif
>
> +/*
> + * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
> + * scan_control->nr_reclaimed.
> + */
> +static void flush_reclaim_state(struct scan_control *sc)
> +{
> +       /*
> +        * Currently, reclaim_state->reclaimed includes three types of pages
> +        * freed outside of vmscan:
> +        * (1) Slab pages.
> +        * (2) Clean file pages from pruned inodes (on highmem systems).
> +        * (3) XFS freed buffer pages.
> +        *
> +        * For all of these cases, we cannot universally link the pages to a
> +        * single memcg. For example, a memcg-aware shrinker can free one object
> +        * charged to the target memcg, causing an entire page to be freed.
> +        * If we count the entire page as reclaimed from the memcg, we end up
> +        * overestimating the reclaimed amount (potentially under-reclaiming).
> +        *
> +        * Only count such pages for global reclaim to prevent under-reclaiming
> +        * from the target memcg; preventing unnecessary retries during memcg
> +        * charging and false positives from proactive reclaim.
> +        *
> +        * For uncommon cases where the freed pages were actually mostly
> +        * charged to the target memcg, we end up underestimating the reclaimed
> +        * amount. This should be fine. The freed pages will be uncharged
> +        * anyway, even if they are not counted here properly, and we will be
> +        * able to make forward progress in charging (which is usually in a
> +        * retry loop).
> +        *
> +        * We can go one step further, and report the uncharged objcg pages in
> +        * memcg reclaim, to make reporting more accurate and reduce
> +        * underestimation, but it's probably not worth the complexity for now.
> +        */
> +       if (current->reclaim_state && global_reclaim(sc)) {
> +               sc->nr_reclaimed += current->reclaim_state->reclaimed;
> +               current->reclaim_state->reclaimed = 0;

Ugh.. this breaks the build. This should have been
current->reclaim_state->reclaimed_slab. It doesn't get renamed from
"reclaimed_slab" to "reclaimed" until the next patch. When I moved
flush_reclaim_state() from patch 2 to patch 1 I forgot to adjust it.
My bad.

The break is fixed by the very next patch, and the patches have
already landed in Linus's tree, so there isn't much that can be done
at this point. Sorry about that. Just wondering, why wouldn't this
breakage be caught by any of the build bots?

> +       }
> +}
> +
>  static long xchg_nr_deferred(struct shrinker *shrinker,
>                              struct shrink_control *sc)
>  {
> @@ -5346,8 +5386,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>                 vmpressure(sc->gfp_mask, memcg, false, sc->nr_scanned - scanned,
>                            sc->nr_reclaimed - reclaimed);
>
> -       sc->nr_reclaimed += current->reclaim_state->reclaimed_slab;
> -       current->reclaim_state->reclaimed_slab = 0;
> +       flush_reclaim_state(sc);
>
>         return success ? MEMCG_LRU_YOUNG : 0;
>  }
> @@ -6450,7 +6489,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>
>  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  {
> -       struct reclaim_state *reclaim_state = current->reclaim_state;
>         unsigned long nr_reclaimed, nr_scanned;
>         struct lruvec *target_lruvec;
>         bool reclaimable = false;
> @@ -6472,10 +6510,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>
>         shrink_node_memcgs(pgdat, sc);
>
> -       if (reclaim_state) {
> -               sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> -               reclaim_state->reclaimed_slab = 0;
> -       }
> +       flush_reclaim_state(sc);
>
>         /* Record the subtree's reclaim efficiency */
>         if (!sc->proactive)
> --
> 2.40.0.577.gac1e443424-goog
>
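
For reference, the behaviour the quoted flush_reclaim_state() hunk is aiming
for can be modelled in userspace as below. This is a simplified sketch, not
the kernel code: the struct layouts are reduced to the fields discussed in
this thread, the "global" flag is a hypothetical stand-in for
global_reclaim(sc), and task_reclaim_state stands in for
current->reclaim_state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real definitions. */
struct reclaim_state {
	unsigned long reclaimed;	/* pages freed outside of LRU reclaim */
};

struct scan_control {
	unsigned long nr_reclaimed;	/* pages reported as reclaimed */
	bool global;			/* hypothetical: models global_reclaim(sc) */
};

/* Stand-in for current->reclaim_state; NULL when no state is installed. */
static struct reclaim_state *task_reclaim_state;

/*
 * Fold non-LRU reclaimed pages into sc->nr_reclaimed, but only for global
 * reclaim. For memcg reclaim the pages cannot be attributed to the target
 * memcg, so they are left out of the reported count.
 */
static void flush_reclaim_state_model(struct scan_control *sc)
{
	if (task_reclaim_state && sc->global) {
		sc->nr_reclaimed += task_reclaim_state->reclaimed;
		task_reclaim_state->reclaimed = 0;
	}
}
```

In this model a memcg-targeted scan_control leaves both nr_reclaimed and the
pending counter untouched, mirroring the hunk above where the add and the
reset sit inside the same condition.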

Thread overview: 20+ messages
2023-04-13 10:40 [PATCH v6 0/3] Ignore non-LRU-based reclaim in memcg reclaim Yosry Ahmed
2023-04-13 10:40 ` [PATCH v6 1/3] mm: vmscan: ignore " Yosry Ahmed
2023-04-13 10:45   ` Yosry Ahmed
2023-04-13 11:16   ` David Hildenbrand
2023-04-13 11:25     ` Yosry Ahmed
2023-04-14  8:15   ` Michal Hocko
2023-05-01 10:12   ` Yosry Ahmed
2023-04-13 10:40 ` [PATCH v6 2/3] mm: vmscan: move set_task_reclaim_state() near flush_reclaim_state() Yosry Ahmed
2023-04-13 11:19   ` David Hildenbrand
2023-04-13 11:26     ` Yosry Ahmed
2023-04-14  8:16   ` Michal Hocko
2023-04-13 10:40 ` [PATCH v6 3/3] mm: vmscan: refactor updating current->reclaim_state Yosry Ahmed
2023-04-13 11:20   ` David Hildenbrand
2023-04-13 11:29     ` Yosry Ahmed
2023-04-13 11:31       ` David Hildenbrand
2023-04-13 21:00       ` Dave Chinner
2023-04-13 21:38         ` Yosry Ahmed
2023-04-14 21:47           ` Andrew Morton
2023-04-14 23:11             ` Yosry Ahmed
2023-04-14  8:18   ` Michal Hocko
