linux-fsdevel.vger.kernel.org archive mirror
* [PATCH rfc 0/4] mm: filemap: try to batch lruvec stat updating
@ 2024-04-29  7:24 Kefeng Wang
  2024-04-29  7:24 ` [PATCH rfc 1/4] mm: memory: add prepare_range_pte_entry() Kefeng Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Kefeng Wang @ 2024-04-29  7:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel, Kefeng Wang

Similar to the mm counter (rss) batching, try to batch the lruvec stat
updating as well. This could save most of the updating time when all
folios belong to the same memcg/pgdat; lat_pagefault shows a 3~4%
improvement.
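
The gist, simplified from patch 4: accumulate the NR_FILE_MAPPED delta
while mapping, and only flush it when the (memcg, pgdat) pair changes or
the mapping loop ends:

	memcg = folio_memcg(folio);
	pgdat = folio_pgdat(folio);
	do {
		/* flush the pending delta when switching memcg/pgdat */
		if (folio_memcg(folio) != memcg || folio_pgdat(folio) != pgdat) {
			filemap_lruvec_stat_update(memcg, pgdat, nr_mapped);
			nr_mapped = 0;
			memcg = folio_memcg(folio);
			pgdat = folio_pgdat(folio);
		}

		/* ... map the folio, only accumulating into nr_mapped ... */
	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
	/* flush whatever is left for the last (memcg, pgdat) */
	filemap_lruvec_stat_update(memcg, pgdat, nr_mapped);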

Kefeng Wang (4):
  mm: memory: add prepare_range_pte_entry()
  mm: filemap: add filemap_set_pte_range()
  mm: filemap: move __lruvec_stat_mod_folio() out of
    filemap_set_pte_range()
  mm: filemap: try to batch lruvec stat updating

 include/linux/mm.h   |  2 ++
 include/linux/rmap.h |  2 ++
 mm/filemap.c         | 75 ++++++++++++++++++++++++++++++++++++--------
 mm/memory.c          | 33 ++++++++++++-------
 mm/rmap.c            | 16 ++++++++++
 5 files changed, 104 insertions(+), 24 deletions(-)

-- 
2.27.0



* [PATCH rfc 1/4] mm: memory: add prepare_range_pte_entry()
  2024-04-29  7:24 [PATCH rfc 0/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
@ 2024-04-29  7:24 ` Kefeng Wang
  2024-04-29  7:24 ` [PATCH rfc 2/4] mm: filemap: add filemap_set_pte_range() Kefeng Wang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Kefeng Wang @ 2024-04-29  7:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel, Kefeng Wang

In preparation for a separate filemap_set_pte_range(), factor out a
prepare_range_pte_entry() helper from set_pte_range(). No functional
changes.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 33 ++++++++++++++++++++++-----------
 2 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..bcbeb8a4cd43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1372,6 +1372,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 }
 
 vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
+pte_t prepare_range_pte_entry(struct vm_fault *vmf, bool write, struct folio *folio,
+		struct page *page, unsigned int nr, unsigned long addr);
 void set_pte_range(struct vm_fault *vmf, struct folio *folio,
 		struct page *page, unsigned int nr, unsigned long addr);
 
diff --git a/mm/memory.c b/mm/memory.c
index 6647685fd3c4..ccbeb58fa136 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4652,19 +4652,11 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 }
 #endif
 
-/**
- * set_pte_range - Set a range of PTEs to point to pages in a folio.
- * @vmf: Fault decription.
- * @folio: The folio that contains @page.
- * @page: The first page to create a PTE for.
- * @nr: The number of PTEs to create.
- * @addr: The first address to create a PTE for.
- */
-void set_pte_range(struct vm_fault *vmf, struct folio *folio,
-		struct page *page, unsigned int nr, unsigned long addr)
+pte_t prepare_range_pte_entry(struct vm_fault *vmf, bool write,
+			      struct folio *folio, struct page *page,
+			      unsigned int nr, unsigned long addr)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool prefault = in_range(vmf->address, addr, nr * PAGE_SIZE);
 	pte_t entry;
 
@@ -4680,6 +4672,25 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	if (unlikely(vmf_orig_pte_uffd_wp(vmf)))
 		entry = pte_mkuffd_wp(entry);
+
+	return entry;
+}
+
+/**
+ * set_pte_range - Set a range of PTEs to point to pages in a folio.
+ * @vmf: Fault description.
+ * @folio: The folio that contains @page.
+ * @page: The first page to create a PTE for.
+ * @nr: The number of PTEs to create.
+ * @addr: The first address to create a PTE for.
+ */
+void set_pte_range(struct vm_fault *vmf, struct folio *folio,
+		struct page *page, unsigned int nr, unsigned long addr)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	pte_t entry = prepare_range_pte_entry(vmf, write, folio, page, nr, addr);
+
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		VM_BUG_ON_FOLIO(nr != 1, folio);
-- 
2.27.0



* [PATCH rfc 2/4] mm: filemap: add filemap_set_pte_range()
  2024-04-29  7:24 [PATCH rfc 0/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
  2024-04-29  7:24 ` [PATCH rfc 1/4] mm: memory: add prepare_range_pte_entry() Kefeng Wang
@ 2024-04-29  7:24 ` Kefeng Wang
  2024-04-29  7:24 ` [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range() Kefeng Wang
  2024-04-29  7:24 ` [PATCH rfc 4/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
  3 siblings, 0 replies; 14+ messages in thread
From: Kefeng Wang @ 2024-04-29  7:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel, Kefeng Wang

Add filemap_set_pte_range(), independent of set_pte_range(), to unify
the rss and folio reference updates for small and large folios. This
also prepares for the upcoming lruvec stat batch updating.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/filemap.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ec273b00ce5f..7019692daddd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3499,6 +3499,25 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
 	return NULL;
 }
 
+static void filemap_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+			struct page *page, unsigned int nr, unsigned long addr,
+			unsigned long *rss)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	pte_t entry;
+
+	entry = prepare_range_pte_entry(vmf, false, folio, page, nr, addr);
+
+	folio_add_file_rmap_ptes(folio, page, nr, vma);
+	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
+
+	/* no need to invalidate: a not-present page won't be cached */
+	update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr);
+
+	*rss += nr;
+	folio_ref_add(folio, nr);
+}
+
 /*
  * Map page range [start_page, start_page + nr_pages) of folio.
  * start_page is gotten from start by folio_page(folio, start)
@@ -3539,9 +3558,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		continue;
 skip:
 		if (count) {
-			set_pte_range(vmf, folio, page, count, addr);
-			*rss += count;
-			folio_ref_add(folio, count);
+			filemap_set_pte_range(vmf, folio, page, count, addr, rss);
 			if (in_range(vmf->address, addr, count * PAGE_SIZE))
 				ret = VM_FAULT_NOPAGE;
 		}
@@ -3554,9 +3571,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	} while (--nr_pages > 0);
 
 	if (count) {
-		set_pte_range(vmf, folio, page, count, addr);
-		*rss += count;
-		folio_ref_add(folio, count);
+		filemap_set_pte_range(vmf, folio, page, count, addr, rss);
 		if (in_range(vmf->address, addr, count * PAGE_SIZE))
 			ret = VM_FAULT_NOPAGE;
 	}
@@ -3591,9 +3606,7 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
 	if (vmf->address == addr)
 		ret = VM_FAULT_NOPAGE;
 
-	set_pte_range(vmf, folio, page, 1, addr);
-	(*rss)++;
-	folio_ref_inc(folio);
+	filemap_set_pte_range(vmf, folio, page, 1, addr, rss);
 
 	return ret;
 }
-- 
2.27.0



* [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range()
  2024-04-29  7:24 [PATCH rfc 0/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
  2024-04-29  7:24 ` [PATCH rfc 1/4] mm: memory: add prepare_range_pte_entry() Kefeng Wang
  2024-04-29  7:24 ` [PATCH rfc 2/4] mm: filemap: add filemap_set_pte_range() Kefeng Wang
@ 2024-04-29  7:24 ` Kefeng Wang
  2024-05-07 11:11   ` David Hildenbrand
  2024-04-29  7:24 ` [PATCH rfc 4/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
  3 siblings, 1 reply; 14+ messages in thread
From: Kefeng Wang @ 2024-04-29  7:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel, Kefeng Wang

Add __folio_add_file_rmap_ptes(), which does not update the lruvec
stat, and use it in filemap_set_pte_range(). With it, the lruvec stat
updating is moved into the caller. No functional changes.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 include/linux/rmap.h |  2 ++
 mm/filemap.c         | 27 ++++++++++++++++++---------
 mm/rmap.c            | 16 ++++++++++++++++
 3 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7229b9baf20d..43014ddd06f9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -242,6 +242,8 @@ void folio_add_anon_rmap_pmd(struct folio *, struct page *,
 		struct vm_area_struct *, unsigned long address, rmap_t flags);
 void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
+int __folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
+		struct vm_area_struct *);
 void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
 		struct vm_area_struct *);
 #define folio_add_file_rmap_pte(folio, page, vma) \
diff --git a/mm/filemap.c b/mm/filemap.c
index 7019692daddd..3966b6616d02 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3501,14 +3501,15 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
 
 static void filemap_set_pte_range(struct vm_fault *vmf, struct folio *folio,
 			struct page *page, unsigned int nr, unsigned long addr,
-			unsigned long *rss)
+			unsigned long *rss, int *nr_mapped)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	pte_t entry;
 
 	entry = prepare_range_pte_entry(vmf, false, folio, page, nr, addr);
 
-	folio_add_file_rmap_ptes(folio, page, nr, vma);
+	*nr_mapped += __folio_add_file_rmap_ptes(folio, page, nr, vma);
+
 	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
 
 	/* no need to invalidate: a not-present page won't be cached */
@@ -3525,7 +3526,8 @@ static void filemap_set_pte_range(struct vm_fault *vmf, struct folio *folio,
 static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 			struct folio *folio, unsigned long start,
 			unsigned long addr, unsigned int nr_pages,
-			unsigned long *rss, unsigned int *mmap_miss)
+			unsigned long *rss, int *nr_mapped,
+			unsigned int *mmap_miss)
 {
 	vm_fault_t ret = 0;
 	struct page *page = folio_page(folio, start);
@@ -3558,7 +3560,8 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		continue;
 skip:
 		if (count) {
-			filemap_set_pte_range(vmf, folio, page, count, addr, rss);
+			filemap_set_pte_range(vmf, folio, page, count, addr,
+					      rss, nr_mapped);
 			if (in_range(vmf->address, addr, count * PAGE_SIZE))
 				ret = VM_FAULT_NOPAGE;
 		}
@@ -3571,7 +3574,8 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	} while (--nr_pages > 0);
 
 	if (count) {
-		filemap_set_pte_range(vmf, folio, page, count, addr, rss);
+		filemap_set_pte_range(vmf, folio, page, count, addr, rss,
+				      nr_mapped);
 		if (in_range(vmf->address, addr, count * PAGE_SIZE))
 			ret = VM_FAULT_NOPAGE;
 	}
@@ -3583,7 +3587,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 
 static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
 		struct folio *folio, unsigned long addr,
-		unsigned long *rss, unsigned int *mmap_miss)
+		unsigned long *rss, int *nr_mapped, unsigned int *mmap_miss)
 {
 	vm_fault_t ret = 0;
 	struct page *page = &folio->page;
@@ -3606,7 +3610,7 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
 	if (vmf->address == addr)
 		ret = VM_FAULT_NOPAGE;
 
-	filemap_set_pte_range(vmf, folio, page, 1, addr, rss);
+	filemap_set_pte_range(vmf, folio, page, 1, addr, rss, nr_mapped);
 
 	return ret;
 }
@@ -3646,6 +3650,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;
+		int nr_mapped = 0;
 
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
@@ -3655,11 +3660,15 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 
 		if (!folio_test_large(folio))
 			ret |= filemap_map_order0_folio(vmf,
-					folio, addr, &rss, &mmap_miss);
+					folio, addr, &rss, &nr_mapped,
+					&mmap_miss);
 		else
 			ret |= filemap_map_folio_range(vmf, folio,
 					xas.xa_index - folio->index, addr,
-					nr_pages, &rss, &mmap_miss);
+					nr_pages, &rss, &nr_mapped,
+					&mmap_miss);
+
+		__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr_mapped);
 
 		folio_unlock(folio);
 		folio_put(folio);
diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..55face4024f2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1452,6 +1452,22 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
 		mlock_vma_folio(folio, vma);
 }
 
+int __folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
+		int nr_pages, struct vm_area_struct *vma)
+{
+	int nr, nr_pmdmapped = 0;
+
+	VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
+
+	nr = __folio_add_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE,
+			      &nr_pmdmapped);
+
+	/* See comments in folio_add_anon_rmap_*() */
+	if (!folio_test_large(folio))
+		mlock_vma_folio(folio, vma);
+
+	return nr;
+}
 /**
  * folio_add_file_rmap_ptes - add PTE mappings to a page range of a folio
  * @folio:	The folio to add the mappings to
-- 
2.27.0



* [PATCH rfc 4/4] mm: filemap: try to batch lruvec stat updating
  2024-04-29  7:24 [PATCH rfc 0/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
                   ` (2 preceding siblings ...)
  2024-04-29  7:24 ` [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range() Kefeng Wang
@ 2024-04-29  7:24 ` Kefeng Wang
  2024-05-07  9:06   ` Kefeng Wang
  3 siblings, 1 reply; 14+ messages in thread
From: Kefeng Wang @ 2024-04-29  7:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel, Kefeng Wang

filemap_map_pages() tries to map several pages (e.g. 16 pages), but the
lruvec stat updating is done for each mapping. Since the updating is
time-consuming, especially with memcg, batch it as long as the memcg
and pgdat stay the same during the mapping. In the common case this
saves most of the lruvec stat updating cost; lat_pagefault shows a 3~4%
improvement.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 mm/filemap.c | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3966b6616d02..b27281707098 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3615,6 +3615,20 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
 	return ret;
 }
 
+static void filemap_lruvec_stat_update(struct mem_cgroup *memcg,
+				       pg_data_t *pgdat, int nr)
+{
+	struct lruvec *lruvec;
+
+	if (!memcg) {
+		__mod_node_page_state(pgdat, NR_FILE_MAPPED, nr);
+		return;
+	}
+
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	__mod_lruvec_state(lruvec, NR_FILE_MAPPED, nr);
+}
+
 vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			     pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -3628,6 +3642,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	vm_fault_t ret = 0;
 	unsigned long rss = 0;
 	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+	struct mem_cgroup *memcg, *memcg_cur;
+	pg_data_t *pgdat, *pgdat_cur;
+	int nr_mapped = 0;
 
 	rcu_read_lock();
 	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
@@ -3648,9 +3665,20 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	}
 
 	folio_type = mm_counter_file(folio);
+	memcg = folio_memcg(folio);
+	pgdat = folio_pgdat(folio);
 	do {
 		unsigned long end;
-		int nr_mapped = 0;
+
+		memcg_cur = folio_memcg(folio);
+		pgdat_cur = folio_pgdat(folio);
+
+		if (unlikely(memcg != memcg_cur || pgdat != pgdat_cur)) {
+			filemap_lruvec_stat_update(memcg, pgdat, nr_mapped);
+			nr_mapped = 0;
+			memcg = memcg_cur;
+			pgdat = pgdat_cur;
+		}
 
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
@@ -3668,11 +3696,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 					nr_pages, &rss, &nr_mapped,
 					&mmap_miss);
 
-		__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr_mapped);
-
 		folio_unlock(folio);
 		folio_put(folio);
 	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
+	filemap_lruvec_stat_update(memcg, pgdat, nr_mapped);
 	add_mm_counter(vma->vm_mm, folio_type, rss);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
-- 
2.27.0



* Re: [PATCH rfc 4/4] mm: filemap: try to batch lruvec stat updating
  2024-04-29  7:24 ` [PATCH rfc 4/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
@ 2024-05-07  9:06   ` Kefeng Wang
  2024-05-09 14:01     ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Kefeng Wang @ 2024-05-07  9:06 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel, Muchun Song

+ memcg maintainers and David too; please check all the patches from the link:

https://lore.kernel.org/linux-mm/20240429072417.2146732-1-wangkefeng.wang@huawei.com/

Thanks

On 2024/4/29 15:24, Kefeng Wang wrote:
> The filemap_map_pages() tries to map few pages(eg, 16 pages), but the
> lruvec stat updating is called on each mapping, since the updating is
> time-consuming, especially with memcg, so try to batch it when the memcg
> and pgdat are same during the mapping, if luckily, we could save most of
> time of lruvec stat updating, the lat_pagefault shows 3~4% improvement.
> 
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   mm/filemap.c | 33 ++++++++++++++++++++++++++++++---
>   1 file changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3966b6616d02..b27281707098 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3615,6 +3615,20 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
>   	return ret;
>   }
>   
> +static void filemap_lruvec_stat_update(struct mem_cgroup *memcg,
> +				       pg_data_t *pgdat, int nr)
> +{
> +	struct lruvec *lruvec;
> +
> +	if (!memcg) {
> +		__mod_node_page_state(pgdat, NR_FILE_MAPPED, nr);
> +		return;
> +	}
> +
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +	__mod_lruvec_state(lruvec, NR_FILE_MAPPED, nr);
> +}
> +
>   vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   			     pgoff_t start_pgoff, pgoff_t end_pgoff)
>   {
> @@ -3628,6 +3642,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   	vm_fault_t ret = 0;
>   	unsigned long rss = 0;
>   	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
> +	struct mem_cgroup *memcg, *memcg_cur;
> +	pg_data_t *pgdat, *pgdat_cur;
> +	int nr_mapped = 0;
>   
>   	rcu_read_lock();
>   	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> @@ -3648,9 +3665,20 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   	}
>   
>   	folio_type = mm_counter_file(folio);
> +	memcg = folio_memcg(folio);
> +	pgdat = folio_pgdat(folio);
>   	do {
>   		unsigned long end;
> -		int nr_mapped = 0;
> +
> +		memcg_cur = folio_memcg(folio);
> +		pgdat_cur = folio_pgdat(folio);
> +
> +		if (unlikely(memcg != memcg_cur || pgdat != pgdat_cur)) {
> +			filemap_lruvec_stat_update(memcg, pgdat, nr_mapped);
> +			nr_mapped = 0;
> +			memcg = memcg_cur;
> +			pgdat = pgdat_cur;
> +		}
>   
>   		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
>   		vmf->pte += xas.xa_index - last_pgoff;
> @@ -3668,11 +3696,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   					nr_pages, &rss, &nr_mapped,
>   					&mmap_miss);
>   
> -		__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr_mapped);
> -
>   		folio_unlock(folio);
>   		folio_put(folio);
>   	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
> +	filemap_lruvec_stat_update(memcg, pgdat, nr_mapped);
>   	add_mm_counter(vma->vm_mm, folio_type, rss);
>   	pte_unmap_unlock(vmf->pte, vmf->ptl);
>   out:


* Re: [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range()
  2024-04-29  7:24 ` [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range() Kefeng Wang
@ 2024-05-07 11:11   ` David Hildenbrand
  2024-05-07 13:12     ` Kefeng Wang
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2024-05-07 11:11 UTC (permalink / raw)
  To: Kefeng Wang, Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel

On 29.04.24 09:24, Kefeng Wang wrote:
> Adding __folio_add_file_rmap_ptes() which don't update lruvec stat, it
> is used in filemap_set_pte_range(), with it, lruvec stat updating is
> moved into the caller, no functional changes.
> 
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>   include/linux/rmap.h |  2 ++
>   mm/filemap.c         | 27 ++++++++++++++++++---------
>   mm/rmap.c            | 16 ++++++++++++++++
>   3 files changed, 36 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 7229b9baf20d..43014ddd06f9 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -242,6 +242,8 @@ void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>   		struct vm_area_struct *, unsigned long address, rmap_t flags);
>   void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>   		unsigned long address);
> +int __folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> +		struct vm_area_struct *);
>   void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>   		struct vm_area_struct *);
>   #define folio_add_file_rmap_pte(folio, page, vma) \
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7019692daddd..3966b6616d02 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3501,14 +3501,15 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
>   
>   static void filemap_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>   			struct page *page, unsigned int nr, unsigned long addr,
> -			unsigned long *rss)
> +			unsigned long *rss, int *nr_mapped)
>   {
>   	struct vm_area_struct *vma = vmf->vma;
>   	pte_t entry;
>   
>   	entry = prepare_range_pte_entry(vmf, false, folio, page, nr, addr);
>   
> -	folio_add_file_rmap_ptes(folio, page, nr, vma);
> +	*nr_mapped += __folio_add_file_rmap_ptes(folio, page, nr, vma);
> +
>   	set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>   
>   	/* no need to invalidate: a not-present page won't be cached */
> @@ -3525,7 +3526,8 @@ static void filemap_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>   static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>   			struct folio *folio, unsigned long start,
>   			unsigned long addr, unsigned int nr_pages,
> -			unsigned long *rss, unsigned int *mmap_miss)
> +			unsigned long *rss, int *nr_mapped,
> +			unsigned int *mmap_miss)
>   {
>   	vm_fault_t ret = 0;
>   	struct page *page = folio_page(folio, start);
> @@ -3558,7 +3560,8 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>   		continue;
>   skip:
>   		if (count) {
> -			filemap_set_pte_range(vmf, folio, page, count, addr, rss);
> +			filemap_set_pte_range(vmf, folio, page, count, addr,
> +					      rss, nr_mapped);
>   			if (in_range(vmf->address, addr, count * PAGE_SIZE))
>   				ret = VM_FAULT_NOPAGE;
>   		}
> @@ -3571,7 +3574,8 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>   	} while (--nr_pages > 0);
>   
>   	if (count) {
> -		filemap_set_pte_range(vmf, folio, page, count, addr, rss);
> +		filemap_set_pte_range(vmf, folio, page, count, addr, rss,
> +				      nr_mapped);
>   		if (in_range(vmf->address, addr, count * PAGE_SIZE))
>   			ret = VM_FAULT_NOPAGE;
>   	}
> @@ -3583,7 +3587,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>   
>   static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
>   		struct folio *folio, unsigned long addr,
> -		unsigned long *rss, unsigned int *mmap_miss)
> +		unsigned long *rss, int *nr_mapped, unsigned int *mmap_miss)
>   {
>   	vm_fault_t ret = 0;
>   	struct page *page = &folio->page;
> @@ -3606,7 +3610,7 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
>   	if (vmf->address == addr)
>   		ret = VM_FAULT_NOPAGE;
>   
> -	filemap_set_pte_range(vmf, folio, page, 1, addr, rss);
> +	filemap_set_pte_range(vmf, folio, page, 1, addr, rss, nr_mapped);
>   
>   	return ret;
>   }
> @@ -3646,6 +3650,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   	folio_type = mm_counter_file(folio);
>   	do {
>   		unsigned long end;
> +		int nr_mapped = 0;
>   
>   		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
>   		vmf->pte += xas.xa_index - last_pgoff;
> @@ -3655,11 +3660,15 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   
>   		if (!folio_test_large(folio))
>   			ret |= filemap_map_order0_folio(vmf,
> -					folio, addr, &rss, &mmap_miss);
> +					folio, addr, &rss, &nr_mapped,
> +					&mmap_miss);
>   		else
>   			ret |= filemap_map_folio_range(vmf, folio,
>   					xas.xa_index - folio->index, addr,
> -					nr_pages, &rss, &mmap_miss);
> +					nr_pages, &rss, &nr_mapped,
> +					&mmap_miss);
> +
> +		__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr_mapped);
>   
>   		folio_unlock(folio);
>   		folio_put(folio);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2608c40dffad..55face4024f2 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1452,6 +1452,22 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
>   		mlock_vma_folio(folio, vma);
>   }
>   
> +int __folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
> +		int nr_pages, struct vm_area_struct *vma)
> +{
> +	int nr, nr_pmdmapped = 0;
> +
> +	VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
> +
> +	nr = __folio_add_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE,
> +			      &nr_pmdmapped);
> +
> +	/* See comments in folio_add_anon_rmap_*() */
> +	if (!folio_test_large(folio))
> +		mlock_vma_folio(folio, vma);
> +
> +	return nr;
> +}

I'm not really a fan :/ It does make the code more complicated, and it 
will be harder to extend if we decide to ever account differently (e.g., 
NR_SHMEM_MAPPED, additional tracking for mTHP etc).

With large folios we'll be naturally batching already here, and I do 
wonder if this is really worth it for performance, or if we could find 
another way of batching (let the caller activate batching and drain 
afterwards) without exposing these details to the caller.

Note that there is another cleanup happening [1].

[1] 
https://lore.kernel.org/all/20240506192924.271999-1-yosryahmed@google.com/T/#u

-- 
Cheers,

David / dhildenb



* Re: [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range()
  2024-05-07 11:11   ` David Hildenbrand
@ 2024-05-07 13:12     ` Kefeng Wang
  2024-05-08  9:33       ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Kefeng Wang @ 2024-05-07 13:12 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel



On 2024/5/7 19:11, David Hildenbrand wrote:
> On 29.04.24 09:24, Kefeng Wang wrote:
>> Adding __folio_add_file_rmap_ptes() which don't update lruvec stat, it
>> is used in filemap_set_pte_range(), with it, lruvec stat updating is
>> moved into the caller, no functional changes.
>>
>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>> ---
>>   include/linux/rmap.h |  2 ++
>>   mm/filemap.c         | 27 ++++++++++++++++++---------
>>   mm/rmap.c            | 16 ++++++++++++++++
>>   3 files changed, 36 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 7229b9baf20d..43014ddd06f9 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -242,6 +242,8 @@ void folio_add_anon_rmap_pmd(struct folio *, 
>> struct page *,
>>           struct vm_area_struct *, unsigned long address, rmap_t flags);
>>   void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>           unsigned long address);
>> +int __folio_add_file_rmap_ptes(struct folio *, struct page *, int 
>> nr_pages,
>> +        struct vm_area_struct *);
>>   void folio_add_file_rmap_ptes(struct folio *, struct page *, int 
>> nr_pages,
>>           struct vm_area_struct *);
>>   #define folio_add_file_rmap_pte(folio, page, vma) \
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7019692daddd..3966b6616d02 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3501,14 +3501,15 @@ static struct folio 
>> *next_uptodate_folio(struct xa_state *xas,
>>   static void filemap_set_pte_range(struct vm_fault *vmf, struct folio 
>> *folio,
>>               struct page *page, unsigned int nr, unsigned long addr,
>> -            unsigned long *rss)
>> +            unsigned long *rss, int *nr_mapped)
>>   {
>>       struct vm_area_struct *vma = vmf->vma;
>>       pte_t entry;
>>       entry = prepare_range_pte_entry(vmf, false, folio, page, nr, addr);
>> -    folio_add_file_rmap_ptes(folio, page, nr, vma);
>> +    *nr_mapped += __folio_add_file_rmap_ptes(folio, page, nr, vma);
>> +
>>       set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>>       /* no need to invalidate: a not-present page won't be cached */
>> @@ -3525,7 +3526,8 @@ static void filemap_set_pte_range(struct 
>> vm_fault *vmf, struct folio *folio,
>>   static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>               struct folio *folio, unsigned long start,
>>               unsigned long addr, unsigned int nr_pages,
>> -            unsigned long *rss, unsigned int *mmap_miss)
>> +            unsigned long *rss, int *nr_mapped,
>> +            unsigned int *mmap_miss)
>>   {
>>       vm_fault_t ret = 0;
>>       struct page *page = folio_page(folio, start);
>> @@ -3558,7 +3560,8 @@ static vm_fault_t filemap_map_folio_range(struct 
>> vm_fault *vmf,
>>           continue;
>>   skip:
>>           if (count) {
>> -            filemap_set_pte_range(vmf, folio, page, count, addr, rss);
>> +            filemap_set_pte_range(vmf, folio, page, count, addr,
>> +                          rss, nr_mapped);
>>               if (in_range(vmf->address, addr, count * PAGE_SIZE))
>>                   ret = VM_FAULT_NOPAGE;
>>           }
>> @@ -3571,7 +3574,8 @@ static vm_fault_t filemap_map_folio_range(struct 
>> vm_fault *vmf,
>>       } while (--nr_pages > 0);
>>       if (count) {
>> -        filemap_set_pte_range(vmf, folio, page, count, addr, rss);
>> +        filemap_set_pte_range(vmf, folio, page, count, addr, rss,
>> +                      nr_mapped);
>>           if (in_range(vmf->address, addr, count * PAGE_SIZE))
>>               ret = VM_FAULT_NOPAGE;
>>       }
>> @@ -3583,7 +3587,7 @@ static vm_fault_t filemap_map_folio_range(struct 
>> vm_fault *vmf,
>>   static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
>>           struct folio *folio, unsigned long addr,
>> -        unsigned long *rss, unsigned int *mmap_miss)
>> +        unsigned long *rss, int *nr_mapped, unsigned int *mmap_miss)
>>   {
>>       vm_fault_t ret = 0;
>>       struct page *page = &folio->page;
>> @@ -3606,7 +3610,7 @@ static vm_fault_t 
>> filemap_map_order0_folio(struct vm_fault *vmf,
>>       if (vmf->address == addr)
>>           ret = VM_FAULT_NOPAGE;
>> -    filemap_set_pte_range(vmf, folio, page, 1, addr, rss);
>> +    filemap_set_pte_range(vmf, folio, page, 1, addr, rss, nr_mapped);
>>       return ret;
>>   }
>> @@ -3646,6 +3650,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>>       folio_type = mm_counter_file(folio);
>>       do {
>>           unsigned long end;
>> +        int nr_mapped = 0;
>>           addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
>>           vmf->pte += xas.xa_index - last_pgoff;
>> @@ -3655,11 +3660,15 @@ vm_fault_t filemap_map_pages(struct vm_fault 
>> *vmf,
>>           if (!folio_test_large(folio))
>>               ret |= filemap_map_order0_folio(vmf,
>> -                    folio, addr, &rss, &mmap_miss);
>> +                    folio, addr, &rss, &nr_mapped,
>> +                    &mmap_miss);
>>           else
>>               ret |= filemap_map_folio_range(vmf, folio,
>>                       xas.xa_index - folio->index, addr,
>> -                    nr_pages, &rss, &mmap_miss);
>> +                    nr_pages, &rss, &nr_mapped,
>> +                    &mmap_miss);
>> +
>> +        __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr_mapped);
>>           folio_unlock(folio);
>>           folio_put(folio);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 2608c40dffad..55face4024f2 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1452,6 +1452,22 @@ static __always_inline void 
>> __folio_add_file_rmap(struct folio *folio,
>>           mlock_vma_folio(folio, vma);
>>   }
>> +int __folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
>> +        int nr_pages, struct vm_area_struct *vma)
>> +{
>> +    int nr, nr_pmdmapped = 0;
>> +
>> +    VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
>> +
>> +    nr = __folio_add_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE,
>> +                  &nr_pmdmapped);
>> +
>> +    /* See comments in folio_add_anon_rmap_*() */
>> +    if (!folio_test_large(folio))
>> +        mlock_vma_folio(folio, vma);
>> +
>> +    return nr;
>> +}
> 
> I'm not really a fan :/ It does make the code more complicated, and it 
> will be harder to extend if we decide to ever account differently (e.g., 
> NR_SHMEM_MAPPED, additional tracking for mTHP etc).

If more kinds of accounting are added, this may lead to bad scalability.
> 
> With large folios we'll be naturally batching already here, and I do 

Yes, it is batched with large folios, but our filesystems are
ext4/tmpfs, which either don't support large folios yet or whose
support is still being upstreamed.

> wonder, if this is really worth for performance, or if we could find 
> another way of batching (let the caller activate batching and drain 
> afterwards) without exposing these details to the caller.

Batching the lruvec stat updating does reduce latency even without
large folios, but I can't find a better way; otherwise let's wait for
large folio support on ext4/tmpfs. I also Cc'ed the memcg maintainers
on patch 4 to see if there are any other ideas.


> 
> Note that there is another cleanup happening [1].

That patch is about the RMAP_LEVEL_PMD level; it should not be involved
in filemap_map_pages().

> 
> [1] 
> https://lore.kernel.org/all/20240506192924.271999-1-yosryahmed@google.com/T/#u
>  


* Re: [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range()
  2024-05-07 13:12     ` Kefeng Wang
@ 2024-05-08  9:33       ` David Hildenbrand
  2024-05-08 11:15         ` Kefeng Wang
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2024-05-08  9:33 UTC (permalink / raw)
  To: Kefeng Wang, Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel

On 07.05.24 15:12, Kefeng Wang wrote:
> 
> 
> On 2024/5/7 19:11, David Hildenbrand wrote:
>> On 29.04.24 09:24, Kefeng Wang wrote:
>>> Adding __folio_add_file_rmap_ptes() which don't update lruvec stat, it
>>> is used in filemap_set_pte_range(), with it, lruvec stat updating is
>>> moved into the caller, no functional changes.
>>>
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>>    include/linux/rmap.h |  2 ++
>>>    mm/filemap.c         | 27 ++++++++++++++++++---------
>>>    mm/rmap.c            | 16 ++++++++++++++++
>>>    3 files changed, 36 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index 7229b9baf20d..43014ddd06f9 100644
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -242,6 +242,8 @@ void folio_add_anon_rmap_pmd(struct folio *,
>>> struct page *,
>>>            struct vm_area_struct *, unsigned long address, rmap_t flags);
>>>    void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>>            unsigned long address);
>>> +int __folio_add_file_rmap_ptes(struct folio *, struct page *, int
>>> nr_pages,
>>> +        struct vm_area_struct *);
>>>    void folio_add_file_rmap_ptes(struct folio *, struct page *, int
>>> nr_pages,
>>>            struct vm_area_struct *);
>>>    #define folio_add_file_rmap_pte(folio, page, vma) \
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 7019692daddd..3966b6616d02 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -3501,14 +3501,15 @@ static struct folio
>>> *next_uptodate_folio(struct xa_state *xas,
>>>    static void filemap_set_pte_range(struct vm_fault *vmf, struct folio
>>> *folio,
>>>                struct page *page, unsigned int nr, unsigned long addr,
>>> -            unsigned long *rss)
>>> +            unsigned long *rss, int *nr_mapped)
>>>    {
>>>        struct vm_area_struct *vma = vmf->vma;
>>>        pte_t entry;
>>>        entry = prepare_range_pte_entry(vmf, false, folio, page, nr, addr);
>>> -    folio_add_file_rmap_ptes(folio, page, nr, vma);
>>> +    *nr_mapped += __folio_add_file_rmap_ptes(folio, page, nr, vma);
>>> +
>>>        set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>>>        /* no need to invalidate: a not-present page won't be cached */
>>> @@ -3525,7 +3526,8 @@ static void filemap_set_pte_range(struct
>>> vm_fault *vmf, struct folio *folio,
>>>    static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>>                struct folio *folio, unsigned long start,
>>>                unsigned long addr, unsigned int nr_pages,
>>> -            unsigned long *rss, unsigned int *mmap_miss)
>>> +            unsigned long *rss, int *nr_mapped,
>>> +            unsigned int *mmap_miss)
>>>    {
>>>        vm_fault_t ret = 0;
>>>        struct page *page = folio_page(folio, start);
>>> @@ -3558,7 +3560,8 @@ static vm_fault_t filemap_map_folio_range(struct
>>> vm_fault *vmf,
>>>            continue;
>>>    skip:
>>>            if (count) {
>>> -            filemap_set_pte_range(vmf, folio, page, count, addr, rss);
>>> +            filemap_set_pte_range(vmf, folio, page, count, addr,
>>> +                          rss, nr_mapped);
>>>                if (in_range(vmf->address, addr, count * PAGE_SIZE))
>>>                    ret = VM_FAULT_NOPAGE;
>>>            }
>>> @@ -3571,7 +3574,8 @@ static vm_fault_t filemap_map_folio_range(struct
>>> vm_fault *vmf,
>>>        } while (--nr_pages > 0);
>>>        if (count) {
>>> -        filemap_set_pte_range(vmf, folio, page, count, addr, rss);
>>> +        filemap_set_pte_range(vmf, folio, page, count, addr, rss,
>>> +                      nr_mapped);
>>>            if (in_range(vmf->address, addr, count * PAGE_SIZE))
>>>                ret = VM_FAULT_NOPAGE;
>>>        }
>>> @@ -3583,7 +3587,7 @@ static vm_fault_t filemap_map_folio_range(struct
>>> vm_fault *vmf,
>>>    static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
>>>            struct folio *folio, unsigned long addr,
>>> -        unsigned long *rss, unsigned int *mmap_miss)
>>> +        unsigned long *rss, int *nr_mapped, unsigned int *mmap_miss)
>>>    {
>>>        vm_fault_t ret = 0;
>>>        struct page *page = &folio->page;
>>> @@ -3606,7 +3610,7 @@ static vm_fault_t
>>> filemap_map_order0_folio(struct vm_fault *vmf,
>>>        if (vmf->address == addr)
>>>            ret = VM_FAULT_NOPAGE;
>>> -    filemap_set_pte_range(vmf, folio, page, 1, addr, rss);
>>> +    filemap_set_pte_range(vmf, folio, page, 1, addr, rss, nr_mapped);
>>>        return ret;
>>>    }
>>> @@ -3646,6 +3650,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>>>        folio_type = mm_counter_file(folio);
>>>        do {
>>>            unsigned long end;
>>> +        int nr_mapped = 0;
>>>            addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
>>>            vmf->pte += xas.xa_index - last_pgoff;
>>> @@ -3655,11 +3660,15 @@ vm_fault_t filemap_map_pages(struct vm_fault
>>> *vmf,
>>>            if (!folio_test_large(folio))
>>>                ret |= filemap_map_order0_folio(vmf,
>>> -                    folio, addr, &rss, &mmap_miss);
>>> +                    folio, addr, &rss, &nr_mapped,
>>> +                    &mmap_miss);
>>>            else
>>>                ret |= filemap_map_folio_range(vmf, folio,
>>>                        xas.xa_index - folio->index, addr,
>>> -                    nr_pages, &rss, &mmap_miss);
>>> +                    nr_pages, &rss, &nr_mapped,
>>> +                    &mmap_miss);
>>> +
>>> +        __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr_mapped);
>>>            folio_unlock(folio);
>>>            folio_put(folio);
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 2608c40dffad..55face4024f2 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1452,6 +1452,22 @@ static __always_inline void
>>> __folio_add_file_rmap(struct folio *folio,
>>>            mlock_vma_folio(folio, vma);
>>>    }
>>> +int __folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
>>> +        int nr_pages, struct vm_area_struct *vma)
>>> +{
>>> +    int nr, nr_pmdmapped = 0;
>>> +
>>> +    VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
>>> +
>>> +    nr = __folio_add_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE,
>>> +                  &nr_pmdmapped);
>>> +
>>> +    /* See comments in folio_add_anon_rmap_*() */
>>> +    if (!folio_test_large(folio))
>>> +        mlock_vma_folio(folio, vma);
>>> +
>>> +    return nr;
>>> +}
>>
>> I'm not really a fan :/ It does make the code more complicated, and it
>> will be harder to extend if we decide to ever account differently (e.g.,
>> NR_SHMEM_MAPPED, additional tracking for mTHP etc).
> 
> If more different accounts, this may lead to bad scalability.

We already do it for PMD mappings.

>>
>> With large folios we'll be naturally batching already here, and I do
> 
> Yes, it is batched with large folios,but our fs is ext4/tmpfs, there
> are not support large folio or still upstreaming.

Okay, so that will be sorted out sooner or later.

> 
>> wonder, if this is really worth for performance, or if we could find
>> another way of batching (let the caller activate batching and drain
>> afterwards) without exposing these details to the caller.
> 
> It does reduce latency when batch lruvec stat updating without large
> folio, but I can't find better way, or let's wait for the large folio
> support on ext4/tmpfs, I also Cced memcg maintainers in patch4 to see if
> there are any other ideas.

I'm not convinced this benefit here is worth making the code more 
complicated.

Maybe we can find another way to optimize this batching in rmap code 
without having to leak these details to the callers.

For example, we could pass an optional batching structure to all rmap 
add/rel functions that would collect these stat updates. Then we could 
have one function to flush it and update the counters combined.

Such batching could be beneficial also for page unmapping/zapping where 
we might unmap various different folios in one go.
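
Just to illustrate what I mean (completely untested, all names made
up), something along these lines:

	/* collects the stat deltas for one (memcg, pgdat) at a time */
	struct rmap_stat_batch {
		struct mem_cgroup *memcg;
		pg_data_t *pgdat;
		int nr_file_mapped;
	};

	static void rmap_stat_batch_flush(struct rmap_stat_batch *batch)
	{
		if (!batch->nr_file_mapped)
			return;
		if (batch->memcg)
			__mod_lruvec_state(mem_cgroup_lruvec(batch->memcg, batch->pgdat),
					   NR_FILE_MAPPED, batch->nr_file_mapped);
		else
			__mod_node_page_state(batch->pgdat, NR_FILE_MAPPED,
					      batch->nr_file_mapped);
		batch->nr_file_mapped = 0;
	}

	/*
	 * Called from the rmap add/rel functions when a batch is supplied:
	 * flush when switching to a different memcg/pgdat, otherwise only
	 * accumulate the delta instead of updating the counters directly.
	 */
	static void rmap_stat_batch_add(struct rmap_stat_batch *batch,
					struct folio *folio, int nr)
	{
		if (batch->memcg != folio_memcg(folio) ||
		    batch->pgdat != folio_pgdat(folio)) {
			rmap_stat_batch_flush(batch);
			batch->memcg = folio_memcg(folio);
			batch->pgdat = folio_pgdat(folio);
		}
		batch->nr_file_mapped += nr;
	}

The caller (e.g., filemap_map_pages() or the zap path) would set up one
such batch and do a final flush when it's done, so the memcg/pgdat
details stay inside the rmap code.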

-- 
Cheers,

David / dhildenb



* Re: [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range()
  2024-05-08  9:33       ` David Hildenbrand
@ 2024-05-08 11:15         ` Kefeng Wang
  2024-05-08 11:27           ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Kefeng Wang @ 2024-05-08 11:15 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel



On 2024/5/8 17:33, David Hildenbrand wrote:
> On 07.05.24 15:12, Kefeng Wang wrote:
>>
>>
>> On 2024/5/7 19:11, David Hildenbrand wrote:
>>> On 29.04.24 09:24, Kefeng Wang wrote:
>>>> Adding __folio_add_file_rmap_ptes() which don't update lruvec stat, it
>>>> is used in filemap_set_pte_range(), with it, lruvec stat updating is
>>>> moved into the caller, no functional changes.
>>>>
>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>> ---
>>>>    include/linux/rmap.h |  2 ++
>>>>    mm/filemap.c         | 27 ++++++++++++++++++---------
>>>>    mm/rmap.c            | 16 ++++++++++++++++
>>>>    3 files changed, 36 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index 7229b9baf20d..43014ddd06f9 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -242,6 +242,8 @@ void folio_add_anon_rmap_pmd(struct folio *,
>>>> struct page *,
>>>>            struct vm_area_struct *, unsigned long address, rmap_t 
>>>> flags);
>>>>    void folio_add_new_anon_rmap(struct folio *, struct 
>>>> vm_area_struct *,
>>>>            unsigned long address);
>>>> +int __folio_add_file_rmap_ptes(struct folio *, struct page *, int
>>>> nr_pages,
>>>> +        struct vm_area_struct *);
>>>>    void folio_add_file_rmap_ptes(struct folio *, struct page *, int
>>>> nr_pages,
>>>>            struct vm_area_struct *);
>>>>    #define folio_add_file_rmap_pte(folio, page, vma) \
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index 7019692daddd..3966b6616d02 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -3501,14 +3501,15 @@ static struct folio
>>>> *next_uptodate_folio(struct xa_state *xas,
>>>>    static void filemap_set_pte_range(struct vm_fault *vmf, struct folio
>>>> *folio,
>>>>                struct page *page, unsigned int nr, unsigned long addr,
>>>> -            unsigned long *rss)
>>>> +            unsigned long *rss, int *nr_mapped)
>>>>    {
>>>>        struct vm_area_struct *vma = vmf->vma;
>>>>        pte_t entry;
>>>>        entry = prepare_range_pte_entry(vmf, false, folio, page, nr, 
>>>> addr);
>>>> -    folio_add_file_rmap_ptes(folio, page, nr, vma);
>>>> +    *nr_mapped += __folio_add_file_rmap_ptes(folio, page, nr, vma);
>>>> +
>>>>        set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>>>>        /* no need to invalidate: a not-present page won't be cached */
>>>> @@ -3525,7 +3526,8 @@ static void filemap_set_pte_range(struct
>>>> vm_fault *vmf, struct folio *folio,
>>>>    static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>>>                struct folio *folio, unsigned long start,
>>>>                unsigned long addr, unsigned int nr_pages,
>>>> -            unsigned long *rss, unsigned int *mmap_miss)
>>>> +            unsigned long *rss, int *nr_mapped,
>>>> +            unsigned int *mmap_miss)
>>>>    {
>>>>        vm_fault_t ret = 0;
>>>>        struct page *page = folio_page(folio, start);
>>>> @@ -3558,7 +3560,8 @@ static vm_fault_t filemap_map_folio_range(struct
>>>> vm_fault *vmf,
>>>>            continue;
>>>>    skip:
>>>>            if (count) {
>>>> -            filemap_set_pte_range(vmf, folio, page, count, addr, rss);
>>>> +            filemap_set_pte_range(vmf, folio, page, count, addr,
>>>> +                          rss, nr_mapped);
>>>>                if (in_range(vmf->address, addr, count * PAGE_SIZE))
>>>>                    ret = VM_FAULT_NOPAGE;
>>>>            }
>>>> @@ -3571,7 +3574,8 @@ static vm_fault_t filemap_map_folio_range(struct
>>>> vm_fault *vmf,
>>>>        } while (--nr_pages > 0);
>>>>        if (count) {
>>>> -        filemap_set_pte_range(vmf, folio, page, count, addr, rss);
>>>> +        filemap_set_pte_range(vmf, folio, page, count, addr, rss,
>>>> +                      nr_mapped);
>>>>            if (in_range(vmf->address, addr, count * PAGE_SIZE))
>>>>                ret = VM_FAULT_NOPAGE;
>>>>        }
>>>> @@ -3583,7 +3587,7 @@ static vm_fault_t filemap_map_folio_range(struct
>>>> vm_fault *vmf,
>>>>    static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
>>>>            struct folio *folio, unsigned long addr,
>>>> -        unsigned long *rss, unsigned int *mmap_miss)
>>>> +        unsigned long *rss, int *nr_mapped, unsigned int *mmap_miss)
>>>>    {
>>>>        vm_fault_t ret = 0;
>>>>        struct page *page = &folio->page;
>>>> @@ -3606,7 +3610,7 @@ static vm_fault_t
>>>> filemap_map_order0_folio(struct vm_fault *vmf,
>>>>        if (vmf->address == addr)
>>>>            ret = VM_FAULT_NOPAGE;
>>>> -    filemap_set_pte_range(vmf, folio, page, 1, addr, rss);
>>>> +    filemap_set_pte_range(vmf, folio, page, 1, addr, rss, nr_mapped);
>>>>        return ret;
>>>>    }
>>>> @@ -3646,6 +3650,7 @@ vm_fault_t filemap_map_pages(struct vm_fault 
>>>> *vmf,
>>>>        folio_type = mm_counter_file(folio);
>>>>        do {
>>>>            unsigned long end;
>>>> +        int nr_mapped = 0;
>>>>            addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
>>>>            vmf->pte += xas.xa_index - last_pgoff;
>>>> @@ -3655,11 +3660,15 @@ vm_fault_t filemap_map_pages(struct vm_fault
>>>> *vmf,
>>>>            if (!folio_test_large(folio))
>>>>                ret |= filemap_map_order0_folio(vmf,
>>>> -                    folio, addr, &rss, &mmap_miss);
>>>> +                    folio, addr, &rss, &nr_mapped,
>>>> +                    &mmap_miss);
>>>>            else
>>>>                ret |= filemap_map_folio_range(vmf, folio,
>>>>                        xas.xa_index - folio->index, addr,
>>>> -                    nr_pages, &rss, &mmap_miss);
>>>> +                    nr_pages, &rss, &nr_mapped,
>>>> +                    &mmap_miss);
>>>> +
>>>> +        __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr_mapped);
>>>>            folio_unlock(folio);
>>>>            folio_put(folio);
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 2608c40dffad..55face4024f2 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -1452,6 +1452,22 @@ static __always_inline void
>>>> __folio_add_file_rmap(struct folio *folio,
>>>>            mlock_vma_folio(folio, vma);
>>>>    }
>>>> +int __folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
>>>> +        int nr_pages, struct vm_area_struct *vma)
>>>> +{
>>>> +    int nr, nr_pmdmapped = 0;
>>>> +
>>>> +    VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
>>>> +
>>>> +    nr = __folio_add_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE,
>>>> +                  &nr_pmdmapped);
>>>> +
>>>> +    /* See comments in folio_add_anon_rmap_*() */
>>>> +    if (!folio_test_large(folio))
>>>> +        mlock_vma_folio(folio, vma);
>>>> +
>>>> +    return nr;
>>>> +}
>>>
>>> I'm not really a fan :/ It does make the code more complicated, and it
>>> will be harder to extend if we decide to ever account differently (e.g.,
>>> NR_SHMEM_MAPPED, additional tracking for mTHP etc).
>>
>> If more different accounts, this may lead to bad scalability.
> 
> We already do it for PMD mappings.
> 
>>>
>>> With large folios we'll be naturally batching already here, and I do
>>
>> Yes, it is batched with large folios,but our fs is ext4/tmpfs, there
>> are not support large folio or still upstreaming.
> 
> Okay, so that will be sorted out sooner or later.
> 
>>
>>> wonder, if this is really worth for performance, or if we could find
>>> another way of batching (let the caller activate batching and drain
>>> afterwards) without exposing these details to the caller.
>>
>> It does reduce latency when batch lruvec stat updating without large
>> folio, but I can't find better way, or let's wait for the large folio
>> support on ext4/tmpfs, I also Cced memcg maintainers in patch4 to see if
>> there are any other ideas.
> 
> I'm not convinced this benefit here is worth making the code more 
> complicated.
> 
> Maybe we can find another way to optimize this batching in rmap code 
> without having to leak these details to the callers.
> 
> For example, we could pass an optional batching structure to all rmap 
> add/rel functions that would collect these stat updates. Then we could 
> have one function to flush it and update the counters combined.
> 
> Such batching could be beneficial also for page unmapping/zapping where 
> we might unmap various different folios in one go.

That sounds better and clearer; I will try it and see the results.
Thanks for your advice!

> 


* Re: [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range()
  2024-05-08 11:15         ` Kefeng Wang
@ 2024-05-08 11:27           ` David Hildenbrand
  2024-05-08 13:56             ` Kefeng Wang
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2024-05-08 11:27 UTC (permalink / raw)
  To: Kefeng Wang, Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel

On 08.05.24 13:15, Kefeng Wang wrote:
> 
> 
> On 2024/5/8 17:33, David Hildenbrand wrote:
>> On 07.05.24 15:12, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/5/7 19:11, David Hildenbrand wrote:
>>>> On 29.04.24 09:24, Kefeng Wang wrote:
>>>>> Adding __folio_add_file_rmap_ptes() which don't update lruvec stat, it
>>>>> is used in filemap_set_pte_range(), with it, lruvec stat updating is
>>>>> moved into the caller, no functional changes.
>>>>>
>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> ---
>>>>>     include/linux/rmap.h |  2 ++
>>>>>     mm/filemap.c         | 27 ++++++++++++++++++---------
>>>>>     mm/rmap.c            | 16 ++++++++++++++++
>>>>>     3 files changed, 36 insertions(+), 9 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>>> index 7229b9baf20d..43014ddd06f9 100644
>>>>> --- a/include/linux/rmap.h
>>>>> +++ b/include/linux/rmap.h
>>>>> @@ -242,6 +242,8 @@ void folio_add_anon_rmap_pmd(struct folio *,
>>>>> struct page *,
>>>>>             struct vm_area_struct *, unsigned long address, rmap_t
>>>>> flags);
>>>>>     void folio_add_new_anon_rmap(struct folio *, struct
>>>>> vm_area_struct *,
>>>>>             unsigned long address);
>>>>> +int __folio_add_file_rmap_ptes(struct folio *, struct page *, int
>>>>> nr_pages,
>>>>> +        struct vm_area_struct *);
>>>>>     void folio_add_file_rmap_ptes(struct folio *, struct page *, int
>>>>> nr_pages,
>>>>>             struct vm_area_struct *);
>>>>>     #define folio_add_file_rmap_pte(folio, page, vma) \
>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>> index 7019692daddd..3966b6616d02 100644
>>>>> --- a/mm/filemap.c
>>>>> +++ b/mm/filemap.c
>>>>> @@ -3501,14 +3501,15 @@ static struct folio
>>>>> *next_uptodate_folio(struct xa_state *xas,
>>>>>     static void filemap_set_pte_range(struct vm_fault *vmf, struct folio
>>>>> *folio,
>>>>>                 struct page *page, unsigned int nr, unsigned long addr,
>>>>> -            unsigned long *rss)
>>>>> +            unsigned long *rss, int *nr_mapped)
>>>>>     {
>>>>>         struct vm_area_struct *vma = vmf->vma;
>>>>>         pte_t entry;
>>>>>         entry = prepare_range_pte_entry(vmf, false, folio, page, nr,
>>>>> addr);
>>>>> -    folio_add_file_rmap_ptes(folio, page, nr, vma);
>>>>> +    *nr_mapped += __folio_add_file_rmap_ptes(folio, page, nr, vma);
>>>>> +
>>>>>         set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>>>>>         /* no need to invalidate: a not-present page won't be cached */
>>>>> @@ -3525,7 +3526,8 @@ static void filemap_set_pte_range(struct
>>>>> vm_fault *vmf, struct folio *folio,
>>>>>     static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>>>>                 struct folio *folio, unsigned long start,
>>>>>                 unsigned long addr, unsigned int nr_pages,
>>>>> -            unsigned long *rss, unsigned int *mmap_miss)
>>>>> +            unsigned long *rss, int *nr_mapped,
>>>>> +            unsigned int *mmap_miss)
>>>>>     {
>>>>>         vm_fault_t ret = 0;
>>>>>         struct page *page = folio_page(folio, start);
>>>>> @@ -3558,7 +3560,8 @@ static vm_fault_t filemap_map_folio_range(struct
>>>>> vm_fault *vmf,
>>>>>             continue;
>>>>>     skip:
>>>>>             if (count) {
>>>>> -            filemap_set_pte_range(vmf, folio, page, count, addr, rss);
>>>>> +            filemap_set_pte_range(vmf, folio, page, count, addr,
>>>>> +                          rss, nr_mapped);
>>>>>                 if (in_range(vmf->address, addr, count * PAGE_SIZE))
>>>>>                     ret = VM_FAULT_NOPAGE;
>>>>>             }
>>>>> @@ -3571,7 +3574,8 @@ static vm_fault_t filemap_map_folio_range(struct
>>>>> vm_fault *vmf,
>>>>>         } while (--nr_pages > 0);
>>>>>         if (count) {
>>>>> -        filemap_set_pte_range(vmf, folio, page, count, addr, rss);
>>>>> +        filemap_set_pte_range(vmf, folio, page, count, addr, rss,
>>>>> +                      nr_mapped);
>>>>>             if (in_range(vmf->address, addr, count * PAGE_SIZE))
>>>>>                 ret = VM_FAULT_NOPAGE;
>>>>>         }
>>>>> @@ -3583,7 +3587,7 @@ static vm_fault_t filemap_map_folio_range(struct
>>>>> vm_fault *vmf,
>>>>>     static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
>>>>>             struct folio *folio, unsigned long addr,
>>>>> -        unsigned long *rss, unsigned int *mmap_miss)
>>>>> +        unsigned long *rss, int *nr_mapped, unsigned int *mmap_miss)
>>>>>     {
>>>>>         vm_fault_t ret = 0;
>>>>>         struct page *page = &folio->page;
>>>>> @@ -3606,7 +3610,7 @@ static vm_fault_t
>>>>> filemap_map_order0_folio(struct vm_fault *vmf,
>>>>>         if (vmf->address == addr)
>>>>>             ret = VM_FAULT_NOPAGE;
>>>>> -    filemap_set_pte_range(vmf, folio, page, 1, addr, rss);
>>>>> +    filemap_set_pte_range(vmf, folio, page, 1, addr, rss, nr_mapped);
>>>>>         return ret;
>>>>>     }
>>>>> @@ -3646,6 +3650,7 @@ vm_fault_t filemap_map_pages(struct vm_fault
>>>>> *vmf,
>>>>>         folio_type = mm_counter_file(folio);
>>>>>         do {
>>>>>             unsigned long end;
>>>>> +        int nr_mapped = 0;
>>>>>             addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
>>>>>             vmf->pte += xas.xa_index - last_pgoff;
>>>>> @@ -3655,11 +3660,15 @@ vm_fault_t filemap_map_pages(struct vm_fault
>>>>> *vmf,
>>>>>             if (!folio_test_large(folio))
>>>>>                 ret |= filemap_map_order0_folio(vmf,
>>>>> -                    folio, addr, &rss, &mmap_miss);
>>>>> +                    folio, addr, &rss, &nr_mapped,
>>>>> +                    &mmap_miss);
>>>>>             else
>>>>>                 ret |= filemap_map_folio_range(vmf, folio,
>>>>>                         xas.xa_index - folio->index, addr,
>>>>> -                    nr_pages, &rss, &mmap_miss);
>>>>> +                    nr_pages, &rss, &nr_mapped,
>>>>> +                    &mmap_miss);
>>>>> +
>>>>> +        __lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr_mapped);
>>>>>             folio_unlock(folio);
>>>>>             folio_put(folio);
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index 2608c40dffad..55face4024f2 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1452,6 +1452,22 @@ static __always_inline void
>>>>> __folio_add_file_rmap(struct folio *folio,
>>>>>             mlock_vma_folio(folio, vma);
>>>>>     }
>>>>> +int __folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
>>>>> +        int nr_pages, struct vm_area_struct *vma)
>>>>> +{
>>>>> +    int nr, nr_pmdmapped = 0;
>>>>> +
>>>>> +    VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
>>>>> +
>>>>> +    nr = __folio_add_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE,
>>>>> +                  &nr_pmdmapped);
>>>>> +
>>>>> +    /* See comments in folio_add_anon_rmap_*() */
>>>>> +    if (!folio_test_large(folio))
>>>>> +        mlock_vma_folio(folio, vma);
>>>>> +
>>>>> +    return nr;
>>>>> +}
>>>>
>>>> I'm not really a fan :/ It does make the code more complicated, and it
>>>> will be harder to extend if we decide to ever account differently (e.g.,
>>>> NR_SHMEM_MAPPED, additional tracking for mTHP etc).
>>>
>>> If we ever need more kinds of accounting, this approach may not scale well.
>>
>> We already do it for PMD mappings.
>>
>>>>
>>>> With large folios we'll be naturally batching already here, and I do
>>>
>>> Yes, it is batched with large folios, but our filesystems are ext4/tmpfs,
>>> which either don't support large folios yet or the support is still being
>>> upstreamed.
>>
>> Okay, so that will be sorted out sooner or later.
>>
>>>
>>>> wonder if this is really worth it for performance, or if we could find
>>>> another way of batching (let the caller activate batching and drain
>>>> afterwards) without exposing these details to the caller.
>>>
>>> It does reduce latency when batching the lruvec stat updates without large
>>> folios, but I can't find a better way; alternatively, let's wait for large
>>> folio support on ext4/tmpfs. I also Cc'd the memcg maintainers on patch 4 to
>>> see if there are any other ideas.
>>
>> I'm not convinced this benefit here is worth making the code more
>> complicated.
>>
>> Maybe we can find another way to optimize this batching in rmap code
>> without having to leak these details to the callers.
>>
>> For example, we could pass an optional batching structure to all rmap
>> add/rel functions that would collect these stat updates. Then we could
>> have one function to flush it and update the counters combined.
>>
>> Such batching could be beneficial also for page unmapping/zapping where
>> we might unmap various different folios in one go.
> 
> It sounds better and clearer, I will try it and see the results; thanks
> for your advice!

To batch across multiple folios, it might be sufficient to remember in 
the batching structure for now:

* folio_memcg(folio)
* folio_pgdat(folio)
* NR_ANON_MAPPED diff
* NR_FILE_MAPPED diff

If the memcg or pgdat changes, we simply flush. Otherwise we batch
and the caller flushes.

Likely, when we map/unmap multiple folios they belong to the same
pgdat+memcg.

The only tricky bit is the rcu_read_lock() around folio_memcg(); that 
might require some thought.
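
Just to illustrate the idea, a completely untested sketch (the names
rmap_stat_batch, rmap_stat_batch_add() and rmap_stat_batch_flush() are
made up here, not existing interfaces, and it ignores the folio_memcg()
stability question above):

struct rmap_stat_batch {
        struct mem_cgroup *memcg;       /* folio_memcg() of the batched folios */
        pg_data_t *pgdat;               /* folio_pgdat() of the batched folios */
        int nr_anon_mapped;             /* pending NR_ANON_MAPPED delta */
        int nr_file_mapped;             /* pending NR_FILE_MAPPED delta */
};

static void rmap_stat_batch_flush(struct rmap_stat_batch *batch)
{
        if (batch->memcg) {
                struct lruvec *lruvec;

                lruvec = mem_cgroup_lruvec(batch->memcg, batch->pgdat);
                if (batch->nr_anon_mapped)
                        __mod_lruvec_state(lruvec, NR_ANON_MAPPED,
                                           batch->nr_anon_mapped);
                if (batch->nr_file_mapped)
                        __mod_lruvec_state(lruvec, NR_FILE_MAPPED,
                                           batch->nr_file_mapped);
        } else {
                if (batch->nr_anon_mapped)
                        __mod_node_page_state(batch->pgdat, NR_ANON_MAPPED,
                                              batch->nr_anon_mapped);
                if (batch->nr_file_mapped)
                        __mod_node_page_state(batch->pgdat, NR_FILE_MAPPED,
                                              batch->nr_file_mapped);
        }
        batch->nr_anon_mapped = 0;
        batch->nr_file_mapped = 0;
}

/* Called from the rmap add/remove paths instead of updating stats there. */
static void rmap_stat_batch_add(struct rmap_stat_batch *batch,
                                struct folio *folio,
                                enum node_stat_item idx, int nr)
{
        struct mem_cgroup *memcg = folio_memcg(folio);
        pg_data_t *pgdat = folio_pgdat(folio);

        /* Folio belongs to a different memcg or pgdat? Flush first. */
        if (memcg != batch->memcg || pgdat != batch->pgdat) {
                rmap_stat_batch_flush(batch);
                batch->memcg = memcg;
                batch->pgdat = pgdat;
        }

        if (idx == NR_ANON_MAPPED)
                batch->nr_anon_mapped += nr;
        else if (idx == NR_FILE_MAPPED)
                batch->nr_file_mapped += nr;
}

The caller (filemap_map_pages(), the zap path, ...) would keep such a
batch on its stack and call rmap_stat_batch_flush() once at the end.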

Hm ....

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range()
  2024-05-08 11:27           ` David Hildenbrand
@ 2024-05-08 13:56             ` Kefeng Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Kefeng Wang @ 2024-05-08 13:56 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Matthew Wilcox (Oracle), linux-mm, linux-fsdevel



On 2024/5/8 19:27, David Hildenbrand wrote:
> 
> To batch across multiple folios, it might be sufficient to remember in 
> the batching structure for now:
> 
> * folio_memcg(folio)
> * folio_pgdat(folio)
> * NR_ANON_MAPPED diff
> * NR_FILE_MAPPED diff
> 
> If the memcg or pgdat changes, we simply flush. Otherwise we batch
> and the caller flushes.

For memcg == NULL, we still need to flush when the pgdat changes.

> 
> Likely, when we map/unmap multiple folios they belong to the same
> pgdat+memcg.

Yes, the batch should be suitable for mapping/unmapping.

> 
> The only tricky bit is the rcu_read_lock() around folio_memcg(); that 
> might require some thought.

Indeed, we need a stable folio/memcg binding.

Thanks.
> 
> Hm ....
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH rfc 4/4] mm: filemap: try to batch lruvec stat updating
  2024-05-07  9:06   ` Kefeng Wang
@ 2024-05-09 14:01     ` Johannes Weiner
  2024-05-10  1:55       ` Kefeng Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2024-05-09 14:01 UTC (permalink / raw)
  To: Kefeng Wang
  Cc: Andrew Morton, David Hildenbrand, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Matthew Wilcox (Oracle),
	linux-mm, linux-fsdevel, Muchun Song

On Tue, May 07, 2024 at 05:06:57PM +0800, Kefeng Wang wrote:
> > +static void filemap_lruvec_stat_update(struct mem_cgroup *memcg,
> > +				       pg_data_t *pgdat, int nr)
> > +{
> > +	struct lruvec *lruvec;
> > +
> > +	if (!memcg) {
> > +		__mod_node_page_state(pgdat, NR_FILE_MAPPED, nr);
> > +		return;
> > +	}
> > +
> > +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > +	__mod_lruvec_state(lruvec, NR_FILE_MAPPED, nr);
> > +}
> > +
> >   vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> >   			     pgoff_t start_pgoff, pgoff_t end_pgoff)
> >   {
> > @@ -3628,6 +3642,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> >   	vm_fault_t ret = 0;
> >   	unsigned long rss = 0;
> >   	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
> > +	struct mem_cgroup *memcg, *memcg_cur;
> > +	pg_data_t *pgdat, *pgdat_cur;
> > +	int nr_mapped = 0;
> >   
> >   	rcu_read_lock();
> >   	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> > @@ -3648,9 +3665,20 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> >   	}
> >   
> >   	folio_type = mm_counter_file(folio);
> > +	memcg = folio_memcg(folio);
> > +	pgdat = folio_pgdat(folio);

You should be able to do:

	lruvec = folio_lruvec(folio);

and then pass that directly to filemap_lruvec_stat_update().
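
For example, something along these lines (untested, only to illustrate
the shape of it):

static void filemap_lruvec_stat_update(struct lruvec *lruvec, int nr)
{
        __mod_lruvec_state(lruvec, NR_FILE_MAPPED, nr);
}

with the caller doing:

        struct lruvec *lruvec = folio_lruvec(folio);
        ...
        filemap_lruvec_stat_update(lruvec, nr_mapped);

at which point the helper collapses to a single __mod_lruvec_state()
call and could arguably be open-coded.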

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH rfc 4/4] mm: filemap: try to batch lruvec stat updating
  2024-05-09 14:01     ` Johannes Weiner
@ 2024-05-10  1:55       ` Kefeng Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Kefeng Wang @ 2024-05-10  1:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Hildenbrand, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Matthew Wilcox (Oracle),
	linux-mm, linux-fsdevel, Muchun Song



On 2024/5/9 22:01, Johannes Weiner wrote:
> On Tue, May 07, 2024 at 05:06:57PM +0800, Kefeng Wang wrote:
>>> +static void filemap_lruvec_stat_update(struct mem_cgroup *memcg,
>>> +				       pg_data_t *pgdat, int nr)
>>> +{
>>> +	struct lruvec *lruvec;
>>> +
>>> +	if (!memcg) {
>>> +		__mod_node_page_state(pgdat, NR_FILE_MAPPED, nr);
>>> +		return;
>>> +	}
>>> +
>>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>> +	__mod_lruvec_state(lruvec, NR_FILE_MAPPED, nr);
>>> +}
>>> +
>>>    vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>>>    			     pgoff_t start_pgoff, pgoff_t end_pgoff)
>>>    {
>>> @@ -3628,6 +3642,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>>>    	vm_fault_t ret = 0;
>>>    	unsigned long rss = 0;
>>>    	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
>>> +	struct mem_cgroup *memcg, *memcg_cur;
>>> +	pg_data_t *pgdat, *pgdat_cur;
>>> +	int nr_mapped = 0;
>>>    
>>>    	rcu_read_lock();
>>>    	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
>>> @@ -3648,9 +3665,20 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>>>    	}
>>>    
>>>    	folio_type = mm_counter_file(folio);
>>> +	memcg = folio_memcg(folio);
>>> +	pgdat = folio_pgdat(folio);
> 
> You should be able to do:
> 
> 	lruvec = folio_lruvec(folio);
> 
> and then pass that directly to filemap_lruvec_stat_update().

It's obviously better; I will update this and also address David's comment on
patch 3, thank you.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-05-10  1:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-29  7:24 [PATCH rfc 0/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
2024-04-29  7:24 ` [PATCH rfc 1/4] mm: memory: add prepare_range_pte_entry() Kefeng Wang
2024-04-29  7:24 ` [PATCH rfc 2/4] mm: filemap: add filemap_set_pte_range() Kefeng Wang
2024-04-29  7:24 ` [PATCH rfc 3/4] mm: filemap: move __lruvec_stat_mod_folio() out of filemap_set_pte_range() Kefeng Wang
2024-05-07 11:11   ` David Hildenbrand
2024-05-07 13:12     ` Kefeng Wang
2024-05-08  9:33       ` David Hildenbrand
2024-05-08 11:15         ` Kefeng Wang
2024-05-08 11:27           ` David Hildenbrand
2024-05-08 13:56             ` Kefeng Wang
2024-04-29  7:24 ` [PATCH rfc 4/4] mm: filemap: try to batch lruvec stat updating Kefeng Wang
2024-05-07  9:06   ` Kefeng Wang
2024-05-09 14:01     ` Johannes Weiner
2024-05-10  1:55       ` Kefeng Wang
