* [RFC PATCH v3 0/4] folio based filemap_map_pages()
@ 2023-02-03 13:16 Yin Fengwei
  2023-02-03 13:16 ` [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Yin Fengwei @ 2023-02-03 13:16 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

Currently, filemap_map_pages() works at page granularity even when
the underlying folio is a large folio. Making it work at folio
granularity allows batched refcount, rmap and mm counter updates,
which brings a performance gain.
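
Roughly speaking (an illustrative sketch, not code taken from the
patches; assume a large folio mapped by nr ptes in a file-backed VMA),
the per-page path does nr separate updates:

	for (i = 0; i < nr; i++) {
		folio_ref_inc(folio);
		page_add_file_rmap(folio_page(folio, i), vma, false);
		inc_mm_counter(vma->vm_mm, mm_counter_file(&folio->page));
	}

while the folio based path folds them into one update each:

	folio_ref_add(folio, nr);
	folio_add_file_rmap_range(folio, 0, nr, vma, false);
	add_mm_counter(vma->vm_mm, mm_counter_file(&folio->page), nr);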

This series brings batched refcount, rmap and mm counter updates
to filemap_map_pages(). Testing with a micro-benchmark like
will-it-scale.pagefault on a 48C/96T IceLake test box showed:
   - batched rmap brings around 15% performance gain
   - batched refcount brings around 2% performance gain

Patch 1 updates filemap_map_pages() to map at folio granularity
        with batched refcount updates.

Patches 2, 3 and 4 enable batched rmap and mm counter updates.


Changes from v2:
  - Drop patch 1 of v2 because it missed ->page_mkwrite(), as Kirill
    pointed out.

  Patch 2:
    - Change page_add_file_rmap_range() to folio_add_file_rmap_range()
      as Matthew suggested
    - Remove "sub" as Matthew suggested

  Patch 3:
    - Only handle !cow case in do_set_pte_range() as David pointed out
    - Add a pte parameter and avoid changing vmf->pte in do_set_pte_range()
    - Drop do_set_pte_entry()

  Patch 4:
  - Adopt Matthew's suggestion to avoid subtracting from/adding to
    vmf->pte. filemap_map_folio_range() no longer updates vmf->pte,
    which makes it fit filemap_map_pages() more easily and removes
    the possibility of pointing to the wrong page table.

Changes from v1:
  - Update the struct page * parameter of *_range() to index
    in the folio as Matthew suggested
  - Fix indentation problems as Matthew pointed out
  - Add back the function comment as Matthew pointed out
  - Use nr_pages rather than len as Matthew pointed out
  - Add do_set_pte_range() as Matthew suggested
  - Add function comment as Ying suggested
  - Add performance test result to patch 1/2/5 commit message

  Patch 1:
    - Adapt commit message as Matthew suggested
    - Add Reviewed-by from Matthew
  Patch 3:
    - Restore general logic of page_add_file_rmap_range() to
      make patch review easier as Matthew suggested
  Patch 5:
    - Add the collected perf data to understand the reason for the
      performance gain

Yin Fengwei (4):
  filemap: add function filemap_map_folio_range()
  rmap: add folio_add_file_rmap_range()
  mm: add do_set_pte_range()
  filemap: batched update mm counter,rmap when map file folio

 include/linux/mm.h   |   3 ++
 include/linux/rmap.h |   2 +
 mm/filemap.c         | 101 ++++++++++++++++++++++++++++---------------
 mm/memory.c          |  59 ++++++++++++++++---------
 mm/rmap.c            |  66 ++++++++++++++++++++++------
 5 files changed, 163 insertions(+), 68 deletions(-)

-- 
2.30.2




* [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range()
  2023-02-03 13:16 [RFC PATCH v3 0/4] folio based filemap_map_pages() Yin Fengwei
@ 2023-02-03 13:16 ` Yin Fengwei
  2023-02-03 13:53   ` Matthew Wilcox
  2023-02-03 14:17   ` Kirill A. Shutemov
  2023-02-03 13:16 ` [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range() Yin Fengwei
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Yin Fengwei @ 2023-02-03 13:16 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

filemap_map_folio_range() maps a partial or full folio. Compared to
the original filemap_map_pages(), it batches the refcount updates and
gets a minor performance improvement for large folios.

A self-cooked will-it-scale.page_fault3-like app (with the file write
fault changed to a read fault) on an xfs filesystem got a 2%
performance gain.
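
As a worked example of the range calculation in the updated
filemap_map_pages() below (made-up numbers):

	/*
	 * folio->index = 64, folio_nr_pages(folio) = 16, so the folio
	 * covers file pages 64..79; xas.xa_index = 66, end_pgoff = 100:
	 *
	 *   end      = 64 + 16 - 1           = 79
	 *   nr_pages = min(79, 100) - 66 + 1 = 14
	 *
	 * One filemap_map_folio_range() call then maps file pages
	 * 66..79, i.e. pages 2..15 of the folio
	 * (start = xas.xa_index - folio->index = 2).
	 */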

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/filemap.c | 88 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 54 insertions(+), 34 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 992554c18f1f..f444684db9f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3351,6 +3351,53 @@ static inline struct folio *next_map_page(struct address_space *mapping,
 				  mapping, xas, end_pgoff);
 }
 
+/*
+ * Map sub-pages range [start_page, start_page + nr_pages) of folio.
+ * start_page is gotten from start by folio_page(folio, start)
+ */
+static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
+			struct folio *folio, unsigned long start,
+			unsigned long addr, unsigned int nr_pages)
+{
+	vm_fault_t ret = 0;
+	struct vm_area_struct *vma = vmf->vma;
+	struct file *file = vma->vm_file;
+	struct page *page = folio_page(folio, start);
+	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+	unsigned int ref_count = 0, count = 0;
+
+	do {
+		if (PageHWPoison(page))
+			continue;
+
+		if (mmap_miss > 0)
+			mmap_miss--;
+
+		/*
+		 * NOTE: If there're PTE markers, we'll leave them to be
+		 * handled in the specific fault path, and it'll prohibit the
+		 * fault-around logic.
+		 */
+		if (!pte_none(*vmf->pte))
+			continue;
+
+		if (vmf->address == addr)
+			ret = VM_FAULT_NOPAGE;
+
+		ref_count++;
+		do_set_pte(vmf, page, addr);
+		update_mmu_cache(vma, addr, vmf->pte);
+	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+
+	/* Restore the vmf->pte */
+	vmf->pte -= nr_pages;
+
+	folio_ref_add(folio, ref_count);
+	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+
+	return ret;
+}
+
 vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			     pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -3361,9 +3408,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
-	struct page *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	int nr_pages = 0;
 
 	rcu_read_lock();
 	folio = first_map_page(mapping, &xas, end_pgoff);
@@ -3378,45 +3425,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
-again:
-		page = folio_file_page(folio, xas.xa_index);
-		if (PageHWPoison(page))
-			goto unlock;
-
-		if (mmap_miss > 0)
-			mmap_miss--;
+		unsigned long end;
 
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
+		end = folio->index + folio_nr_pages(folio) - 1;
+		nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
 
-		/*
-		 * NOTE: If there're PTE markers, we'll leave them to be
-		 * handled in the specific fault path, and it'll prohibit the
-		 * fault-around logic.
-		 */
-		if (!pte_none(*vmf->pte))
-			goto unlock;
-
-		/* We're about to handle the fault */
-		if (vmf->address == addr)
-			ret = VM_FAULT_NOPAGE;
+		ret |=	filemap_map_folio_range(vmf, folio,
+				xas.xa_index - folio->index, addr, nr_pages);
+		xas.xa_index = end;
 
-		do_set_pte(vmf, page, addr);
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, addr, vmf->pte);
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			folio_ref_inc(folio);
-			goto again;
-		}
-		folio_unlock(folio);
-		continue;
-unlock:
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			goto again;
-		}
 		folio_unlock(folio);
 		folio_put(folio);
 	} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
-- 
2.30.2




* [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range()
  2023-02-03 13:16 [RFC PATCH v3 0/4] folio based filemap_map_pages() Yin Fengwei
  2023-02-03 13:16 ` [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
@ 2023-02-03 13:16 ` Yin Fengwei
  2023-02-03 14:02   ` Matthew Wilcox
  2023-02-03 14:19   ` Kirill A. Shutemov
  2023-02-03 13:16 ` [RFC PATCH v3 3/4] mm: add do_set_pte_range() Yin Fengwei
  2023-02-03 13:16 ` [RFC PATCH v3 4/4] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
  3 siblings, 2 replies; 21+ messages in thread
From: Yin Fengwei @ 2023-02-03 13:16 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

folio_add_file_rmap_range() allows adding a pte mapping to a specific
range of a file folio. Compared to page_add_file_rmap(), it batches
the __lruvec_stat updates for large folios.
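
The intended use (a sketch only; the real caller is converted later in
this series) is one call per mapped range instead of one call per page:

	/* one rmap update for pages [start, start + nr) of the folio */
	folio_add_file_rmap_range(folio, start, nr, vma, false);

	/* instead of nr per-page updates */
	for (i = 0; i < nr; i++)
		page_add_file_rmap(folio_page(folio, start + i), vma, false);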

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/rmap.h |  2 ++
 mm/rmap.c            | 66 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a4570da03e58..974124b41fee 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -198,6 +198,8 @@ void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address);
 void page_add_file_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
+void folio_add_file_rmap_range(struct folio *, unsigned long start,
+		unsigned int nr_pages, struct vm_area_struct *, bool compound);
 void page_remove_rmap(struct page *, struct vm_area_struct *,
 		bool compound);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 8287f2cc327d..f8861d832468 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1303,31 +1303,44 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
 }
 
 /**
- * page_add_file_rmap - add pte mapping to a file page
- * @page:	the page to add the mapping to
+ * flio_add_file_rmap_range - add pte mapping to a sub page range of a folio
+ * @folio:	The filio to add the mapping to
+ * @start:	The first page index in folio
+ * @nr_pages:	The number of pages for pte mapping adding
  * @vma:	the vm area in which the mapping is added
  * @compound:	charge the page as compound or small page
  *
+ * The page range of folio is defined by [first_page, first_page + nr_pages)
+ *
  * The caller needs to hold the pte lock.
  */
-void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
-		bool compound)
+void folio_add_file_rmap_range(struct folio *folio, unsigned long start,
+			unsigned int nr_pages, struct vm_area_struct *vma,
+			bool compound)
 {
-	struct folio *folio = page_folio(page);
 	atomic_t *mapped = &folio->_nr_pages_mapped;
-	int nr = 0, nr_pmdmapped = 0;
-	bool first;
+	unsigned int nr_pmdmapped = 0, first;
+	int nr = 0;
 
-	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+	VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
 
 	/* Is page being mapped by PTE? Is this its first map to be added? */
 	if (likely(!compound)) {
-		first = atomic_inc_and_test(&page->_mapcount);
-		nr = first;
-		if (first && folio_test_large(folio)) {
-			nr = atomic_inc_return_relaxed(mapped);
-			nr = (nr < COMPOUND_MAPPED);
-		}
+		struct page *page = folio_page(folio, start);
+
+		nr_pages = min_t(unsigned int, nr_pages,
+					folio_nr_pages(folio) - start);
+
+		do {
+			first = atomic_inc_and_test(&page->_mapcount);
+			if (first && folio_test_large(folio)) {
+				first = atomic_inc_return_relaxed(mapped);
+				first = (nr < COMPOUND_MAPPED);
+			}
+
+			if (first)
+				nr++;
+		} while (page++, --nr_pages > 0);
 	} else if (folio_test_pmd_mappable(folio)) {
 		/* That test is redundant: it's for safety or to optimize out */
 
@@ -1356,6 +1369,31 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
 	mlock_vma_folio(folio, vma, compound);
 }
 
+/**
+ * page_add_file_rmap - add pte mapping to a file page
+ * @page:	the page to add the mapping to
+ * @vma:	the vm area in which the mapping is added
+ * @compound:	charge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
+		bool compound)
+{
+	struct folio *folio = page_folio(page);
+	unsigned int nr_pages;
+
+	VM_WARN_ON_ONCE_PAGE(compound && !PageTransHuge(page), page);
+
+	if (likely(!compound))
+		nr_pages = 1;
+	else
+		nr_pages = folio_nr_pages(folio);
+
+	folio_add_file_rmap_range(folio, folio_page_idx(folio, page),
+			nr_pages, vma, compound);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
-- 
2.30.2




* [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 13:16 [RFC PATCH v3 0/4] folio based filemap_map_pages() Yin Fengwei
  2023-02-03 13:16 ` [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
  2023-02-03 13:16 ` [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range() Yin Fengwei
@ 2023-02-03 13:16 ` Yin Fengwei
  2023-02-03 13:25   ` David Hildenbrand
  2023-02-03 13:32   ` Chih-En Lin
  2023-02-03 13:16 ` [RFC PATCH v3 4/4] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
  3 siblings, 2 replies; 21+ messages in thread
From: Yin Fengwei @ 2023-02-03 13:16 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

do_set_pte_range() allows setting up page table entries for a
specific range. It calls folio_add_file_rmap_range() to take
advantage of batched rmap updates for large folios.
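
The intended use (a sketch only; patch 4 converts the real caller):

	/*
	 * Install nr ptes for folio pages [start, start + nr) starting
	 * at userspace address addr, with a single rmap / mm counter
	 * update for the whole range.
	 */
	do_set_pte_range(vmf, folio, addr, pte, start, nr);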

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 include/linux/mm.h |  3 +++
 mm/filemap.c       |  1 -
 mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d6f8f41514cc..93192f04b276 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 
 vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
 void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+		unsigned long addr, pte_t *pte,
+		unsigned long start, unsigned int nr);
 
 vm_fault_t finish_fault(struct vm_fault *vmf);
 vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
diff --git a/mm/filemap.c b/mm/filemap.c
index f444684db9f2..74046a3a0ff5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 
 		ref_count++;
 		do_set_pte(vmf, page, addr);
-		update_mmu_cache(vma, addr, vmf->pte);
 	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
 
 	/* Restore the vmf->pte */
diff --git a/mm/memory.c b/mm/memory.c
index 7a04a1130ec1..3754b2ef166a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 }
 #endif
 
-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
+		unsigned long addr, pte_t *pte,
+		unsigned long start, unsigned int nr)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
+	bool cow = write && !(vma->vm_flags & VM_SHARED);
 	bool prefault = vmf->address != addr;
+	struct page *page = folio_page(folio, start);
 	pte_t entry;
 
-	flush_icache_page(vma, page);
-	entry = mk_pte(page, vma->vm_page_prot);
+	if (!cow) {
+		folio_add_file_rmap_range(folio, start, nr, vma, false);
+		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+	}
 
-	if (prefault && arch_wants_old_prefaulted_pte())
-		entry = pte_mkold(entry);
-	else
-		entry = pte_sw_mkyoung(entry);
+	do {
+		flush_icache_page(vma, page);
+		entry = mk_pte(page, vma->vm_page_prot);
 
-	if (write)
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	if (unlikely(uffd_wp))
-		entry = pte_mkuffd_wp(entry);
-	/* copy-on-write page */
-	if (write && !(vma->vm_flags & VM_SHARED)) {
+		if (prefault && arch_wants_old_prefaulted_pte())
+			entry = pte_mkold(entry);
+		else
+			entry = pte_sw_mkyoung(entry);
+
+		if (write)
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (unlikely(uffd_wp))
+			entry = pte_mkuffd_wp(entry);
+		set_pte_at(vma->vm_mm, addr, pte, entry);
+
+		/* no need to invalidate: a not-present page won't be cached */
+		update_mmu_cache(vma, addr, pte);
+	} while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
+}
+
+void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+{
+	struct folio *folio = page_folio(page);
+	struct vm_area_struct *vma = vmf->vma;
+	bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
+			!(vma->vm_flags & VM_SHARED);
+
+	if (cow) {
 		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
-	} else {
-		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
-		page_add_file_rmap(page, vma, false);
 	}
-	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+
+	do_set_pte_range(vmf, folio, addr, vmf->pte,
+			folio_page_idx(folio, page), 1);
 }
 
 static bool vmf_pte_changed(struct vm_fault *vmf)
@@ -4361,9 +4383,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	if (likely(!vmf_pte_changed(vmf))) {
 		do_set_pte(vmf, page, vmf->address);
 
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, vmf->address, vmf->pte);
-
 		ret = 0;
 	} else {
 		update_mmu_tlb(vma, vmf->address, vmf->pte);
-- 
2.30.2




* [RFC PATCH v3 4/4] filemap: batched update mm counter,rmap when map file folio
  2023-02-03 13:16 [RFC PATCH v3 0/4] folio based filemap_map_pages() Yin Fengwei
                   ` (2 preceding siblings ...)
  2023-02-03 13:16 ` [RFC PATCH v3 3/4] mm: add do_set_pte_range() Yin Fengwei
@ 2023-02-03 13:16 ` Yin Fengwei
  3 siblings, 0 replies; 21+ messages in thread
From: Yin Fengwei @ 2023-02-03 13:16 UTC (permalink / raw)
  To: willy, david, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang, fengwei.yin

Use do_set_pte_range() in filemap_map_folio_range(), which batches
the mm counter and rmap updates for large folios.

A will-it-scale.page_fault3-like app (with file write fault testing
changed to read fault testing; trying to upstream it to will-it-scale
at [1]) got a 15% performance gain.
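
As an example of the batching below (made-up layout, assuming the whole
folio falls inside the fault-around window so start == 0): for a
16-page folio where page 5 is HWPoison and the pte for page 11 is
already populated, the loop ends up issuing

	do_set_pte_range(vmf, folio, addr,                  pte,      0,  5);	/* folio pages 0..4   */
	do_set_pte_range(vmf, folio, addr + 6 * PAGE_SIZE,  pte + 6,  6,  5);	/* folio pages 6..10  */
	do_set_pte_range(vmf, folio, addr + 12 * PAGE_SIZE, pte + 12, 12, 4);	/* folio pages 12..15 */

plus one folio_ref_add() per run, i.e. one refcount/rmap/mm counter
update per contiguous run instead of one per page.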

Perf data collected before/after the change:
  18.73%--page_add_file_rmap
          |
           --11.60%--__mod_lruvec_page_state
                     |
                     |--7.40%--__mod_memcg_lruvec_state
                     |          |
                     |           --5.58%--cgroup_rstat_updated
                     |
                      --2.53%--__mod_lruvec_state
                                |
                                 --1.48%--__mod_node_page_state

  9.93%--page_add_file_rmap_range
         |
          --2.67%--__mod_lruvec_page_state
                    |
                    |--1.95%--__mod_memcg_lruvec_state
                    |          |
                    |           --1.57%--cgroup_rstat_updated
                    |
                     --0.61%--__mod_lruvec_state
                               |
                                --0.54%--__mod_node_page_state

The running time of __mod_lruvec_page_state() is reduced by about 9%
(from 11.60% to 2.67% in the profiles above).

[1]: https://github.com/antonblanchard/will-it-scale/pull/37

Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
 mm/filemap.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 74046a3a0ff5..be75352050fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3364,11 +3364,12 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct page *page = folio_page(folio, start);
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
-	unsigned int ref_count = 0, count = 0;
+	unsigned int mapped = 0;
+	pte_t *pte = vmf->pte;
 
 	do {
 		if (PageHWPoison(page))
-			continue;
+			goto map;
 
 		if (mmap_miss > 0)
 			mmap_miss--;
@@ -3378,20 +3379,33 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		 * handled in the specific fault path, and it'll prohibit the
 		 * fault-around logic.
 		 */
-		if (!pte_none(*vmf->pte))
-			continue;
+		if (!pte_none(pte[mapped]))
+			goto map;
 
 		if (vmf->address == addr)
 			ret = VM_FAULT_NOPAGE;
 
-		ref_count++;
-		do_set_pte(vmf, page, addr);
-	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+		mapped++;
+		continue;
 
-	/* Restore the vmf->pte */
-	vmf->pte -= nr_pages;
+map:
+		if (mapped) {
+			do_set_pte_range(vmf, folio, addr, pte, start, mapped);
+			folio_ref_add(folio, mapped);
+		}
+
+		/* advance 1 to jump the HWPoison or !pte_none entry */
+		start += mapped + 1;
+		pte += mapped + 1;
+		addr += (mapped + 1) * PAGE_SIZE;
+		mapped = 0;
+	} while (page++, --nr_pages > 0);
+
+	if (mapped) {
+		do_set_pte_range(vmf, folio, addr, pte, start, mapped);
+		folio_ref_add(folio, mapped);
+	}
 
-	folio_ref_add(folio, ref_count);
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 
 	return ret;
-- 
2.30.2




* Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 13:16 ` [RFC PATCH v3 3/4] mm: add do_set_pte_range() Yin Fengwei
@ 2023-02-03 13:25   ` David Hildenbrand
  2023-02-03 13:30     ` Yin, Fengwei
  2023-02-03 13:32   ` Chih-En Lin
  1 sibling, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-02-03 13:25 UTC (permalink / raw)
  To: Yin Fengwei, willy, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang

On 03.02.23 14:16, Yin Fengwei wrote:
> do_set_pte_range() allows setting up page table entries for a
> specific range. It calls folio_add_file_rmap_range() to take
> advantage of batched rmap updates for large folios.
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>   include/linux/mm.h |  3 +++
>   mm/filemap.c       |  1 -
>   mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
>   3 files changed, 42 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d6f8f41514cc..93192f04b276 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>   
>   vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>   void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +		unsigned long addr, pte_t *pte,
> +		unsigned long start, unsigned int nr);
>   
>   vm_fault_t finish_fault(struct vm_fault *vmf);
>   vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index f444684db9f2..74046a3a0ff5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>   
>   		ref_count++;
>   		do_set_pte(vmf, page, addr);
> -		update_mmu_cache(vma, addr, vmf->pte);
>   	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>   
>   	/* Restore the vmf->pte */
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a04a1130ec1..3754b2ef166a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>   }
>   #endif
>   
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +		unsigned long addr, pte_t *pte,
> +		unsigned long start, unsigned int nr)
>   {
>   	struct vm_area_struct *vma = vmf->vma;
>   	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>   	bool write = vmf->flags & FAULT_FLAG_WRITE;
> +	bool cow = write && !(vma->vm_flags & VM_SHARED);
>   	bool prefault = vmf->address != addr;
> +	struct page *page = folio_page(folio, start);
>   	pte_t entry;
>   
> -	flush_icache_page(vma, page);
> -	entry = mk_pte(page, vma->vm_page_prot);
> +	if (!cow) {
> +		folio_add_file_rmap_range(folio, start, nr, vma, false);
> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> +	}
>   
> -	if (prefault && arch_wants_old_prefaulted_pte())
> -		entry = pte_mkold(entry);
> -	else
> -		entry = pte_sw_mkyoung(entry);
> +	do {
> +		flush_icache_page(vma, page);
> +		entry = mk_pte(page, vma->vm_page_prot);
>   
> -	if (write)
> -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -	if (unlikely(uffd_wp))
> -		entry = pte_mkuffd_wp(entry);
> -	/* copy-on-write page */
> -	if (write && !(vma->vm_flags & VM_SHARED)) {
> +		if (prefault && arch_wants_old_prefaulted_pte())
> +			entry = pte_mkold(entry);
> +		else
> +			entry = pte_sw_mkyoung(entry);
> +
> +		if (write)
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		if (unlikely(uffd_wp))
> +			entry = pte_mkuffd_wp(entry);
> +		set_pte_at(vma->vm_mm, addr, pte, entry);
> +
> +		/* no need to invalidate: a not-present page won't be cached */
> +		update_mmu_cache(vma, addr, pte);
> +	} while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
> +}
> +
> +void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +{
> +	struct folio *folio = page_folio(page);
> +	struct vm_area_struct *vma = vmf->vma;
> +	bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> +			!(vma->vm_flags & VM_SHARED);
> +
> +	if (cow) {
>   		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>   		page_add_new_anon_rmap(page, vma, addr);

As raised, we cannot PTE-map a multi-page folio that way.

This function only supports single-page anon folios.

page_add_new_anon_rmap() -> folio_add_new_anon_rmap(). As that documents:

"If the folio is large, it is accounted as a THP" -- for example, we 
would only increment the "entire mapcount" and set the PageAnonExclusive 
bit only on the head page.

So this really doesn't work for multi-page folios and if the function 
would be used for that, we'd be in trouble.

We'd want some fence here to detect that and bail out if we'd be 
instructed to do that. At least a WARN_ON_ONCE() I guess.
update_mmu_tlb(vma, vmf->address, vmf->pte);

Right now the function looks like it might just handle that.

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 13:25   ` David Hildenbrand
@ 2023-02-03 13:30     ` Yin, Fengwei
  2023-02-03 13:34       ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Yin, Fengwei @ 2023-02-03 13:30 UTC (permalink / raw)
  To: David Hildenbrand, Yin Fengwei, willy, linux-mm
  Cc: dave.hansen, tim.c.chen, ying.huang



On 2/3/2023 9:25 PM, David Hildenbrand wrote:
> On 03.02.23 14:16, Yin Fengwei wrote:
>> do_set_pte_range() allows setting up page table entries for a
>> specific range. It calls folio_add_file_rmap_range() to take
>> advantage of batched rmap updates for large folios.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>   include/linux/mm.h |  3 +++
>>   mm/filemap.c       |  1 -
>>   mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
>>   3 files changed, 42 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d6f8f41514cc..93192f04b276 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>     vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>   void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> +        unsigned long addr, pte_t *pte,
>> +        unsigned long start, unsigned int nr);
>>     vm_fault_t finish_fault(struct vm_fault *vmf);
>>   vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index f444684db9f2..74046a3a0ff5 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>             ref_count++;
>>           do_set_pte(vmf, page, addr);
>> -        update_mmu_cache(vma, addr, vmf->pte);
>>       } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>         /* Restore the vmf->pte */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7a04a1130ec1..3754b2ef166a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>   }
>>   #endif
>>   -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> +        unsigned long addr, pte_t *pte,
>> +        unsigned long start, unsigned int nr)
>>   {
>>       struct vm_area_struct *vma = vmf->vma;
>>       bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>>       bool write = vmf->flags & FAULT_FLAG_WRITE;
>> +    bool cow = write && !(vma->vm_flags & VM_SHARED);
>>       bool prefault = vmf->address != addr;
>> +    struct page *page = folio_page(folio, start);
>>       pte_t entry;
>>   -    flush_icache_page(vma, page);
>> -    entry = mk_pte(page, vma->vm_page_prot);
>> +    if (!cow) {
>> +        folio_add_file_rmap_range(folio, start, nr, vma, false);
>> +        add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>> +    }
>>   -    if (prefault && arch_wants_old_prefaulted_pte())
>> -        entry = pte_mkold(entry);
>> -    else
>> -        entry = pte_sw_mkyoung(entry);
>> +    do {
>> +        flush_icache_page(vma, page);
>> +        entry = mk_pte(page, vma->vm_page_prot);
>>   -    if (write)
>> -        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>> -    if (unlikely(uffd_wp))
>> -        entry = pte_mkuffd_wp(entry);
>> -    /* copy-on-write page */
>> -    if (write && !(vma->vm_flags & VM_SHARED)) {
>> +        if (prefault && arch_wants_old_prefaulted_pte())
>> +            entry = pte_mkold(entry);
>> +        else
>> +            entry = pte_sw_mkyoung(entry);
>> +
>> +        if (write)
>> +            entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>> +        if (unlikely(uffd_wp))
>> +            entry = pte_mkuffd_wp(entry);
>> +        set_pte_at(vma->vm_mm, addr, pte, entry);
>> +
>> +        /* no need to invalidate: a not-present page won't be cached */
>> +        update_mmu_cache(vma, addr, pte);
>> +    } while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
>> +}
>> +
>> +void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> +{
>> +    struct folio *folio = page_folio(page);
>> +    struct vm_area_struct *vma = vmf->vma;
>> +    bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>> +            !(vma->vm_flags & VM_SHARED);
>> +
>> +    if (cow) {
>>           inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>           page_add_new_anon_rmap(page, vma, addr);
> 
> As raised, we cannot PTE-map a multi-page folio that way.
> 
> This function only supports single-page anon folios.
> 
> page_add_new_anon_rmap() -> folio_add_new_anon_rmap(). As that documents:
> 
> "If the folio is large, it is accounted as a THP" -- for example, we would only increment the "entire mapcount" and set the PageAnonExclusive bit only on the head page.
> 
> So this really doesn't work for multi-page folios and if the function would be used for that, we'd be in trouble.
> 
> We'd want some fence here to detect that and bail out if we'd be instructed to do that. At least a WARN_ON_ONCE() I guess.
> update_mmu_tlb(vma, vmf->address, vmf->pte);
> 
> Right now the function looks like it might just handle that.
You are right. I thought moving the cow case out of it would make it explicit.
But it looks like it doesn't. I will add WARN_ON_ONCE(). Thanks.


Regards
Yin, Fengwei

> 



* Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 13:16 ` [RFC PATCH v3 3/4] mm: add do_set_pte_range() Yin Fengwei
  2023-02-03 13:25   ` David Hildenbrand
@ 2023-02-03 13:32   ` Chih-En Lin
  2023-02-03 13:38     ` Yin, Fengwei
  1 sibling, 1 reply; 21+ messages in thread
From: Chih-En Lin @ 2023-02-03 13:32 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Fri, Feb 03, 2023 at 09:16:35PM +0800, Yin Fengwei wrote:
> do_set_pte_range() allows setting up page table entries for a
> specific range. It calls folio_add_file_rmap_range() to take
> advantage of batched rmap updates for large folios.
> 
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
>  include/linux/mm.h |  3 +++
>  mm/filemap.c       |  1 -
>  mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
>  3 files changed, 42 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d6f8f41514cc..93192f04b276 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  
>  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>  void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +		unsigned long addr, pte_t *pte,
> +		unsigned long start, unsigned int nr);
>  
>  vm_fault_t finish_fault(struct vm_fault *vmf);
>  vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index f444684db9f2..74046a3a0ff5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>  
>  		ref_count++;
>  		do_set_pte(vmf, page, addr);
> -		update_mmu_cache(vma, addr, vmf->pte);
>  	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>  
>  	/* Restore the vmf->pte */
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a04a1130ec1..3754b2ef166a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>  }
>  #endif
>  
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> +		unsigned long addr, pte_t *pte,
> +		unsigned long start, unsigned int nr)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> +	bool cow = write && !(vma->vm_flags & VM_SHARED);

Why not use is_cow_mapping()?
Is there anything messed up with VM_MAYWRITE?

>  	bool prefault = vmf->address != addr;
> +	struct page *page = folio_page(folio, start);
>  	pte_t entry;
>  
> -	flush_icache_page(vma, page);
> -	entry = mk_pte(page, vma->vm_page_prot);
> +	if (!cow) {
> +		folio_add_file_rmap_range(folio, start, nr, vma, false);
> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> +	}
>  
> -	if (prefault && arch_wants_old_prefaulted_pte())
> -		entry = pte_mkold(entry);
> -	else
> -		entry = pte_sw_mkyoung(entry);
> +	do {
> +		flush_icache_page(vma, page);
> +		entry = mk_pte(page, vma->vm_page_prot);
>  
> -	if (write)
> -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -	if (unlikely(uffd_wp))
> -		entry = pte_mkuffd_wp(entry);
> -	/* copy-on-write page */
> -	if (write && !(vma->vm_flags & VM_SHARED)) {
> +		if (prefault && arch_wants_old_prefaulted_pte())
> +			entry = pte_mkold(entry);
> +		else
> +			entry = pte_sw_mkyoung(entry);
> +
> +		if (write)
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		if (unlikely(uffd_wp))
> +			entry = pte_mkuffd_wp(entry);
> +		set_pte_at(vma->vm_mm, addr, pte, entry);
> +
> +		/* no need to invalidate: a not-present page won't be cached */
> +		update_mmu_cache(vma, addr, pte);
> +	} while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
> +}
> +
> +void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +{
> +	struct folio *folio = page_folio(page);
> +	struct vm_area_struct *vma = vmf->vma;
> +	bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> +			!(vma->vm_flags & VM_SHARED);

Here too.

> +
> +	if (cow) {
>  		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>  		page_add_new_anon_rmap(page, vma, addr);
>  		lru_cache_add_inactive_or_unevictable(page, vma);
> -	} else {
> -		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> -		page_add_file_rmap(page, vma, false);
>  	}
> -	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> +
> +	do_set_pte_range(vmf, folio, addr, vmf->pte,
> +			folio_page_idx(folio, page), 1);
>  }
>  
>  static bool vmf_pte_changed(struct vm_fault *vmf)
> @@ -4361,9 +4383,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>  	if (likely(!vmf_pte_changed(vmf))) {
>  		do_set_pte(vmf, page, vmf->address);
>  
> -		/* no need to invalidate: a not-present page won't be cached */
> -		update_mmu_cache(vma, vmf->address, vmf->pte);
> -
>  		ret = 0;
>  	} else {
>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
> -- 
> 2.30.2
> 
> 

Thanks,
Chih-En Lin



* Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 13:30     ` Yin, Fengwei
@ 2023-02-03 13:34       ` David Hildenbrand
  2023-02-03 13:39         ` Yin, Fengwei
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-02-03 13:34 UTC (permalink / raw)
  To: Yin, Fengwei, willy, linux-mm; +Cc: dave.hansen, tim.c.chen, ying.huang

On 03.02.23 14:30, Yin, Fengwei wrote:
> 
> 
> On 2/3/2023 9:25 PM, David Hildenbrand wrote:
>> On 03.02.23 14:16, Yin Fengwei wrote:
>>> do_set_pte_range() allows setting up page table entries for a
>>> specific range. It calls folio_add_file_rmap_range() to take
>>> advantage of batched rmap updates for large folios.
>>>
>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>> ---
>>>    include/linux/mm.h |  3 +++
>>>    mm/filemap.c       |  1 -
>>>    mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
>>>    3 files changed, 42 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index d6f8f41514cc..93192f04b276 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>>      vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>>    void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>> +        unsigned long addr, pte_t *pte,
>>> +        unsigned long start, unsigned int nr);
>>>      vm_fault_t finish_fault(struct vm_fault *vmf);
>>>    vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index f444684db9f2..74046a3a0ff5 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>>              ref_count++;
>>>            do_set_pte(vmf, page, addr);
>>> -        update_mmu_cache(vma, addr, vmf->pte);
>>>        } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>>          /* Restore the vmf->pte */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 7a04a1130ec1..3754b2ef166a 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>>    }
>>>    #endif
>>>    -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>> +        unsigned long addr, pte_t *pte,
>>> +        unsigned long start, unsigned int nr)
>>>    {
>>>        struct vm_area_struct *vma = vmf->vma;
>>>        bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>>>        bool write = vmf->flags & FAULT_FLAG_WRITE;
>>> +    bool cow = write && !(vma->vm_flags & VM_SHARED);
>>>        bool prefault = vmf->address != addr;
>>> +    struct page *page = folio_page(folio, start);
>>>        pte_t entry;
>>>    -    flush_icache_page(vma, page);
>>> -    entry = mk_pte(page, vma->vm_page_prot);
>>> +    if (!cow) {
>>> +        folio_add_file_rmap_range(folio, start, nr, vma, false);
>>> +        add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>>> +    }
>>>    -    if (prefault && arch_wants_old_prefaulted_pte())
>>> -        entry = pte_mkold(entry);
>>> -    else
>>> -        entry = pte_sw_mkyoung(entry);
>>> +    do {
>>> +        flush_icache_page(vma, page);
>>> +        entry = mk_pte(page, vma->vm_page_prot);
>>>    -    if (write)
>>> -        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>> -    if (unlikely(uffd_wp))
>>> -        entry = pte_mkuffd_wp(entry);
>>> -    /* copy-on-write page */
>>> -    if (write && !(vma->vm_flags & VM_SHARED)) {
>>> +        if (prefault && arch_wants_old_prefaulted_pte())
>>> +            entry = pte_mkold(entry);
>>> +        else
>>> +            entry = pte_sw_mkyoung(entry);
>>> +
>>> +        if (write)
>>> +            entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>> +        if (unlikely(uffd_wp))
>>> +            entry = pte_mkuffd_wp(entry);
>>> +        set_pte_at(vma->vm_mm, addr, pte, entry);
>>> +
>>> +        /* no need to invalidate: a not-present page won't be cached */
>>> +        update_mmu_cache(vma, addr, pte);
>>> +    } while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
>>> +}
>>> +
>>> +void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>> +{
>>> +    struct folio *folio = page_folio(page);
>>> +    struct vm_area_struct *vma = vmf->vma;
>>> +    bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>> +            !(vma->vm_flags & VM_SHARED);
>>> +
>>> +    if (cow) {
>>>            inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>            page_add_new_anon_rmap(page, vma, addr);
>>
>> As raised, we cannot PTE-map a multi-page folio that way.
>>
>> This function only supports single-page anon folios.
>>
>> page_add_new_anon_rmap() -> folio_add_new_anon_rmap(). As that documents:
>>
>> "If the folio is large, it is accounted as a THP" -- for example, we would only increment the "entire mapcount" and set the PageAnonExclusive bit only on the head page.
>>
>> So this really doesn't work for multi-page folios and if the function would be used for that, we'd be in trouble.
>>
>> We'd want some fence here to detect that and bail out if we'd be instructed to do that. At least a WARN_ON_ONCE() I guess.
>> update_mmu_tlb(vma, vmf->address, vmf->pte);
>>
>> Right now the function looks like it might just handle that.
> You are right. I thought moving cow case out of it can make it explicit.
> But looks like it doesn't. I will add WARN_ON_ONCE(). Thanks.

I guess I would move the cow check into  do_set_pte_range() as well, and 
verify in there that we are really only dealing with a single-page 
folio, commenting that rmap code would need serious adjustment to make 
it work and that current code never passes a multi-page folio.

Thanks!

-- 
Thanks,

David / dhildenb




* Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 13:32   ` Chih-En Lin
@ 2023-02-03 13:38     ` Yin, Fengwei
  2023-02-03 14:30       ` Chih-En Lin
  0 siblings, 1 reply; 21+ messages in thread
From: Yin, Fengwei @ 2023-02-03 13:38 UTC (permalink / raw)
  To: Chih-En Lin, Yin Fengwei
  Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/3/2023 9:32 PM, Chih-En Lin wrote:
> On Fri, Feb 03, 2023 at 09:16:35PM +0800, Yin Fengwei wrote:
>> do_set_pte_range() allows setting up page table entries for a
>> specific range. It calls folio_add_file_rmap_range() to take
>> advantage of batched rmap updates for large folios.
>>
>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>> ---
>>  include/linux/mm.h |  3 +++
>>  mm/filemap.c       |  1 -
>>  mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
>>  3 files changed, 42 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d6f8f41514cc..93192f04b276 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>  
>>  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>  void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> +		unsigned long addr, pte_t *pte,
>> +		unsigned long start, unsigned int nr);
>>  
>>  vm_fault_t finish_fault(struct vm_fault *vmf);
>>  vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index f444684db9f2..74046a3a0ff5 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>  
>>  		ref_count++;
>>  		do_set_pte(vmf, page, addr);
>> -		update_mmu_cache(vma, addr, vmf->pte);
>>  	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>  
>>  	/* Restore the vmf->pte */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 7a04a1130ec1..3754b2ef166a 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>  }
>>  #endif
>>  
>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> +		unsigned long addr, pte_t *pte,
>> +		unsigned long start, unsigned int nr)
>>  {
>>  	struct vm_area_struct *vma = vmf->vma;
>>  	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
>> +	bool cow = write && !(vma->vm_flags & VM_SHARED);
> 
> Why not use is_cow_mapping()?
> Is there anything messed up with VM_MAYWRITE?
My understanding is that it's not related to the mapping but to what
operation triggers the fault here. Say it's a writable mapping and a
read operation triggers the fault: then no cow or maybe_mkwrite() is
needed here. Thanks.
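
To put it as code (illustration only; cow_mapping/cow_fault are just
names for this example):

	/* property of the VMA: could this mapping ever be COWed? */
	bool cow_mapping = is_cow_mapping(vma->vm_flags);

	/* property of this fault: does *this* fault install a COWed page? */
	bool cow_fault = (vmf->flags & FAULT_FLAG_WRITE) &&
			 !(vma->vm_flags & VM_SHARED);

A read fault on a private writable mapping has cow_mapping true but
cow_fault false, and still takes the file rmap path here.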


Regards
Yin, Fengwei

> 
>>  	bool prefault = vmf->address != addr;
>> +	struct page *page = folio_page(folio, start);
>>  	pte_t entry;
>>  
>> -	flush_icache_page(vma, page);
>> -	entry = mk_pte(page, vma->vm_page_prot);
>> +	if (!cow) {
>> +		folio_add_file_rmap_range(folio, start, nr, vma, false);
>> +		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>> +	}
>>  
>> -	if (prefault && arch_wants_old_prefaulted_pte())
>> -		entry = pte_mkold(entry);
>> -	else
>> -		entry = pte_sw_mkyoung(entry);
>> +	do {
>> +		flush_icache_page(vma, page);
>> +		entry = mk_pte(page, vma->vm_page_prot);
>>  
>> -	if (write)
>> -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>> -	if (unlikely(uffd_wp))
>> -		entry = pte_mkuffd_wp(entry);
>> -	/* copy-on-write page */
>> -	if (write && !(vma->vm_flags & VM_SHARED)) {
>> +		if (prefault && arch_wants_old_prefaulted_pte())
>> +			entry = pte_mkold(entry);
>> +		else
>> +			entry = pte_sw_mkyoung(entry);
>> +
>> +		if (write)
>> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>> +		if (unlikely(uffd_wp))
>> +			entry = pte_mkuffd_wp(entry);
>> +		set_pte_at(vma->vm_mm, addr, pte, entry);
>> +
>> +		/* no need to invalidate: a not-present page won't be cached */
>> +		update_mmu_cache(vma, addr, pte);
>> +	} while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
>> +}
>> +
>> +void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> +{
>> +	struct folio *folio = page_folio(page);
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>> +			!(vma->vm_flags & VM_SHARED);
> 
> Here too.
> 
>> +
>> +	if (cow) {
>>  		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>  		page_add_new_anon_rmap(page, vma, addr);
>>  		lru_cache_add_inactive_or_unevictable(page, vma);
>> -	} else {
>> -		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>> -		page_add_file_rmap(page, vma, false);
>>  	}
>> -	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>> +
>> +	do_set_pte_range(vmf, folio, addr, vmf->pte,
>> +			folio_page_idx(folio, page), 1);
>>  }
>>  
>>  static bool vmf_pte_changed(struct vm_fault *vmf)
>> @@ -4361,9 +4383,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>  	if (likely(!vmf_pte_changed(vmf))) {
>>  		do_set_pte(vmf, page, vmf->address);
>>  
>> -		/* no need to invalidate: a not-present page won't be cached */
>> -		update_mmu_cache(vma, vmf->address, vmf->pte);
>> -
>>  		ret = 0;
>>  	} else {
>>  		update_mmu_tlb(vma, vmf->address, vmf->pte);
>> -- 
>> 2.30.2
>>
>>
> 
> Thanks,
> Chih-En Lin
> 



* Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 13:34       ` David Hildenbrand
@ 2023-02-03 13:39         ` Yin, Fengwei
  0 siblings, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-02-03 13:39 UTC (permalink / raw)
  To: David Hildenbrand, Yin, Fengwei, willy, linux-mm
  Cc: dave.hansen, tim.c.chen, ying.huang



On 2/3/2023 9:34 PM, David Hildenbrand wrote:
> On 03.02.23 14:30, Yin, Fengwei wrote:
>>
>>
>> On 2/3/2023 9:25 PM, David Hildenbrand wrote:
>>> On 03.02.23 14:16, Yin Fengwei wrote:
>>>> do_set_pte_range() allows setting up page table entries for a
>>>> specific range. It calls folio_add_file_rmap_range() to take
>>>> advantage of batched rmap updates for large folios.
>>>>
>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>> ---
>>>>    include/linux/mm.h |  3 +++
>>>>    mm/filemap.c       |  1 -
>>>>    mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
>>>>    3 files changed, 42 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index d6f8f41514cc..93192f04b276 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>>>      vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>>>    void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>>>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>>> +        unsigned long addr, pte_t *pte,
>>>> +        unsigned long start, unsigned int nr);
>>>>      vm_fault_t finish_fault(struct vm_fault *vmf);
>>>>    vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index f444684db9f2..74046a3a0ff5 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>>>              ref_count++;
>>>>            do_set_pte(vmf, page, addr);
>>>> -        update_mmu_cache(vma, addr, vmf->pte);
>>>>        } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>>>          /* Restore the vmf->pte */
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 7a04a1130ec1..3754b2ef166a 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>>>    }
>>>>    #endif
>>>>    -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>>> +        unsigned long addr, pte_t *pte,
>>>> +        unsigned long start, unsigned int nr)
>>>>    {
>>>>        struct vm_area_struct *vma = vmf->vma;
>>>>        bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>>>>        bool write = vmf->flags & FAULT_FLAG_WRITE;
>>>> +    bool cow = write && !(vma->vm_flags & VM_SHARED);
>>>>        bool prefault = vmf->address != addr;
>>>> +    struct page *page = folio_page(folio, start);
>>>>        pte_t entry;
>>>>    -    flush_icache_page(vma, page);
>>>> -    entry = mk_pte(page, vma->vm_page_prot);
>>>> +    if (!cow) {
>>>> +        folio_add_file_rmap_range(folio, start, nr, vma, false);
>>>> +        add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>>>> +    }
>>>>    -    if (prefault && arch_wants_old_prefaulted_pte())
>>>> -        entry = pte_mkold(entry);
>>>> -    else
>>>> -        entry = pte_sw_mkyoung(entry);
>>>> +    do {
>>>> +        flush_icache_page(vma, page);
>>>> +        entry = mk_pte(page, vma->vm_page_prot);
>>>>    -    if (write)
>>>> -        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>>> -    if (unlikely(uffd_wp))
>>>> -        entry = pte_mkuffd_wp(entry);
>>>> -    /* copy-on-write page */
>>>> -    if (write && !(vma->vm_flags & VM_SHARED)) {
>>>> +        if (prefault && arch_wants_old_prefaulted_pte())
>>>> +            entry = pte_mkold(entry);
>>>> +        else
>>>> +            entry = pte_sw_mkyoung(entry);
>>>> +
>>>> +        if (write)
>>>> +            entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>>> +        if (unlikely(uffd_wp))
>>>> +            entry = pte_mkuffd_wp(entry);
>>>> +        set_pte_at(vma->vm_mm, addr, pte, entry);
>>>> +
>>>> +        /* no need to invalidate: a not-present page won't be cached */
>>>> +        update_mmu_cache(vma, addr, pte);
>>>> +    } while (pte++, page++, addr += PAGE_SIZE, --nr > 0);
>>>> +}
>>>> +
>>>> +void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>> +{
>>>> +    struct folio *folio = page_folio(page);
>>>> +    struct vm_area_struct *vma = vmf->vma;
>>>> +    bool cow = (vmf->flags & FAULT_FLAG_WRITE) &&
>>>> +            !(vma->vm_flags & VM_SHARED);
>>>> +
>>>> +    if (cow) {
>>>>            inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>            page_add_new_anon_rmap(page, vma, addr);
>>>
>>> As raised, we cannot PTE-map a multi-page folio that way.
>>>
>>> This function only supports single-page anon folios.
>>>
>>> page_add_new_anon_rmap() -> folio_add_new_anon_rmap(). As that documents:
>>>
>>> "If the folio is large, it is accounted as a THP" -- for example, we would only increment the "entire mapcount" and set the PageAnonExclusive bit only on the head page.
>>>
>>> So this really doesn't work for multi-page folios and if the function would be used for that, we'd be in trouble.
>>>
>>> We'd want some fence here to detect that and bail out if we'd be instructed to do that. At least a WARN_ON_ONCE() I guess.
>>> update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>
>>> Right now the function looks like it might just handle that.
>> You are right. I thought moving the cow case out of it would make it explicit.
>> But it looks like it doesn't. I will add WARN_ON_ONCE(). Thanks.
> 
> I guess I would move the cow check into  do_set_pte_range() as well, and verify in there that we are really only dealing with a single-page folio, commenting that rmap code would need serious adjustment to make it work and that current code never passes a multi-page folio.

Yes. There is a !cow check in do_set_pte_range() already. I planned to
add a WARN_ON_ONCE() in the else case there. Thanks.
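
Roughly, the idea would be (a sketch only, not the final code):

	if (!cow) {
		folio_add_file_rmap_range(folio, start, nr, vma, false);
		add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
	} else {
		/*
		 * The cow path (page_add_new_anon_rmap() in the caller)
		 * only handles single-page anon folios for now, so
		 * complain loudly if a multi-page range ever reaches it.
		 */
		WARN_ON_ONCE(nr > 1 || folio_test_large(folio));
	}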


Regards
Yin, Fengwei

> 
> Thanks!
> 



* Re: [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range()
  2023-02-03 13:16 ` [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
@ 2023-02-03 13:53   ` Matthew Wilcox
  2023-02-04  3:25     ` Yin, Fengwei
  2023-02-03 14:17   ` Kirill A. Shutemov
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2023-02-03 13:53 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Fri, Feb 03, 2023 at 09:16:33PM +0800, Yin Fengwei wrote:
> filemap_map_folio_range() maps a partial or full folio. Compared to
> the original filemap_map_pages(), it batches the refcount updates and
> gets a minor performance improvement for large folios.
> 
> A self-cooked will-it-scale.page_fault3-like app (with the file write
> fault changed to a read fault) on an xfs filesystem got a 2%
> performance gain.

Please delete folio_more_pages() as part of this patch; this was the
only caller.

> +		ret |=	filemap_map_folio_range(vmf, folio,
> +				xas.xa_index - folio->index, addr, nr_pages);

Sorry to nitpick, but there's an extra space between |= and
filemap_map_folio_range() here.




* Re: [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range()
  2023-02-03 13:16 ` [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range() Yin Fengwei
@ 2023-02-03 14:02   ` Matthew Wilcox
  2023-02-04  3:34     ` Yin, Fengwei
  2023-02-03 14:19   ` Kirill A. Shutemov
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2023-02-03 14:02 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Fri, Feb 03, 2023 at 09:16:34PM +0800, Yin Fengwei wrote:
> +++ b/mm/rmap.c
> @@ -1303,31 +1303,44 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>  }
>  
>  /**
> - * page_add_file_rmap - add pte mapping to a file page
> - * @page:	the page to add the mapping to
> + * flio_add_file_rmap_range - add pte mapping to a sub page range of a folio

Typo -- missing 'o'.  And there's that word "sub" still ;-)

> + * @folio:	The filio to add the mapping to

s/filio/folio/

> + * @start:	The first page index in folio

Better to describe it as the 'first page number' -- index is usually
used as file index, ie offset within the file, rather than offset within
the folio.

> + * @nr_pages:	The number of pages for pte mapping adding

Maybe "The number of pages which will be mapped"?



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range()
  2023-02-03 13:16 ` [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
  2023-02-03 13:53   ` Matthew Wilcox
@ 2023-02-03 14:17   ` Kirill A. Shutemov
  2023-02-04  3:31     ` Yin, Fengwei
  1 sibling, 1 reply; 21+ messages in thread
From: Kirill A. Shutemov @ 2023-02-03 14:17 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Fri, Feb 03, 2023 at 09:16:33PM +0800, Yin Fengwei wrote:
> @@ -3378,45 +3425,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>  	do {
> -again:
> -		page = folio_file_page(folio, xas.xa_index);
> -		if (PageHWPoison(page))
> -			goto unlock;
> -
> -		if (mmap_miss > 0)
> -			mmap_miss--;
> +		unsigned long end;
>  
>  		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
>  		vmf->pte += xas.xa_index - last_pgoff;
>  		last_pgoff = xas.xa_index;
> +		end = folio->index + folio_nr_pages(folio) - 1;
> +		nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
>  
> -		/*
> -		 * NOTE: If there're PTE markers, we'll leave them to be
> -		 * handled in the specific fault path, and it'll prohibit the
> -		 * fault-around logic.
> -		 */
> -		if (!pte_none(*vmf->pte))
> -			goto unlock;
> -
> -		/* We're about to handle the fault */
> -		if (vmf->address == addr)
> -			ret = VM_FAULT_NOPAGE;
> +		ret |=	filemap_map_folio_range(vmf, folio,
> +				xas.xa_index - folio->index, addr, nr_pages);
> +		xas.xa_index = end;

IIRC, end here can be beyond end_pgoff. Can it cause an issue? I don't see
it, but just in case.


-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range()
  2023-02-03 13:16 ` [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range() Yin Fengwei
  2023-02-03 14:02   ` Matthew Wilcox
@ 2023-02-03 14:19   ` Kirill A. Shutemov
  2023-02-04  3:35     ` Yin, Fengwei
  1 sibling, 1 reply; 21+ messages in thread
From: Kirill A. Shutemov @ 2023-02-03 14:19 UTC (permalink / raw)
  To: Yin Fengwei; +Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Fri, Feb 03, 2023 at 09:16:34PM +0800, Yin Fengwei wrote:
> folio_add_file_rmap_range() allows to add pte mapping to a specific
> range of file folio. Comparing to page_add_file_rmap(),it batched

Missing space before 'it'.


-- 
  Kiryl Shutsemau / Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 13:38     ` Yin, Fengwei
@ 2023-02-03 14:30       ` Chih-En Lin
  2023-02-04  5:47         ` Yin, Fengwei
  0 siblings, 1 reply; 21+ messages in thread
From: Chih-En Lin @ 2023-02-03 14:30 UTC (permalink / raw)
  To: Yin, Fengwei; +Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang

On Fri, Feb 03, 2023 at 09:38:15PM +0800, Yin, Fengwei wrote:
> 
> 
> On 2/3/2023 9:32 PM, Chih-En Lin wrote:
> > On Fri, Feb 03, 2023 at 09:16:35PM +0800, Yin Fengwei wrote:
> >> do_set_pte_range() allows setting up page table entries for a
> >> specific range. It calls folio_add_file_rmap_range() to take
> >> advantage of the batched rmap update for large folios.
> >>
> >> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> >> ---
> >>  include/linux/mm.h |  3 +++
> >>  mm/filemap.c       |  1 -
> >>  mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
> >>  3 files changed, 42 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index d6f8f41514cc..93192f04b276 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >>  
> >>  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
> >>  void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
> >> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> >> +		unsigned long addr, pte_t *pte,
> >> +		unsigned long start, unsigned int nr);
> >>  
> >>  vm_fault_t finish_fault(struct vm_fault *vmf);
> >>  vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
> >> diff --git a/mm/filemap.c b/mm/filemap.c
> >> index f444684db9f2..74046a3a0ff5 100644
> >> --- a/mm/filemap.c
> >> +++ b/mm/filemap.c
> >> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> >>  
> >>  		ref_count++;
> >>  		do_set_pte(vmf, page, addr);
> >> -		update_mmu_cache(vma, addr, vmf->pte);
> >>  	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
> >>  
> >>  	/* Restore the vmf->pte */
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index 7a04a1130ec1..3754b2ef166a 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> >>  }
> >>  #endif
> >>  
> >> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> >> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
> >> +		unsigned long addr, pte_t *pte,
> >> +		unsigned long start, unsigned int nr)
> >>  {
> >>  	struct vm_area_struct *vma = vmf->vma;
> >>  	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
> >>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
> >> +	bool cow = write && !(vma->vm_flags & VM_SHARED);
> > 
> > Why don't use is_cow_mapping()?
> > Is there anything messed up with VM_MAYWRITE?
> My understanding is that it's not related to the mapping but to which
> operation triggers the fault here. Say, if it's a writable mapping and
> a read operation triggers the fault here, no cow or maybe_mkwrite() is
> needed. Thanks.

Sorry, I didn't describe the thing properly.
The relationship with the write/read fault makes sense.
I'm just wondering whether "!(vma->vm_flags & VM_SHARED)" is enough to
determine a COW page. And I also found the same check in do_fault().

For example, copy_present_pte() uses is_cow_mapping() for the COW case and
"vm_flags & VM_SHARED" for the shared mapping case.

So, I looked up the commit that introduced the is_cow_mapping(),
67121172f9753 ("Allow arbitrary read-only shared pfn-remapping too").

Here is the commit message:
"
    The VM layer (for historical reasons) turns a read-only shared mmap into
    a private-like mapping with the VM_MAYWRITE bit clear.  Thus checking
    just VM_SHARED isn't actually sufficient.
    
    So use a trivial helper function for the cases where we wanted to inquire
    if a mapping was COW-like or not.
"

Hmm, but that commit dates back to v2.6.15.
So is "vm_flags & VM_SHARED" enough to check for a COW page now?

Thanks,
Chih-En Lin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range()
  2023-02-03 13:53   ` Matthew Wilcox
@ 2023-02-04  3:25     ` Yin, Fengwei
  0 siblings, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-02-04  3:25 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/3/2023 9:53 PM, Matthew Wilcox wrote:
> On Fri, Feb 03, 2023 at 09:16:33PM +0800, Yin Fengwei wrote:
>> filemap_map_folio_range() maps a partial or full folio. Compared to the
>> original filemap_map_pages(), it batches the refcount updates and
>> gets a minor performance improvement for large folios.
>>
>> A self-cooked will-it-scale.page_fault3-like app (with the file write
>> fault changed to a read fault) on an xfs filesystem got a 2% performance
>> gain.
> 
> Please delete folio_more_pages() as part of this patch; this was the
> only caller.
OK. Will update in next version.

> 
>> +		ret |=	filemap_map_folio_range(vmf, folio,
>> +				xas.xa_index - folio->index, addr, nr_pages);
> 
> Sorry to nitpick, but there's an extra space between |= and
> filemap_map_folio_range() here.
OK. Will update in next version.


Regards
Yin, Fengwei

> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range()
  2023-02-03 14:17   ` Kirill A. Shutemov
@ 2023-02-04  3:31     ` Yin, Fengwei
  0 siblings, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-02-04  3:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/3/2023 10:17 PM, Kirill A. Shutemov wrote:
> On Fri, Feb 03, 2023 at 09:16:33PM +0800, Yin Fengwei wrote:
>> @@ -3378,45 +3425,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>>  	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>  	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
>>  	do {
>> -again:
>> -		page = folio_file_page(folio, xas.xa_index);
>> -		if (PageHWPoison(page))
>> -			goto unlock;
>> -
>> -		if (mmap_miss > 0)
>> -			mmap_miss--;
>> +		unsigned long end;
>>  
>>  		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
>>  		vmf->pte += xas.xa_index - last_pgoff;
>>  		last_pgoff = xas.xa_index;
>> +		end = folio->index + folio_nr_pages(folio) - 1;
>> +		nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
>>  
>> -		/*
>> -		 * NOTE: If there're PTE markers, we'll leave them to be
>> -		 * handled in the specific fault path, and it'll prohibit the
>> -		 * fault-around logic.
>> -		 */
>> -		if (!pte_none(*vmf->pte))
>> -			goto unlock;
>> -
>> -		/* We're about to handle the fault */
>> -		if (vmf->address == addr)
>> -			ret = VM_FAULT_NOPAGE;
>> +		ret |=	filemap_map_folio_range(vmf, folio,
>> +				xas.xa_index - folio->index, addr, nr_pages);
>> +		xas.xa_index = end;
> 
> IIRC, end here can be beyond end_pgoff. Can it cause an issue? I don't see
> it, but just in case.
Yes, end can be beyond end_pgoff. That's fine because it will end the loop
and xas.xa_index is not accessed after that. But let me change the line to:
   xas.xa_index += nr_pages;
to keep the same behavior as before in the next version. Thanks.
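
For illustration (hypothetical numbers, not taken from the patch): with a
16-page folio at folio->index == 0, end_pgoff == 7 and xas.xa_index == 0,
the snippet above gives

	end      = 0 + 16 - 1         = 15
	nr_pages = min(15, 7) - 0 + 1 =  8	/* file indices 0..7 get mapped */

	xas.xa_index = end;		/* -> 15, beyond end_pgoff; the loop still ends */
	xas.xa_index += nr_pages;	/* -> 8, just past the last mapped index */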

Regards
Yin, Fengwei

> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range()
  2023-02-03 14:02   ` Matthew Wilcox
@ 2023-02-04  3:34     ` Yin, Fengwei
  0 siblings, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-02-04  3:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/3/2023 10:02 PM, Matthew Wilcox wrote:
> On Fri, Feb 03, 2023 at 09:16:34PM +0800, Yin Fengwei wrote:
>> +++ b/mm/rmap.c
>> @@ -1303,31 +1303,44 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>>  }
>>  
>>  /**
>> - * page_add_file_rmap - add pte mapping to a file page
>> - * @page:	the page to add the mapping to
>> + * flio_add_file_rmap_range - add pte mapping to a sub page range of a folio
> 
> Typo -- missing 'o'.  And there's that word "sub" still ;-)
OK. 

> 
>> + * @folio:	The filio to add the mapping to
> 
> s/filio/folio/
OK.

> 
>> + * @start:	The first page index in folio
> 
> Better to describe it as the 'first page number' -- index is usually
> used as file index, ie offset within the file, rather than offset within
> the folio.
OK.

> 
>> + * @nr_pages:	The number of pages for pte mapping adding
> 
> Maybe "The number of pages which will be mapped"?
Sure.


Regards
Yin, Fengwei

> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range()
  2023-02-03 14:19   ` Kirill A. Shutemov
@ 2023-02-04  3:35     ` Yin, Fengwei
  0 siblings, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-02-04  3:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/3/2023 10:19 PM, Kirill A. Shutemov wrote:
> On Fri, Feb 03, 2023 at 09:16:34PM +0800, Yin Fengwei wrote:
>> folio_add_file_rmap_range() allows to add pte mapping to a specific
>> range of file folio. Comparing to page_add_file_rmap(),it batched
> 
> Missing space before 'it'.
Sure.

checkpatch didn't warn me about the things you and Matthew pointed
out. I will be more careful about this in the future. Thanks a lot
for the review.


Regards
Yin, Fengwei

> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v3 3/4] mm: add do_set_pte_range()
  2023-02-03 14:30       ` Chih-En Lin
@ 2023-02-04  5:47         ` Yin, Fengwei
  0 siblings, 0 replies; 21+ messages in thread
From: Yin, Fengwei @ 2023-02-04  5:47 UTC (permalink / raw)
  To: Chih-En Lin; +Cc: willy, david, linux-mm, dave.hansen, tim.c.chen, ying.huang



On 2/3/2023 10:30 PM, Chih-En Lin wrote:
> On Fri, Feb 03, 2023 at 09:38:15PM +0800, Yin, Fengwei wrote:
>>
>>
>> On 2/3/2023 9:32 PM, Chih-En Lin wrote:
>>> On Fri, Feb 03, 2023 at 09:16:35PM +0800, Yin Fengwei wrote:
>>>> do_set_pte_range() allows setting up page table entries for a
>>>> specific range. It calls folio_add_file_rmap_range() to take
>>>> advantage of the batched rmap update for large folios.
>>>>
>>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>>> ---
>>>>  include/linux/mm.h |  3 +++
>>>>  mm/filemap.c       |  1 -
>>>>  mm/memory.c        | 59 ++++++++++++++++++++++++++++++----------------
>>>>  3 files changed, 42 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index d6f8f41514cc..93192f04b276 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1162,6 +1162,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>>>  
>>>>  vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>>>  void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>>>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>>> +		unsigned long addr, pte_t *pte,
>>>> +		unsigned long start, unsigned int nr);
>>>>  
>>>>  vm_fault_t finish_fault(struct vm_fault *vmf);
>>>>  vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index f444684db9f2..74046a3a0ff5 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -3386,7 +3386,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>>>  
>>>>  		ref_count++;
>>>>  		do_set_pte(vmf, page, addr);
>>>> -		update_mmu_cache(vma, addr, vmf->pte);
>>>>  	} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>>>  
>>>>  	/* Restore the vmf->pte */
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 7a04a1130ec1..3754b2ef166a 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4257,36 +4257,58 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>>>  }
>>>>  #endif
>>>>  
>>>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>> +void do_set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>>> +		unsigned long addr, pte_t *pte,
>>>> +		unsigned long start, unsigned int nr)
>>>>  {
>>>>  	struct vm_area_struct *vma = vmf->vma;
>>>>  	bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
>>>>  	bool write = vmf->flags & FAULT_FLAG_WRITE;
>>>> +	bool cow = write && !(vma->vm_flags & VM_SHARED);
>>>
>>> Why don't use is_cow_mapping()?
>>> Is there anything messed up with VM_MAYWRITE?
>> My understanding is that it's not related to the mapping but to which
>> operation triggers the fault here. Say, if it's a writable mapping and
>> a read operation triggers the fault here, no cow or maybe_mkwrite() is
>> needed. Thanks.
> 
> Sorry, I didn't describe the thing properly.
> The relationship with the write/read fault makes sense.
> I'm just wondering whether "!(vma->vm_flags & VM_SHARED)" is enough to
> determine a COW page. And I also found the same check in do_fault().
> 
> For example, copy_present_pte() uses is_cow_mapping() for the COW case and
> "vm_flags & VM_SHARED" for the shared mapping case.
> 
> So, I looked up the commit that introduced the is_cow_mapping(),
> 67121172f9753 ("Allow arbitrary read-only shared pfn-remapping too").
> 
> Here is the commit message:
> "
>     The VM layer (for historical reasons) turns a read-only shared mmap into
>     a private-like mapping with the VM_MAYWRITE bit clear.  Thus checking
>     just VM_SHARED isn't actually sufficient.
>     
>     So use a trivial helper function for the cases where we wanted to inquire
>     if a mapping was COW-like or not.
> "
> 
> Hmm, but that commit dates back to v2.6.15.
> So is "vm_flags & VM_SHARED" enough to check for a COW page now?
Thanks for the detailed info here. Yes, the VM_MAYWRITE bit of vma->vm_flags
needs to be checked for COW as well.

In the page fault path, the VM_MAYWRITE bit was checked in sanitize_fault_flags():
                /* Write faults on read-only mappings are impossible ... */
                if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
                        return VM_FAULT_SIGSEGV;

and it bails out early if it's a write fault and VM_MAYWRITE is not set.

My understanding is that sanitize_fault_flags() is always called before we
hit do_set_pte()/do_set_pte_range().
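
For reference, is_cow_mapping() is just a trivial helper (its definition is
shown here for context; whether do_set_pte_range() should use it is left to
the next version):

	static inline bool is_cow_mapping(vm_flags_t flags)
	{
		return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
	}

	/*
	 * In this fault path a write fault on a mapping without VM_MAYWRITE
	 * has already been rejected by sanitize_fault_flags(), so once
	 * FAULT_FLAG_WRITE is set VM_MAYWRITE is guaranteed and
	 *
	 *	write && !(vma->vm_flags & VM_SHARED)
	 *
	 * evaluates the same as
	 *
	 *	write && is_cow_mapping(vma->vm_flags)
	 */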


Regards
Yin, Fengwei

> 
> Thanks,
> Chih-En Lin


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-02-04  5:47 UTC | newest]

Thread overview: 21+ messages
2023-02-03 13:16 [RFC PATCH v3 0/4] folio based filemap_map_pages() Yin Fengwei
2023-02-03 13:16 ` [RFC PATCH v3 1/4] filemap: add function filemap_map_folio_range() Yin Fengwei
2023-02-03 13:53   ` Matthew Wilcox
2023-02-04  3:25     ` Yin, Fengwei
2023-02-03 14:17   ` Kirill A. Shutemov
2023-02-04  3:31     ` Yin, Fengwei
2023-02-03 13:16 ` [RFC PATCH v3 2/4] rmap: add folio_add_file_rmap_range() Yin Fengwei
2023-02-03 14:02   ` Matthew Wilcox
2023-02-04  3:34     ` Yin, Fengwei
2023-02-03 14:19   ` Kirill A. Shutemov
2023-02-04  3:35     ` Yin, Fengwei
2023-02-03 13:16 ` [RFC PATCH v3 3/4] mm: add do_set_pte_range() Yin Fengwei
2023-02-03 13:25   ` David Hildenbrand
2023-02-03 13:30     ` Yin, Fengwei
2023-02-03 13:34       ` David Hildenbrand
2023-02-03 13:39         ` Yin, Fengwei
2023-02-03 13:32   ` Chih-En Lin
2023-02-03 13:38     ` Yin, Fengwei
2023-02-03 14:30       ` Chih-En Lin
2023-02-04  5:47         ` Yin, Fengwei
2023-02-03 13:16 ` [RFC PATCH v3 4/4] filemap: batched update mm counter,rmap when map file folio Yin Fengwei
