linux-mm.kvack.org archive mirror
* [PATCH v3] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection
@ 2022-06-14  9:36 David Hildenbrand
  2022-06-15 15:25 ` Peter Xu
  0 siblings, 1 reply; 4+ messages in thread
From: David Hildenbrand @ 2022-06-14  9:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Peter Collingbourne, Linus Torvalds,
	Andrew Morton, Nadav Amit, Dave Hansen, Andrea Arcangeli,
	Peter Xu, Yang Shi, Hugh Dickins, Mel Gorman

Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
can try mapping anonymous pages in a private writable mapping writable if
they are exclusive, the PTE is already dirty, and no special handling
applies. Mapping the anonymous page writable is essentially the same thing
the write fault handler would do in this case.

Special handling is required for uffd-wp and softdirty tracking, so take
care of that properly. Also, leave PROT_NONE handling alone for now;
in the future, we could similarly extend the logic in do_numa_page() or
use pte_mk_savedwrite() here.
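
To make the conditions above easier to follow, here is a rough userspace
model of the per-PTE decision. It is only a sketch: struct toy_vma,
struct toy_pte and toy_can_change_pte_writable() are hypothetical stand-ins
invented for illustration. They only mirror the checks described above and
are not kernel types or kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct toy_vma {
	bool vm_write;		/* models VM_WRITE */
	bool vm_shared;		/* models VM_SHARED */
	bool vm_softdirty;	/* models VM_SOFTDIRTY */
};

struct toy_pte {
	bool write;		/* models pte_write() */
	bool protnone;		/* models pte_protnone() */
	bool dirty;		/* models pte_dirty() */
	bool soft_dirty;	/* models pte_soft_dirty() */
	bool uffd_wp;		/* models userfaultfd_pte_wp() */
	bool anon_exclusive;	/* models a normal, exclusive anonymous page */
};

/* Returns true if the toy PTE could be mapped writable without a fault. */
static bool toy_can_change_pte_writable(const struct toy_vma *vma,
					const struct toy_pte *pte)
{
	/* Callers only ask for writable VMAs and not-yet-writable PTEs. */
	assert(vma->vm_write && !pte->write);

	/* PROT_NONE is left alone; clean PTEs still need the write fault. */
	if (pte->protnone || !pte->dirty)
		return false;

	/* Softdirty tracking still needs the write fault. */
	if (vma->vm_softdirty && !pte->soft_dirty)
		return false;

	/* uffd-wp tracking still needs the write fault. */
	if (pte->uffd_wp)
		return false;

	/* Private mappings: only exclusive anonymous pages qualify. */
	if (!vma->vm_shared && !pte->anon_exclusive)
		return false;

	return true;
}
```

In this model, a dirty, exclusive anonymous page in a private writable
mapping qualifies, while setting uffd_wp or clearing dirty on the same toy
PTE makes the function return false.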

While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
performance, it should also be a valuable optimization for uffd-wp, when
un-protecting.

This has been previously suggested by Peter Collingbourne in [1],
relevant in the context of the Scudo memory allocator, before we had
PageAnonExclusive.

This commit doesn't add the same handling for PMDs (i.e., anonymous THP,
anonymous hugetlb); benchmark results from Andrea indicate that there
are minor performance gains, so it might still be valuable to streamline
that logic for all anonymous pages in the future.

As we now also set MM_CP_DIRTY_ACCT for private mappings, let's rename
it to MM_CP_TRY_CHANGE_WRITABLE, to make it clearer what's actually
happening.
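
As a rough illustration of when the renamed flag ends up being set, the
following userspace sketch models the decision: shared mappings rely on
the write-notify check, private mappings only need write permission.
toy_cp_flags() and TOY_MM_CP_TRY_CHANGE_WRITABLE are hypothetical names
for this sketch, and wants_writenotify stands in for the result of
vma_wants_writenotify().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the renamed change-protection flag. */
#define TOY_MM_CP_TRY_CHANGE_WRITABLE	(1UL << 0)

/*
 * Models which cp_flags value would be passed on for a mapping.
 * wants_writenotify is an input here; in the kernel it would be
 * computed by vma_wants_writenotify().
 */
static unsigned long toy_cp_flags(bool vm_shared, bool vm_write,
				  bool wants_writenotify)
{
	bool try_change_writable;

	if (vm_shared)
		try_change_writable = wants_writenotify;
	else
		/* Private + writable always has to handle COW per PTE. */
		try_change_writable = vm_write;

	return try_change_writable ? TOY_MM_CP_TRY_CHANGE_WRITABLE : 0;
}
```

For a private mapping in this model, the flag depends only on vm_write;
the wants_writenotify input is ignored.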

Micro-benchmark courtesy of Andrea:

===
 #define _GNU_SOURCE
 #include <sys/mman.h>
 #include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
 #include <unistd.h>

 #define SIZE (1024*1024*1024)

int main(int argc, char *argv[])
{
	char *p;
	if (posix_memalign((void **)&p, sysconf(_SC_PAGESIZE)*512, SIZE))
		perror("posix_memalign"), exit(1);
	if (madvise(p, SIZE, argc > 1 ? MADV_HUGEPAGE : MADV_NOHUGEPAGE))
		perror("madvise");
	explicit_bzero(p, SIZE);
	for (int loops = 0; loops < 40; loops++) {
		if (mprotect(p, SIZE, PROT_READ))
			perror("mprotect"), exit(1);
		if (mprotect(p, SIZE, PROT_READ|PROT_WRITE))
			perror("mprotect"), exit(1);
		explicit_bzero(p, SIZE);
	}
}
===

Results on my Ryzen 9 3900X:


Stock 10 runs (lower is better):   AVG 6.398s, STDEV 0.043
Patched 10 runs (lower is better): AVG 3.780s, STDEV 0.026

===
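
The benchmark itself prints no timings, so the numbers above presumably
come from an external measurement. For in-process timing, a helper along
the following lines could be used. This is only a sketch:
time_mprotect_cycles() is a hypothetical name, it uses memset() instead of
explicit_bzero(), and it is not what was used to produce the results above.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

/*
 * Hypothetical helper, not part of the original benchmark: runs the
 * PROT_READ -> PROT_READ|PROT_WRITE cycle 'loops' times on a
 * page-aligned, writable region and returns the elapsed wall time
 * in seconds, or -1.0 if any call fails.
 */
static double time_mprotect_cycles(char *p, size_t size, int loops)
{
	struct timespec start, end;

	if (clock_gettime(CLOCK_MONOTONIC, &start))
		return -1.0;

	for (int i = 0; i < loops; i++) {
		if (mprotect(p, size, PROT_READ))
			return -1.0;
		if (mprotect(p, size, PROT_READ | PROT_WRITE))
			return -1.0;
		/* Touch every page again, as the benchmark does. */
		memset(p, 0, size);
	}

	if (clock_gettime(CLOCK_MONOTONIC, &end))
		return -1.0;

	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

The region should be written once before timing starts, so that the pages
are populated and dirty before the first mprotect() call.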

[1] https://lkml.kernel.org/r/20210429214801.2583336-1-pcc@google.com

Suggested-by: Peter Collingbourne <pcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

v2 -> v3:
* Check some VMA flags early and glue them to MM_CP_TRY_CHANGE_WRITABLE
* MM_CP_DIRTY_ACCT -> MM_CP_TRY_CHANGE_WRITABLE
* Use benchmark from Andrea and add updated results
* Add reference to Peter's patch
* Rephrase patch description + code comments

v1 -> v2:
* Rebased on v5.19-rc1
* Rerun benchmark
* Fix minor spelling issues in subject+description
* Drop IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) check
* Move pte_write() check into caller

---
 include/linux/mm.h |  8 +++--
 mm/mprotect.c      | 77 +++++++++++++++++++++++++++++++++++++---------
 2 files changed, 68 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..ae66d26dd708 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1962,8 +1962,12 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
  * for now all the callers are only use one of the flags at the same
  * time.
  */
-/* Whether we should allow dirty bit accounting */
-#define  MM_CP_DIRTY_ACCT                  (1UL << 0)
+/*
+ * Whether we should manually check if we can map individual PTEs writable,
+ * because something (e.g., COW, uffd-wp) blocks that from happening for all
+ * PTEs automatically in a writable mapping.
+ */
+#define  MM_CP_TRY_CHANGE_WRITABLE	   (1UL << 0)
 /* Whether this protection change is for NUMA hints */
 #define  MM_CP_PROT_NUMA                   (1UL << 1)
 /* Whether this change is for write protecting */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..996a97e213ad 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -38,6 +38,39 @@
 
 #include "internal.h"
 
+static inline bool can_change_pte_writable(struct vm_area_struct *vma,
+					   unsigned long addr, pte_t pte)
+{
+	struct page *page;
+
+	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
+
+	if (pte_protnone(pte) || !pte_dirty(pte))
+		return false;
+
+	/* Do we need write faults for softdirty tracking? */
+	if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
+		return false;
+
+	/* Do we need write faults for uffd-wp tracking? */
+	if (userfaultfd_pte_wp(vma, pte))
+		return false;
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		/*
+		 * We can only special-case on exclusive anonymous pages,
+		 * because we know that our write-fault handler similarly would
+		 * map them writable without any additional checks while holding
+		 * the PT lock.
+		 */
+		page = vm_normal_page(vma, addr, pte);
+		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+			return false;
+	}
+
+	return true;
+}
+
 static unsigned long change_pte_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
@@ -46,7 +79,6 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
-	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
@@ -137,21 +169,27 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_wrprotect(ptent);
 				ptent = pte_mkuffd_wp(ptent);
 			} else if (uffd_wp_resolve) {
-				/*
-				 * Leave the write bit to be handled
-				 * by PF interrupt handler, then
-				 * things like COW could be properly
-				 * handled.
-				 */
 				ptent = pte_clear_uffd_wp(ptent);
 			}
 
-			/* Avoid taking write faults for known dirty pages */
-			if (dirty_accountable && pte_dirty(ptent) &&
-					(pte_soft_dirty(ptent) ||
-					 !(vma->vm_flags & VM_SOFTDIRTY))) {
+			/*
+			 * In some writable, shared mappings, we might want
+			 * to catch actual write access -- see
+			 * vma_wants_writenotify().
+			 *
+			 * In all writable, private mappings, we have to
+			 * properly handle COW.
+			 *
+			 * In both cases, we can sometimes still change PTEs
+			 * writable and avoid the write-fault handler, for
+			 * example, if a PTE is already dirty and no other
+			 * COW or special handling is required.
+			 */
+			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
+			    !pte_write(ptent) &&
+			    can_change_pte_writable(vma, addr, ptent))
 				ptent = pte_mkwrite(ptent);
-			}
+
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_needs_flush(oldpte, ptent))
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
@@ -505,9 +543,9 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long oldflags = vma->vm_flags;
 	long nrpages = (end - start) >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	bool try_change_writable;
 	pgoff_t pgoff;
 	int error;
-	int dirty_accountable = 0;
 
 	if (newflags == oldflags) {
 		*pprev = vma;
@@ -583,11 +621,20 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * held in write mode.
 	 */
 	vma->vm_flags = newflags;
-	dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
+	/*
+	 * We want to check manually if we can change individual PTEs writable
+	 * if we can't do that automatically for all PTEs in a mapping. For
+	 * private mappings, that's always the case when we have write
+	 * permissions as we properly have to handle COW.
+	 */
+	if (vma->vm_flags & VM_SHARED)
+		try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
+	else
+		try_change_writable = !!(vma->vm_flags & VM_WRITE);
 	vma_set_page_prot(vma);
 
 	change_protection(tlb, vma, start, end, vma->vm_page_prot,
-			  dirty_accountable ? MM_CP_DIRTY_ACCT : 0);
+			  try_change_writable ? MM_CP_TRY_CHANGE_WRITABLE : 0);
 
 	/*
 	 * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major

base-commit: b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3
-- 
2.35.3




* Re: [PATCH v3] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection
  2022-06-14  9:36 [PATCH v3] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection David Hildenbrand
@ 2022-06-15 15:25 ` Peter Xu
  2022-06-15 19:52   ` David Hildenbrand
  0 siblings, 1 reply; 4+ messages in thread
From: Peter Xu @ 2022-06-15 15:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Peter Collingbourne, Linus Torvalds,
	Andrew Morton, Nadav Amit, Dave Hansen, Andrea Arcangeli,
	Yang Shi, Hugh Dickins, Mel Gorman

On Tue, Jun 14, 2022 at 11:36:29AM +0200, David Hildenbrand wrote:
> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
> can try mapping anonymous pages in a private writable mapping writable if
> they are exclusive, the PTE is already dirty, and no special handling
> applies. Mapping the anonymous page writable is essentially the same thing
> the write fault handler would do in this case.
> 
> Special handling is required for uffd-wp and softdirty tracking, so take
> care of that properly. Also, leave PROT_NONE handling alone for now;
> in the future, we could similarly extend the logic in do_numa_page() or
> use pte_mk_savedwrite() here.
> 
> While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
> performance, it should also be a valuable optimization for uffd-wp, when
> un-protecting.
> 
> This has been previously suggested by Peter Collingbourne in [1],
> relevant in the context of the Scudo memory allocator, before we had
> PageAnonExclusive.
> 
> This commit doesn't add the same handling for PMDs (i.e., anonymous THP,
> anonymous hugetlb); benchmark results from Andrea indicate that there
> are minor performance gains, so it might still be valuable to streamline
> that logic for all anonymous pages in the future.
> 
> As we now also set MM_CP_DIRTY_ACCT for private mappings, let's rename
> it to MM_CP_TRY_CHANGE_WRITABLE, to make it clearer what's actually
> happening.

I'm personally not sure why DIRTY_ACCT cannot be applied to private
mappings; it sounds not only for shared but a common thing.  I also don't
know whether "change writable" could be misread too anyway. Say, we're
never changing RO->RW mappings with this flag, but only try to unprotect
the page proactively when proper, from that POV Nadav's suggestion seems
slightly better on using "unprotect".

No strong opinion, the patch looks correct to me, and thanks for providing
the new test results,

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v3] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection
  2022-06-15 15:25 ` Peter Xu
@ 2022-06-15 19:52   ` David Hildenbrand
  2022-06-15 20:16     ` Peter Xu
  0 siblings, 1 reply; 4+ messages in thread
From: David Hildenbrand @ 2022-06-15 19:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Peter Collingbourne, Linus Torvalds,
	Andrew Morton, Nadav Amit, Dave Hansen, Andrea Arcangeli,
	Yang Shi, Hugh Dickins, Mel Gorman

On 15.06.22 17:25, Peter Xu wrote:
> On Tue, Jun 14, 2022 at 11:36:29AM +0200, David Hildenbrand wrote:
>> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
>> can try mapping anonymous pages in a private writable mapping writable if
>> they are exclusive, the PTE is already dirty, and no special handling
>> applies. Mapping the anonymous page writable is essentially the same thing
>> the write fault handler would do in this case.
>>
>> Special handling is required for uffd-wp and softdirty tracking, so take
>> care of that properly. Also, leave PROT_NONE handling alone for now;
>> in the future, we could similarly extend the logic in do_numa_page() or
>> use pte_mk_savedwrite() here.
>>
>> While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
>> performance, it should also be a valuable optimization for uffd-wp, when
>> un-protecting.
>>
>> This has been previously suggested by Peter Collingbourne in [1],
>> relevant in the context of the Scudo memory allocator, before we had
>> PageAnonExclusive.
>>
>> This commit doesn't add the same handling for PMDs (i.e., anonymous THP,
>> anonymous hugetlb); benchmark results from Andrea indicate that there
>> are minor performance gains, so it might still be valuable to streamline
>> that logic for all anonymous pages in the future.
>>
>> As we now also set MM_CP_DIRTY_ACCT for private mappings, let's rename
>> it to MM_CP_TRY_CHANGE_WRITABLE, to make it clearer what's actually
>> happening.
> 
> I'm personally not sure why DIRTY_ACCT cannot be applied to private
> mappings; it sounds not only for shared but a common thing.  I also don't

TBH, I think the name is just absolutely unclear in that context.

> know whether "change writable" could be misread too anyway. Say, we're
> never changing RO->RW mappings with this flag, but only try to unprotect
> the page proactively when proper, from that POV Nadav's suggestion seems
> slightly better on using "unprotect".

write unprotection is a change from RO->RW, so I don't immediately see
the difference.

Anyhow, I don't like the sound of TRY_WRITE_UNPROTECT.

I made it match the function name that I had:

MM_CP_TRY_CHANGE_WRITABLE
-> !pte_write()?
 -> can_change_pte_writable() ?
  ->pte_mkwrite()

Maybe MM_CP_TRY_MAKE_WRITABLE / MM_CP_TRY_MAKE_PTE_WRITABLE is clearer?

Open for suggestions because I'm apparently not the best at naming
things either.

> 
> No strong opinion, the patch looks correct to me, and thanks for providing
> the new test results,
> 
> Acked-by: Peter Xu <peterx@redhat.com>
> 

Thanks Peter!

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection
  2022-06-15 19:52   ` David Hildenbrand
@ 2022-06-15 20:16     ` Peter Xu
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Xu @ 2022-06-15 20:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Peter Collingbourne, Linus Torvalds,
	Andrew Morton, Nadav Amit, Dave Hansen, Andrea Arcangeli,
	Yang Shi, Hugh Dickins, Mel Gorman

On Wed, Jun 15, 2022 at 09:52:11PM +0200, David Hildenbrand wrote:
> write unprotection is a change from RO->RW, so I don't immediately see
> the difference.

In my view "unprotect a pte" is only a subset of "grant pte write
permission", since: "unprotect" has a prerequisite that it used to be
"protected" so that's why we can unprotect. I.e., in mm terms, that's only
when VM_WRITE is set.

So basically it is a hint that we're only working on VM_WRITE regions,
where I thought "unprotect" was slightly better.

> 
> Anyhow, I don't like the sound of TRY_WRITE_UNPROTECT.
> 
> I made it match the function name that I had:
> 
> MM_CP_TRY_CHANGE_WRITABLE
> -> !pte_write()?
>  -> can_change_pte_writable() ?
>   ->pte_mkwrite()
> 
> Maybe MM_CP_TRY_MAKE_WRITABLE / MM_CP_TRY_MAKE_PTE_WRITABLE is clearer?
> 
> Open for suggestions because I'm apparently not the best at naming
> things either.

Me neither.  I don't have a strong opinion anyway, and frankly indeed the
old naming is not great either to me.  Maybe there's better thoughts.

Thanks,

-- 
Peter Xu



