stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RESEND v3] mm/userfaultfd: fix memory corruption due to writeprotect
@ 2021-03-03  9:57 Nadav Amit
  2021-03-03 19:03 ` Peter Xu
  0 siblings, 1 reply; 3+ messages in thread
From: Nadav Amit @ 2021-03-03  9:57 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Nadav Amit, Andrea Arcangeli,
	Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz,
	Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, stable,
	Yu Zhao

From: Nadav Amit <namit@vmware.com>

Userfaultfd self-test fails occasionally, indicating a memory
corruption.

Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range() and defers
flushes, and since there is insufficient consideration of concurrent
deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from
the TLBs in wp_page_copy(), this flush takes place after the copy has
already been performed, and therefore changes of the page are possible
between the time of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses
a problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userrfaultfd code might actually cause a demotion of the architectural
PTE permission: when userfaultfd_writeprotect() unprotects memory
region, it unintentionally *clears* the RW-bit if it was already set.
Note that this unprotecting a PTE that is not write-protected is a valid
use-case: the userfaultfd monitor might ask to unprotect a region that
holds both write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0				cpu1			cpu2
----				----			----
							[ Writable PTE
							  cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
				[ page-fault ]
				...
				wp_page_copy()
				 cow_user_page()
				  [ copy page ]
							[ write to old
							  page ]
				...
				 set_pte_at_notify()

A similar scenario can happen:

cpu0		cpu1		cpu2		cpu3
----		----		----		----
						[ Writable PTE
				  		  cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
		userfaultfd_writeprotect()
		[ write-unprotect ]
		[ deferred TLB flush]
				[ page-fault ]
				wp_page_copy()
				 cow_user_page()
				 [ copy page ]
				 ...		[ write to page ]
				set_pte_at_notify()

This race exists since commit 292924b26024 ("userfaultfd: wp: apply
_PAGE_UFFD_WP bit"). Yet, as Yu Zhao pointed, these races became
apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification")
which made wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

To resolve the aforementioned races, check whether there are pending
flushes on uffd-write-protected VMAs, and if there are, perform a flush
before doing the COW.

Further optimizations will follow to avoid during uffd-write-unprotect
unnecassary PTE write-protection and TLB flushes.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Suggested-by: Yu Zhao <yuzhao@google.com>
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit <namit@vmware.com>

---
v2->v3:
* Do not acquire mmap_lock for write, flush conditionally instead [Yu]
* Change the fixes tag to the patch that made the race apparent [Yu]
* Removing patch to avoid write-protect on uffd unprotect. More
  comprehensive solution to follow (and avoid the TLB flush as well).
---
 mm/memory.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 9e8576a83147..06da04f98936 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3092,6 +3092,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
 
+	/*
+	 * Userfaultfd write-protect can defer flushes. Ensure the TLB
+	 * is flushed in this case before copying.
+	 */
+	if (userfaultfd_wp(vmf->vma) && mm_tlb_flush_pending(vmf->vma->vm_mm))
+		flush_tlb_page(vmf->vma, vmf->address);
+
 	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
 	if (!vmf->page) {
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH RESEND v3] mm/userfaultfd: fix memory corruption due to writeprotect
  2021-03-03  9:57 [PATCH RESEND v3] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
@ 2021-03-03 19:03 ` Peter Xu
  2021-03-03 23:22   ` Nadav Amit
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Xu @ 2021-03-03 19:03 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Nadav Amit,
	Andrea Arcangeli, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz,
	Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, stable,
	Yu Zhao

On Wed, Mar 03, 2021 at 01:57:02AM -0800, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Userfaultfd self-test fails occasionally, indicating a memory
> corruption.

It's failing very constantly now for me after I got it run on a 40 cores
system...  While indeed not easy to fail on my laptop.

[...]

> Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
> Signed-off-by: Nadav Amit <namit@vmware.com>
> 
> ---
> v2->v3:
> * Do not acquire mmap_lock for write, flush conditionally instead [Yu]
> * Change the fixes tag to the patch that made the race apparent [Yu]

Did you forget about this one?  It would still be good to point to 09854ba94c6a
just to show that 5.7/5.8 stable branches shouldn't need this patch as they're
not prone to the tlb data curruption.  Maybe also cc stable with 5.9+?

> * Removing patch to avoid write-protect on uffd unprotect. More
>   comprehensive solution to follow (and avoid the TLB flush as well).
> ---
>  mm/memory.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 9e8576a83147..06da04f98936 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3092,6 +3092,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  		return handle_userfault(vmf, VM_UFFD_WP);
>  	}
>  
> +	/*
> +	 * Userfaultfd write-protect can defer flushes. Ensure the TLB
> +	 * is flushed in this case before copying.
> +	 */
> +	if (userfaultfd_wp(vmf->vma) && mm_tlb_flush_pending(vmf->vma->vm_mm))
> +		flush_tlb_page(vmf->vma, vmf->address);
> +
>  	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>  	if (!vmf->page) {
>  		/*
> -- 
> 2.25.1
> 

Thanks for being consistent on fixing this problem.

Maybe it's even better to put that into a "unlikely" to reduce the affect of
normal do_wp_page as much as possible?  But I'll leave it to others.

If with the fixes tag modified:

Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH RESEND v3] mm/userfaultfd: fix memory corruption due to writeprotect
  2021-03-03 19:03 ` Peter Xu
@ 2021-03-03 23:22   ` Nadav Amit
  0 siblings, 0 replies; 3+ messages in thread
From: Nadav Amit @ 2021-03-03 23:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux-MM, LKML, Andrew Morton, Andrea Arcangeli, Andy Lutomirski,
	Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim,
	Will Deacon, Peter Zijlstra, stable, Yu Zhao

[-- Attachment #1: Type: text/plain, Size: 2424 bytes --]



> On Mar 3, 2021, at 11:03 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Mar 03, 2021 at 01:57:02AM -0800, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Userfaultfd self-test fails occasionally, indicating a memory
>> corruption.
> 
> It's failing very constantly now for me after I got it run on a 40 cores
> system...  While indeed not easy to fail on my laptop.
> 

It fails rather constantly for me, but since nobody else reproduced it,
I was afraid to say otherwise ;-)

> 
>> Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> 
>> ---
>> v2->v3:
>> * Do not acquire mmap_lock for write, flush conditionally instead [Yu]
>> * Change the fixes tag to the patch that made the race apparent [Yu]
> 
> Did you forget about this one?  It would still be good to point to 09854ba94c6a
> just to show that 5.7/5.8 stable branches shouldn't need this patch as they're
> not prone to the tlb data curruption.  Maybe also cc stable with 5.9+?

The fixes tag is wrong, as you say. I will fix it and cc stable with 5.9+.

> 
>> * Removing patch to avoid write-protect on uffd unprotect. More
>>  comprehensive solution to follow (and avoid the TLB flush as well).
>> ---
>> mm/memory.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>> 
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 9e8576a83147..06da04f98936 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3092,6 +3092,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>> 		return handle_userfault(vmf, VM_UFFD_WP);
>> 	}
>> 
>> +	/*
>> +	 * Userfaultfd write-protect can defer flushes. Ensure the TLB
>> +	 * is flushed in this case before copying.
>> +	 */
>> +	if (userfaultfd_wp(vmf->vma) && mm_tlb_flush_pending(vmf->vma->vm_mm))
>> +		flush_tlb_page(vmf->vma, vmf->address);
>> +
>> 	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
>> 	if (!vmf->page) {
>> 		/*
>> --
>> 2.25.1
>> 
> 
> Thanks for being consistent on fixing this problem.
> 
> Maybe it's even better to put that into a "unlikely" to reduce the affect of
> normal do_wp_page as much as possible?  But I'll leave it to others.
> 
> If with the fixes tag modified:
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Tested-by: Peter Xu <peterx@redhat.com>

Thanks, I will send v4 later today.

Regards,
Nadav


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-03-04  0:49 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-03  9:57 [PATCH RESEND v3] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-03-03 19:03 ` Peter Xu
2021-03-03 23:22   ` Nadav Amit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).