linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Nadav Amit <namit@vmware.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	linux-mm <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>, Yu Zhao <yuzhao@google.com>,
	Andy Lutomirski <luto@kernel.org>, Peter Xu <peterx@redhat.com>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Mon, 4 Jan 2021 21:26:33 +0000	[thread overview]
Message-ID: <2844ACC1-8908-494C-B411-3C69B27A1730@vmware.com> (raw)
In-Reply-To: <X/OCMalFYnDdGnch@redhat.com>

> On Jan 4, 2021, at 1:01 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>> 
>>> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>>>> 
>>>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>>>>> 
>>>>>>> cpu0				cpu1			cpu2
>>>>>>> ----				----			----
>>>>>>> 							[ Writable PTE
>>>>>>> 							  cached in TLB ]
>>>>>>> userfaultfd_writeprotect()
>>>>>>> [ write-*unprotect* ]
>>>>>>> mwriteprotect_range()
>>>>>>> mmap_read_lock()
>>>>>>> change_protection()
>>>>>>> 
>>>>>>> change_protection_range()
>>>>>>> ...
>>>>>>> change_pte_range()
>>>>>>> [ *clear* “write”-bit ]
>>>>>>> [ defer TLB flushes ]
>>>>>>> 				[ page-fault ]
>>>>>>> 				...
>>>>>>> 				wp_page_copy()
>>>>>>> 				 cow_user_page()
>>>>>>> 				  [ copy page ]
>>>>>>> 							[ write to old
>>>>>>> 							  page ]
>>>>>>> 				...
>>>>>>> 				 set_pte_at_notify()
>>>>>> 
>>>>>> Yuck!
>>>>> 
>>>>> Note, the above was posted before we figured out the details so it
>>>>> wasn't showing the real deferred tlb flush that caused problems (the
>>>>> one showed on the left causes zero issues).
>>>> 
>>>> Actually it was posted after (note that this is v2). The aforementioned
>>>> scenario that Peter regards to is the one that I actually encountered (not
>>>> the second scenario that is “theoretical”). This scenario that Peter regards
>>>> is indeed more “stupid” in the sense that we should just not write-protect
>>>> the PTE on userfaultfd write-unprotect.
>>>> 
>>>> Let me know if I made any mistake in the description.
>>> 
>>> I didn't say there is a mistake. I said it is not showing the real
>>> deferred tlb flush that cause problems.
>>> 
>>> The issue here is that we have a "defer tlb flush" that runs after
>>> "write to old page".
>>> 
>>> If you look at the above, you're induced to think the "defer tlb
>>> flush" that causes issues is the one in cpu0. It's not. That is
>>> totally harmless.
>> 
>> I do not understand what you say. The deferred TLB flush on cpu0 *is* the
>> the one that causes the problem. The PTE is write-protected (although it is
>> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
>> the page-fault (and copy), while cpu2 keeps writing to the source page. If
>> cpu0 did not defer the TLB flush, this problem would not happen.
> 
> Your argument "If cpu0 did not defer the TLB flush, this problem would
> not happen" is identical to "if the cpu0 has a small TLB size and the
> tlb entry is recycled, the problem would not happen".
> 
> There are a multitude of factors that are unrelated to the real
> problematic deferred tlb flush that may happen and still won't cause
> the issue, including a parallel IPI.
> 
> The point is that we don't need to worry about the "defer TLB flushes"
> of the un-wrprotect, because you said earlier that deferring tlb
> flushes when you're doing "permission promotions" does not cause
> problems.
> 
> The only "deferred tlb flush" we need to worry about, not in the
> picture, is the one following the actual permission removal (the
> wrprotection).

I think you are missing the point of this scenario, which is different than
the second scenario.

In this scenario, change_pte_range(), when called to do userfaultfd’s
*unprotect* operation, did not preserve the write-bit if it was already set.
Instead change_pte_range() *cleared* the write-bit. So upon a logical
permission promotion operation - userfaultfd *unprotect* - you got a
physical permission demotion, turning RW PTEs into RO.

This problem is fully resolved by this part of the patch:

--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
		oldpte = *pte;
		if (pte_present(oldpte)) {
			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);

You can argue that this not directly related to the deferred TLB flush, as
once this chunk is added, a TLB flush would not be needed at all for
userfaultfd-unprotect. But I consider it a part of the problem, especially
since this is what triggered the userfaultfd self-tests to fail.

>> it shows the write that triggers the corruption instead of discussing
>> “windows”, which might be less clear. Running copy_user_page() with stale
> 
> I think showing exactly where the race window opens is key to
> understand the issue, but then that's the way I work and feel free to
> think it in any other way that may sound simpler.
> 
> I just worried people thinks the deferred tlb flush in your v2 trace
> is the one that causes problem when obviously it's not since it
> follows a permission promotion. Once that is clear, feel free to
> reject my trace.
> 
> All I care about is that performance don't regress from CPU-speed to
> disk I/O spindle speed, for soft dirty and uffd-wp.

I would feel more comfortable if you provide patches for uffd-wp. If you
want, I will do it, but I restate that I do not feel comfortable with this
solution (worried as it seems a bit ad-hoc and might leave out a scenario
we all missed or cause a TLB shootdown storm).

As for soft-dirty, I thought that you said that you do not see a better
(backportable) solution for soft-dirty. Correct me if I am wrong.

Anyhow, I will add your comments regarding the stale TLB window to make the
description clearer.


  reply	other threads:[~2021-01-04 21:27 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22   ` Peter Zijlstra
2021-01-04 19:24     ` Andrea Arcangeli
2021-01-04 19:35       ` Nadav Amit
2021-01-04 20:19         ` Andrea Arcangeli
2021-01-04 20:39           ` Nadav Amit
2021-01-04 21:01             ` Andrea Arcangeli
2021-01-04 21:26               ` Nadav Amit [this message]
2021-01-05 18:45                 ` Andrea Arcangeli
2021-01-05 19:05                   ` Nadav Amit
2021-01-05 19:45                     ` Andrea Arcangeli
2021-01-05 20:06                       ` Nadav Amit
2021-01-05 21:06                         ` Andrea Arcangeli
2021-01-05 21:43                           ` Peter Xu
2021-01-05  8:13       ` Peter Zijlstra
2021-01-05  8:52         ` Nadav Amit
2021-01-05 14:26           ` Peter Zijlstra
2021-01-05  8:58       ` Peter Zijlstra
2021-01-05  9:22         ` Nadav Amit
2021-01-05 17:58         ` Andrea Arcangeli
2021-01-05 15:08   ` Peter Xu
2021-01-05 18:08     ` Andrea Arcangeli
2021-01-05 18:41       ` Peter Xu
2021-01-05 18:55         ` Andrea Arcangeli
2021-01-05 19:07     ` Nadav Amit
2021-01-05 19:43       ` Peter Xu
2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
2021-01-05 15:08   ` Will Deacon
2021-01-05 18:20   ` Andrea Arcangeli
2021-01-05 19:26     ` Nadav Amit
2021-01-05 20:39       ` Andrea Arcangeli
2021-01-05 21:20         ` Yu Zhao
2021-01-05 21:22         ` Nadav Amit
2021-01-05 22:16           ` Will Deacon
2021-01-06  0:29             ` Andrea Arcangeli
2021-01-06  0:02           ` Andrea Arcangeli
2021-01-07 20:04           ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17               ` Linus Torvalds
2021-01-07 20:25                 ` Linus Torvalds
2021-01-07 20:58                 ` Andrea Arcangeli
2021-01-07 21:29                   ` Linus Torvalds
2021-01-07 21:53                     ` John Hubbard
2021-01-07 22:00                       ` Linus Torvalds
2021-01-07 22:14                         ` John Hubbard
2021-01-07 22:20                           ` Linus Torvalds
2021-01-07 22:24                             ` Linus Torvalds
2021-01-07 22:37                               ` John Hubbard
2021-01-15 11:27                       ` Jan Kara
2021-01-07 22:31                     ` Andrea Arcangeli
2021-01-07 22:42                       ` Linus Torvalds
2021-01-07 22:51                         ` Linus Torvalds
2021-01-07 23:48                           ` Andrea Arcangeli
2021-01-08  0:25                             ` Linus Torvalds
2021-01-08 12:48                               ` Will Deacon
2021-01-08 16:14                                 ` Andrea Arcangeli
2021-01-08 17:39                                   ` Linus Torvalds
2021-01-08 17:53                                     ` Andrea Arcangeli
2021-01-08 19:25                                       ` Linus Torvalds
2021-01-09  0:12                                         ` Andrea Arcangeli
2021-01-08 17:30                                 ` Linus Torvalds
2021-01-07 23:28                         ` Andrea Arcangeli
2021-01-07 21:36               ` kernel test robot
2021-01-07 20:25             ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32               ` Linus Torvalds
2021-01-07 21:05                 ` Linus Torvalds
2021-01-07 22:02                   ` Andrea Arcangeli
2021-01-07 22:17                     ` Linus Torvalds
2021-01-07 22:56                       ` Andrea Arcangeli
2021-01-09 19:32                   ` Matthew Wilcox
2021-01-09 19:46                     ` Linus Torvalds
2021-01-15 14:30                       ` Jan Kara
2021-01-07 21:54                 ` Andrea Arcangeli
2021-01-07 21:45               ` Andrea Arcangeli
2021-01-08 13:36                 ` Jason Gunthorpe
2021-01-08 17:00                   ` Andrea Arcangeli
2021-01-08 18:19                     ` Jason Gunthorpe
2021-01-08 18:31                       ` Andy Lutomirski
2021-01-08 18:38                         ` Linus Torvalds
2021-01-08 23:34                         ` Andrea Arcangeli
2021-01-09 19:03                           ` Andy Lutomirski
2021-01-09 19:15                             ` Linus Torvalds
2021-01-08 18:59                       ` Linus Torvalds
2021-01-08 22:43                       ` Andrea Arcangeli
2021-01-09  0:42                         ` Jason Gunthorpe
2021-01-09  2:50                           ` Andrea Arcangeli
2021-01-11 14:30                             ` Jason Gunthorpe
2021-01-13 21:56                           ` Jerome Glisse
2021-01-13 23:39                             ` Jason Gunthorpe
2021-01-14  2:35                               ` Jerome Glisse
     [not found]                     ` <20210109034958.6928-1-hdanton@sina.com>
2021-01-11 14:39                       ` Jason Gunthorpe
2021-01-05 21:55         ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
2021-03-02 22:14   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2844ACC1-8908-494C-B411-3C69B27A1730@vmware.com \
    --to=namit@vmware.com \
    --cc=aarcange@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=will@kernel.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).