linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	linux-mm <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>, Yu Zhao <yuzhao@google.com>,
	Andy Lutomirski <luto@kernel.org>, Peter Xu <peterx@redhat.com>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Mon, 4 Jan 2021 16:01:37 -0500	[thread overview]
Message-ID: <X/OCMalFYnDdGnch@redhat.com> (raw)
In-Reply-To: <73EE9007-65AF-4416-9930-D992C74447A9@vmware.com>

On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > 
> > On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >>>> 
> >>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>>>> 
> >>>>> cpu0				cpu1			cpu2
> >>>>> ----				----			----
> >>>>> 							[ Writable PTE
> >>>>> 							  cached in TLB ]
> >>>>> userfaultfd_writeprotect()
> >>>>> [ write-*unprotect* ]
> >>>>> mwriteprotect_range()
> >>>>> mmap_read_lock()
> >>>>> change_protection()
> >>>>> 
> >>>>> change_protection_range()
> >>>>> ...
> >>>>> change_pte_range()
> >>>>> [ *clear* “write”-bit ]
> >>>>> [ defer TLB flushes ]
> >>>>> 				[ page-fault ]
> >>>>> 				...
> >>>>> 				wp_page_copy()
> >>>>> 				 cow_user_page()
> >>>>> 				  [ copy page ]
> >>>>> 							[ write to old
> >>>>> 							  page ]
> >>>>> 				...
> >>>>> 				 set_pte_at_notify()
> >>>> 
> >>>> Yuck!
> >>> 
> >>> Note, the above was posted before we figured out the details so it
> >>> wasn't showing the real deferred tlb flush that caused problems (the
> >>> one showed on the left causes zero issues).
> >> 
> >> Actually it was posted after (note that this is v2). The aforementioned
> >> scenario that Peter regards to is the one that I actually encountered (not
> >> the second scenario that is “theoretical”). This scenario that Peter regards
> >> is indeed more “stupid” in the sense that we should just not write-protect
> >> the PTE on userfaultfd write-unprotect.
> >> 
> >> Let me know if I made any mistake in the description.
> > 
> > I didn't say there is a mistake. I said it is not showing the real
> > deferred tlb flush that cause problems.
> > 
> > The issue here is that we have a "defer tlb flush" that runs after
> > "write to old page".
> > 
> > If you look at the above, you're induced to think the "defer tlb
> > flush" that causes issues is the one in cpu0. It's not. That is
> > totally harmless.
> 
> I do not understand what you say. The deferred TLB flush on cpu0 *is* the
> the one that causes the problem. The PTE is write-protected (although it is
> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
> the page-fault (and copy), while cpu2 keeps writing to the source page. If
> cpu0 did not defer the TLB flush, this problem would not happen.

Your argument "If cpu0 did not defer the TLB flush, this problem would
not happen" is identical to "if the cpu0 has a small TLB size and the
tlb entry is recycled, the problem would not happen".

There are a multitude of factors that are unrelated to the real
problematic deferred tlb flush that may happen and still won't cause
the issue, including a parallel IPI.

The point is that we don't need to worry about the "defer TLB flushes"
of the un-wrprotect, because you said earlier that deferring tlb
flushes when you're doing "permission promotions" does not cause
problems.

The only "deferred tlb flush" we need to worry about, not in the
picture, is the one following the actual permission removal (the
wrprotection).

> it shows the write that triggers the corruption instead of discussing
> “windows”, which might be less clear. Running copy_user_page() with stale

I think showing exactly where the race window opens is key to
understand the issue, but then that's the way I work and feel free to
think it in any other way that may sound simpler.

I just worried people thinks the deferred tlb flush in your v2 trace
is the one that causes problem when obviously it's not since it
follows a permission promotion. Once that is clear, feel free to
reject my trace.

All I care about is that performance don't regress from CPU-speed to
disk I/O spindle speed, for soft dirty and uffd-wp.

> I am not married to my description and if you (and others) insist I would
> copy-paste your version.

I definitely don't insist, I only wanted to clarify in case it may not
have been clear the problematic deferred tlb flush wasn't part of your
trace.

> Are you talking about the first scenario (write-unprotect), the second one
> (write-protect followed by write-unprotect), both? It seems to me that all
> the deferred TLB flushes are mentioned at the point they are deferred. I can
> add the point in which they are performed.

The only case that has an issue for uffd-wp is in my trace and only
ever happens if there's a wrprotect in flight, the deferred tlb flush
of the wrprotect is deferred (and that's the problematic one that
closes the window when it finally runs) and un-wrprotect runs. The
window opens when the un-wrprotect unlocks the PT lock. The deferred
tlb flush of un-wrprotect is as relevant for this race, as random tlb
flushes from IPI or the TLB being small or none.

Thanks,
Andrea


  reply	other threads:[~2021-01-04 21:03 UTC|newest]

Thread overview: 96+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22   ` Peter Zijlstra
2021-01-04 19:24     ` Andrea Arcangeli
2021-01-04 19:35       ` Nadav Amit
2021-01-04 20:19         ` Andrea Arcangeli
2021-01-04 20:39           ` Nadav Amit
2021-01-04 21:01             ` Andrea Arcangeli [this message]
2021-01-04 21:26               ` Nadav Amit
2021-01-05 18:45                 ` Andrea Arcangeli
2021-01-05 19:05                   ` Nadav Amit
2021-01-05 19:45                     ` Andrea Arcangeli
2021-01-05 20:06                       ` Nadav Amit
2021-01-05 21:06                         ` Andrea Arcangeli
2021-01-05 21:43                           ` Peter Xu
2021-01-05  8:13       ` Peter Zijlstra
2021-01-05  8:52         ` Nadav Amit
2021-01-05 14:26           ` Peter Zijlstra
2021-01-05  8:58       ` Peter Zijlstra
2021-01-05  9:22         ` Nadav Amit
2021-01-05 17:58         ` Andrea Arcangeli
2021-01-05 15:08   ` Peter Xu
2021-01-05 18:08     ` Andrea Arcangeli
2021-01-05 18:41       ` Peter Xu
2021-01-05 18:55         ` Andrea Arcangeli
2021-01-05 19:07     ` Nadav Amit
2021-01-05 19:43       ` Peter Xu
2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
2021-01-05 15:08   ` Will Deacon
2021-01-05 18:20   ` Andrea Arcangeli
2021-01-05 19:26     ` Nadav Amit
2021-01-05 20:39       ` Andrea Arcangeli
2021-01-05 21:20         ` Yu Zhao
2021-01-05 21:22         ` Nadav Amit
2021-01-05 22:16           ` Will Deacon
2021-01-06  0:29             ` Andrea Arcangeli
2021-01-06  0:02           ` Andrea Arcangeli
2021-01-07 20:04           ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17               ` Linus Torvalds
2021-01-07 20:25                 ` Linus Torvalds
2021-01-07 20:58                 ` Andrea Arcangeli
2021-01-07 21:29                   ` Linus Torvalds
2021-01-07 21:53                     ` John Hubbard
2021-01-07 22:00                       ` Linus Torvalds
2021-01-07 22:14                         ` John Hubbard
2021-01-07 22:20                           ` Linus Torvalds
2021-01-07 22:24                             ` Linus Torvalds
2021-01-07 22:37                               ` John Hubbard
2021-01-15 11:27                       ` Jan Kara
2021-01-07 22:31                     ` Andrea Arcangeli
2021-01-07 22:42                       ` Linus Torvalds
2021-01-07 22:51                         ` Linus Torvalds
2021-01-07 23:48                           ` Andrea Arcangeli
2021-01-08  0:25                             ` Linus Torvalds
2021-01-08 12:48                               ` Will Deacon
2021-01-08 16:14                                 ` Andrea Arcangeli
2021-01-08 17:39                                   ` Linus Torvalds
2021-01-08 17:53                                     ` Andrea Arcangeli
2021-01-08 19:25                                       ` Linus Torvalds
2021-01-09  0:12                                         ` Andrea Arcangeli
2021-01-08 17:30                                 ` Linus Torvalds
2021-01-07 23:28                         ` Andrea Arcangeli
2021-01-07 21:36               ` kernel test robot
2021-01-07 20:25             ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32               ` Linus Torvalds
2021-01-07 21:05                 ` Linus Torvalds
2021-01-07 22:02                   ` Andrea Arcangeli
2021-01-07 22:17                     ` Linus Torvalds
2021-01-07 22:56                       ` Andrea Arcangeli
2021-01-09 19:32                   ` Matthew Wilcox
2021-01-09 19:46                     ` Linus Torvalds
2021-01-15 14:30                       ` Jan Kara
2021-01-07 21:54                 ` Andrea Arcangeli
2021-01-07 21:45               ` Andrea Arcangeli
2021-01-08 13:36                 ` Jason Gunthorpe
2021-01-08 17:00                   ` Andrea Arcangeli
2021-01-08 18:19                     ` Jason Gunthorpe
2021-01-08 18:31                       ` Andy Lutomirski
2021-01-08 18:38                         ` Linus Torvalds
2021-01-08 23:34                         ` Andrea Arcangeli
2021-01-09 19:03                           ` Andy Lutomirski
2021-01-09 19:15                             ` Linus Torvalds
2021-01-08 18:59                       ` Linus Torvalds
2021-01-08 22:43                       ` Andrea Arcangeli
2021-01-09  0:42                         ` Jason Gunthorpe
2021-01-09  2:50                           ` Andrea Arcangeli
2021-01-11 14:30                             ` Jason Gunthorpe
2021-01-13 21:56                           ` Jerome Glisse
2021-01-13 23:39                             ` Jason Gunthorpe
2021-01-14  2:35                               ` Jerome Glisse
     [not found]                     ` <20210109034958.6928-1-hdanton@sina.com>
2021-01-11 14:39                       ` Jason Gunthorpe
2021-01-05 21:55         ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
2021-03-02 22:14   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X/OCMalFYnDdGnch@redhat.com \
    --to=aarcange@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=namit@vmware.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=will@kernel.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).