All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yu Zhao <yuzhao@google.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	linux-mm <linux-mm@kvack.org>, Peter Xu <peterx@redhat.com>,
	lkml <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	stable@vger.kernel.org, minchan@kernel.org,
	Andy Lutomirski <luto@kernel.org>, Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Mon, 21 Dec 2020 00:29:53 -0700	[thread overview]
Message-ID: <X+BO8VLTN8BYLN80@google.com> (raw)
In-Reply-To: <7EB8560C-620A-433D-933C-996D7E4F2CA1@gmail.com>

On Sun, Dec 20, 2020 at 09:39:06PM -0800, Nadav Amit wrote:
> > On Dec 20, 2020, at 9:25 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> > 
> >> On Dec 20, 2020, at 9:12 PM, Yu Zhao <yuzhao@google.com> wrote:
> >> 
> >> On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote:
> >>>> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>>> 
> >>>> On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> >>>>>> On Dec 19, 2020, at 1:34 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> >>>>>> 
> >>>>>> [ cc’ing some more people who have experience with similar problems ]
> >>>>>> 
> >>>>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>>>>>> 
> >>>>>>> Hello,
> >>>>>>> 
> >>>>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>>>>>>> Analyzing this problem indicates that there is a real bug since
> >>>>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >>>>>>> 
> >>>>>>> Never having to take the mmap_sem for writing, and in turn never
> >>>>>>> blocking, in order to modify the pagetables is quite an important
> >>>>>>> feature in uffd that justifies uffd instead of mprotect. It's not the
> >>>>>>> most important reason to use uffd, but it'd be nice if that guarantee
> >>>>>>> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >>>>>>> other pgtable manipulations.
> >>>>>>> 
> >>>>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>>>>>> 
> >>>>>>>> cpu0				cpu1
> >>>>>>>> ----				----
> >>>>>>>> userfaultfd_writeprotect()
> >>>>>>>> [ write-protecting ]
> >>>>>>>> mwriteprotect_range()
> >>>>>>>> mmap_read_lock()
> >>>>>>>> change_protection()
> >>>>>>>> change_protection_range()
> >>>>>>>> ...
> >>>>>>>> change_pte_range()
> >>>>>>>> [ defer TLB flushes]
> >>>>>>>> 				userfaultfd_writeprotect()
> >>>>>>>> 				 mmap_read_lock()
> >>>>>>>> 				 change_protection()
> >>>>>>>> 				 [ write-unprotect ]
> >>>>>>>> 				 ...
> >>>>>>>> 				  [ unprotect PTE logically ]
> >>>> 
> >>>> Is the uffd selftest failing with upstream or after your kernel
> >>>> modification that removes the tlb flush from unprotect?
> >>> 
> >>> Please see my reply to Yu. I was wrong in this analysis, and I sent a
> >>> correction to my analysis. The problem actually happens when
> >>> userfaultfd_writeprotect() unprotects the memory.
> >>> 
> >>>> } else if (uffd_wp_resolve) {
> >>>> 				/*
> >>>> 				 * Leave the write bit to be handled
> >>>> 				 * by PF interrupt handler, then
> >>>> 				 * things like COW could be properly
> >>>> 				 * handled.
> >>>> 				 */
> >>>> 				ptent = pte_clear_uffd_wp(ptent);
> >>>> 			}
> >>>> 
> >>>> Upstraem this will still do pages++, there's a tlb flush before
> >>>> change_protection can return here, so I'm confused.
> >>> 
> >>> You are correct. The problem I encountered with userfaultfd_writeprotect()
> >>> is during unprotecting path.
> >>> 
> >>> Having said that, I think that there are additional scenarios that are
> >>> problematic. Consider for instance madvise_dontneed_free() that is racing
> >>> with userfaultfd_writeprotect(). If madvise_dontneed_free() completed
> >>> removing the PTEs, but still did not flush, change_pte_range() will see
> >>> non-present PTEs, say a flush is not needed, and then
> >>> change_protection_range() will not do a flush, and return while
> >>> the memory is still not protected.
> >>> 
> >>>> I don't share your concern. What matters is the PT lock, so it
> >>>> wouldn't be one per pte, but a least an order 9 higher, but let's
> >>>> assume one flush per pte.
> >>>> 
> >>>> It's either huge mapping and then it's likely running without other
> >>>> tlb flushing in background (postcopy snapshotting), or it's a granular
> >>>> protect with distributed shared memory in which case the number of
> >>>> changd ptes or huge_pmds tends to be always 1 anyway. So it doesn't
> >>>> matter if it's deferred.
> >>>> 
> >>>> I agree it may require a larger tlb flush review not just mprotect
> >>>> though, but it didn't sound particularly complex. Note the
> >>>> UFFDIO_WRITEPROTECT is still relatively recent so backports won't
> >>>> risk to reject so heavy as to require a band-aid.
> >>>> 
> >>>> My second thought is, I don't see exactly the bug and it's not clear
> >>>> if it's upstream reproducing this, but assuming this happens on
> >>>> upstream, even ignoring everything else happening in the tlb flush
> >>>> code, this sounds like purely introduced by userfaultfd_writeprotect()
> >>>> vs userfaultfd_writeprotect() (since it's the only place changing
> >>>> protection with mmap_sem for reading and note we already unmap and
> >>>> flush tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE clears
> >>>> the dirty bit etc..). Flushing tlbs with mmap_sem for reading is
> >>>> nothing new, the only new thing is the flush after wrprotect.
> >>>> 
> >>>> So instead of altering any tlb flush code, would it be possible to
> >>>> just stick to mmap_lock for reading and then serialize
> >>>> userfaultfd_writeprotect() against itself with an additional
> >>>> mm->mmap_wprotect_lock mutex? That'd be a very local change to
> >>>> userfaultfd too.
> >>>> 
> >>>> Can you look if the rule mmap_sem for reading plus a new
> >>>> mm->mmap_wprotect_lock mutex or the mmap_sem for writing, whenever
> >>>> wrprotecting ptes, is enough to comply with the current tlb flushing
> >>>> code, so not to require any change non local to uffd (modulo the
> >>>> additional mutex).
> >>> 
> >>> So I did not fully understand your solution, but I took your point and
> >>> looked again on similar cases. To be fair, despite my experience with these
> >>> deferred TLB flushes as well as Peter Zijlstra’s great documentation, I keep
> >>> getting confused (e.g., can’t we somehow combine tlb_flush_batched and
> >>> tlb_flush_pending ?)
> >>> 
> >>> As I said before, my initial scenario was wrong, and the problem is not
> >>> userfaultfd_writeprotect() racing against itself. This one seems actually
> >>> benign to me.
> >>> 
> >>> Nevertheless, I do think there is a problem in change_protection_range().
> >>> Specifically, see the aforementioned scenario of a race between
> >>> madvise_dontneed_free() and userfaultfd_writeprotect().
> >>> 
> >>> So an immediate solution for such a case can be resolve without holding
> >>> mmap_lock for write, by just adding a test for mm_tlb_flush_nested() in
> >>> change_protection_range():
> >>> 
> >>>       /*
> >>> 	 * Only flush the TLB if we actually modified any entries
> >>> 	 * or if there are pending TLB flushes.
> >>> 	 */
> >>>       if (pages || mm_tlb_flush_nested(mm))
> >>>               flush_tlb_range(vma, start, end);
> >>> 
> >>> To be fair, I am not confident I did not miss other problematic cases.
> >>> 
> >>> But for now, this change, with the preserve_write change should address the
> >>> immediate issues. Let me know if you agree.
> >>> 
> >>> Let me know whether you agree.
> >> 
> >> The problem starts in UFD, and is related to tlb flush. But its focal
> >> point is in do_wp_page(). I'd suggest you look at function and see
> >> what it does before and after the commits I listed, with the following
> >> conditions
> >> 
> >> PageAnon(), !PageKsm(), !PageSwapCache(), !pte_write(),
> >> page_mapcount() = 1, page_count() > 1 or PageLocked()
> >> 
> >> when it runs against the two UFD examples you listed.
> > 
> > Thanks for your quick response. I wanted to write a lengthy response, but I
> > do want to sleep on it. I presume page_count() > 1, since I have multiple
> > concurrent page-faults on the same address in my test, but I will check.
> > 
> > Anyhow, before I give a further response, I was just wondering - since you
> > recently dealt with soft-dirty issue as I remember - isn't this problematic
> > COW for non-COW page scenario, in which the copy races with writes to a page
> > which is protected in the PTE but not in all TLB, also problematic for
> > soft-dirty clearing?

Yes, it has the same problem.

> Stupid me. You hold mmap_lock for write, so no, it cannot happen when clear
> soft-dirty.

mmap_write_lock is temporarily held to update vm_page_prot for write
notifications. It doesn't help in the context of this problem.

  reply	other threads:[~2020-12-21  7:30 UTC|newest]

Thread overview: 141+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-19  4:30 [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2020-12-19 19:15 ` Andrea Arcangeli
2020-12-19 21:34   ` Nadav Amit
2020-12-19 22:06     ` Nadav Amit
2020-12-20  2:20       ` Andrea Arcangeli
2020-12-21  4:36         ` Nadav Amit
2020-12-21  5:12           ` Yu Zhao
2020-12-21  5:25             ` Nadav Amit
2020-12-21  5:39               ` Nadav Amit
2020-12-21  7:29                 ` Yu Zhao [this message]
2020-12-22 20:34       ` Andy Lutomirski
2020-12-22 20:34         ` Andy Lutomirski
2020-12-22 20:58         ` Nadav Amit
2020-12-22 21:34           ` Andrea Arcangeli
2020-12-20  2:01     ` Andy Lutomirski
2020-12-20  2:01       ` Andy Lutomirski
2020-12-20  2:49       ` Andrea Arcangeli
2020-12-20  5:08         ` Andy Lutomirski
2020-12-20  5:08           ` Andy Lutomirski
2020-12-21 18:03           ` Andrea Arcangeli
2020-12-21 18:22             ` Andy Lutomirski
2020-12-21 18:22               ` Andy Lutomirski
2020-12-20  6:05     ` Yu Zhao
2020-12-20  8:06       ` Nadav Amit
2020-12-20  9:54         ` Yu Zhao
2020-12-21  3:33           ` Nadav Amit
2020-12-21  4:44             ` Yu Zhao
2020-12-21 17:27         ` Peter Xu
2020-12-21 18:31           ` Nadav Amit
2020-12-21 19:16             ` Yu Zhao
2020-12-21 19:55               ` Linus Torvalds
2020-12-21 19:55                 ` Linus Torvalds
2020-12-21 20:21                 ` Yu Zhao
2020-12-21 20:25                   ` Linus Torvalds
2020-12-21 20:25                     ` Linus Torvalds
2020-12-21 20:23                 ` Nadav Amit
2020-12-21 20:26                   ` Linus Torvalds
2020-12-21 20:26                     ` Linus Torvalds
2020-12-21 21:24                     ` Yu Zhao
2020-12-21 21:49                       ` Nadav Amit
2020-12-21 22:30                         ` Peter Xu
2020-12-21 22:55                           ` Nadav Amit
2020-12-21 23:30                             ` Linus Torvalds
2020-12-21 23:30                               ` Linus Torvalds
2020-12-21 23:46                               ` Nadav Amit
2020-12-22 19:44                             ` Andrea Arcangeli
2020-12-22 20:19                               ` Nadav Amit
2020-12-22 21:17                                 ` Andrea Arcangeli
2020-12-21 23:12                           ` Yu Zhao
2020-12-21 23:33                             ` Linus Torvalds
2020-12-21 23:33                               ` Linus Torvalds
2020-12-22  0:00                               ` Yu Zhao
2020-12-22  0:11                                 ` Linus Torvalds
2020-12-22  0:11                                   ` Linus Torvalds
2020-12-22  0:24                                   ` Yu Zhao
2020-12-21 23:22                           ` Linus Torvalds
2020-12-21 23:22                             ` Linus Torvalds
2020-12-22  3:19                             ` Andy Lutomirski
2020-12-22  3:19                               ` Andy Lutomirski
2020-12-22  4:16                               ` Linus Torvalds
2020-12-22  4:16                                 ` Linus Torvalds
2020-12-22 20:19                                 ` Andy Lutomirski
2020-12-22 20:19                                   ` Andy Lutomirski
2021-01-05 15:37                                 ` Peter Zijlstra
2021-01-05 18:03                                   ` Andrea Arcangeli
2021-01-12 16:20                                     ` Peter Zijlstra
2021-01-12 11:43                                   ` Vinayak Menon
2021-01-12 15:47                                     ` Laurent Dufour
2021-01-12 16:57                                       ` Peter Zijlstra
2021-01-12 19:02                                         ` Laurent Dufour
2021-01-12 19:15                                           ` Nadav Amit
2021-01-12 19:56                                             ` Yu Zhao
2021-01-12 20:38                                               ` Nadav Amit
2021-01-12 20:49                                                 ` Yu Zhao
2021-01-12 21:43                                                 ` Will Deacon
2021-01-12 22:29                                                   ` Nadav Amit
2021-01-12 22:46                                                     ` Will Deacon
2021-01-13  0:31                                                     ` Andy Lutomirski
2021-01-17  4:41                                                   ` Yu Zhao
2021-01-17  7:32                                                     ` Nadav Amit
2021-01-17  9:16                                                       ` Yu Zhao
2021-01-17 10:13                                                         ` Nadav Amit
2021-01-17 19:25                                                           ` Yu Zhao
2021-01-18  2:49                                                             ` Nadav Amit
2020-12-22  9:38                               ` Nadav Amit
2020-12-22 19:31                               ` Andrea Arcangeli
2020-12-22 20:15                                 ` Matthew Wilcox
2020-12-22 20:26                                   ` Andrea Arcangeli
2020-12-22 21:14                                 ` Yu Zhao
2020-12-22 22:02                                   ` Andrea Arcangeli
2020-12-22 23:39                                     ` Yu Zhao
2020-12-22 23:50                                       ` Linus Torvalds
2020-12-22 23:50                                         ` Linus Torvalds
2020-12-23  0:01                                         ` Linus Torvalds
2020-12-23  0:01                                           ` Linus Torvalds
2020-12-23  0:23                                           ` Yu Zhao
2020-12-23  2:17                                             ` Andrea Arcangeli
2020-12-23  9:44                                           ` Linus Torvalds
2020-12-23  9:44                                             ` Linus Torvalds
2020-12-23 10:06                                             ` Yu Zhao
2020-12-23 16:24                                               ` Peter Xu
2020-12-23 18:51                                                 ` Andrea Arcangeli
2020-12-23 18:55                                                   ` Andrea Arcangeli
2020-12-23 19:12                                                 ` Yu Zhao
2020-12-23 19:32                                                   ` Peter Xu
2020-12-23  0:20                                         ` Linus Torvalds
2020-12-23  0:20                                           ` Linus Torvalds
2020-12-23  2:56                                       ` Andrea Arcangeli
2020-12-23  3:36                                         ` Yu Zhao
2020-12-23 15:52                                           ` Peter Xu
2020-12-23 21:07                                             ` Andrea Arcangeli
2020-12-23 21:39                                           ` Andrea Arcangeli
2020-12-23 22:29                                             ` Yu Zhao
2020-12-23 23:04                                               ` Andrea Arcangeli
2020-12-24  1:21                                               ` Andy Lutomirski
2020-12-24  2:00                                                 ` Andrea Arcangeli
2020-12-24  3:09                                                   ` Nadav Amit
2020-12-24  3:30                                                     ` Nadav Amit
2020-12-24  3:34                                                     ` Yu Zhao
2020-12-24  4:01                                                       ` Andrea Arcangeli
2020-12-24  5:18                                                         ` Nadav Amit
2020-12-24 18:49                                                           ` Andrea Arcangeli
2020-12-24 19:16                                                             ` Andrea Arcangeli
2020-12-24  4:37                                                       ` Nadav Amit
2020-12-24  3:31                                                   ` Andrea Arcangeli
2020-12-23 23:39                                             ` Linus Torvalds
2020-12-23 23:39                                               ` Linus Torvalds
2020-12-24  1:01                                               ` Andrea Arcangeli
2020-12-22 21:14                                 ` Nadav Amit
2020-12-22 12:40                       ` Nadav Amit
2020-12-22 18:30                         ` Yu Zhao
2020-12-22 19:20                           ` Nadav Amit
2020-12-23 16:23                             ` Will Deacon
2020-12-23 19:04                               ` Nadav Amit
2020-12-23 22:05                         ` Andrea Arcangeli
2020-12-23 22:45                           ` Nadav Amit
2020-12-23 23:55                             ` Andrea Arcangeli
2020-12-21 21:55                   ` Peter Xu
2020-12-21 23:13                     ` Linus Torvalds
2020-12-21 23:13                       ` Linus Torvalds
2020-12-21 19:53             ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X+BO8VLTN8BYLN80@google.com \
    --to=yuzhao@google.com \
    --cc=aarcange@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=nadav.amit@gmail.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=stable@vger.kernel.org \
    --cc=will@kernel.org \
    --cc=xemul@openvz.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.