All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Nadav Amit <nadav.amit@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Nadav Amit <namit@vmware.com>, Yu Zhao <yuzhao@google.com>,
	Andy Lutomirski <luto@kernel.org>, Peter Xu <peterx@redhat.com>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Tue, 5 Jan 2021 12:58:55 -0500	[thread overview]
Message-ID: <X/So39qVhYYzDZ+8@redhat.com> (raw)
In-Reply-To: <20210105085857.GE3040@hirez.programming.kicks-ass.net>

On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> > > 
> > > > The scenario that happens in selftests/vm/userfaultfd is as follows:
> > > > 
> > > > cpu0				cpu1			cpu2
> > > > ----				----			----
> > > > 							[ Writable PTE
> > > > 							  cached in TLB ]
> > > > userfaultfd_writeprotect()
> > > > [ write-*unprotect* ]
> > > > mwriteprotect_range()
> > > > mmap_read_lock()
> > > > change_protection()
> > > > 
> > > > change_protection_range()
> > > > ...
> > > > change_pte_range()
> > > > [ *clear* “write”-bit ]
> > > > [ defer TLB flushes ]
> > > > 				[ page-fault ]
> > > > 				...
> > > > 				wp_page_copy()
> > > > 				 cow_user_page()
> > > > 				  [ copy page ]
> > > > 							[ write to old
> > > > 							  page ]
> > > > 				...
> > > > 				 set_pte_at_notify()
> > > 
> > > Yuck!
> > > 
> > 
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one showed on the left causes zero issues).
> > 
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which is also isn't picture above. More
> > accurate traces are posted later in the thread.
> 
> Lets assume CPU0 does a read-lock, W -> RO with deferred flush.

I was mistaken saying the deferred tlb flush was not shown in the v2
trace, just this appears a new different case we didn't happen to
consider before.

In the previous case we discussed earlier, when un-wrprotect above is
called it never should have been a W->RO since a wrprotect run first.

Doesn't it ring a bell that if an un-wrprotect does a W->RO
transition, something is a bit going backwards?

I don't recall from previous discussion that un-wrprotect was
considered as called on read-write memory. I think we need the below
change to fix this new case.

			if (uffd_wp) {
+				if (unlikely(pte_uffd_wp(oldpte)))
+					continue;
				ptent = pte_wrprotect(ptent);
				ptent = pte_mkuffd_wp(ptent);
			} else if (uffd_wp_resolve) {
+				if (unlikely(!pte_uffd_wp(oldpte)))
+					continue;
				/*
				 * Leave the write bit to be handled
				 * by PF interrupt handler, then
				 * things like COW could be properly
				 * handled.
				 */
				ptent = pte_clear_uffd_wp(ptent);
			}

I now get why the v2 patch touches preserved_write, but this is not
about preserve_write, it's not about leaving the write bit alone. This
is about leaving the whole pte alone if the uffd-wp bit doesn't
actually change.

We shouldn't just defer the tlb flush if un-wprotect is called on
read-write memory: we should not have flushed the tlb at all in such
case.

Same for hugepmd in huge_memory.c which will be somewhere else.

Once the above is optimized, then un-wrprotect as in
MM_CP_UFFD_WP_RESOLVE is usually preceded by wrprotect as in
MM_CP_UFFD_WP, and so it'll never be a W->RO but a RO->RO transition
that just clears the uffd_wp flag and nothing else and whose tlb flush
is in turn irrelevant.

The fix discussed still works for this new case too: I'm not
suggesting we should rely on the above optimization for the tlb
safety. The above is just a missing optimization.

> > > Isn't this all rather similar to the problem that resulted in the
> > > tlb_flush_pending mess?
> > > 
> > > I still think that's all fundamentally buggered, the much saner solution
> > > (IMO) would've been to make things wait for the pending flush, instead
> > 
> > How do intend you wait in PT lock while the writer also has to take PT
> > lock repeatedly before it can do wake_up_var?
> > 
> > If you release the PT lock before calling wait_tlb_flush_pending it
> > all falls apart again.
> 
> I suppose you can check for pending, if found, release lock, wait for 0,
> and re-take the fault?

Aborting the page fault unconditionally while MADV_DONTNEED is running
on some other unrelated vma, sounds not desirable.

Doing it only for !VM_SOFTDIRTY or soft dirty not compiled in sounds
less bad but it would still mean that while clear_refs is running, no
thread can write to any anon memory of the process.

> > This I guess explains why a local pte/hugepmd smp local invlpg is the
> > only working solution for this issue, similarly to how it's done in rmap.
> 
> In that case a local invalidate on CPU1 simply doesn't help anything.
> 
> CPU1 needs to do a global invalidate or wait for the in-progress one to
> complete, such that CPU2 is sure to not have a W entry left before CPU1
> goes and copies the page.

Yes, it was a global invlpg, definitely not local sorry for the
confusion, as in the PoC posted here which needs cleaning up:

       https://lkml.kernel.org/r/X+QLr1WmGXMs33Ld@redhat.com

+                       flush_tlb_page(vma, vmf->address);

I think instead of the flush_tlb_page above, we just need an
ad-hoc abstraction there.

The added complexity to the page fault common code consist in having
to call such abstract call in the right place of the page fault.

The vm_flags to check will be the same for both the flush_tlb_page and
the wait_tlb_pending approaches.

Once the filter on vm_flags pass, the only difference is between
"flush_tlb_page; return void" or "PT unlock; wait_; return
VM_FAULT_RETRY" so it looks more an implementation detail with a
different tradeoff at runtime.

Thanks,
Andrea


  parent reply	other threads:[~2021-01-05 18:00 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22   ` Peter Zijlstra
2021-01-04 19:24     ` Andrea Arcangeli
2021-01-04 19:35       ` Nadav Amit
2021-01-04 20:19         ` Andrea Arcangeli
2021-01-04 20:39           ` Nadav Amit
2021-01-04 21:01             ` Andrea Arcangeli
2021-01-04 21:26               ` Nadav Amit
2021-01-05 18:45                 ` Andrea Arcangeli
2021-01-05 19:05                   ` Nadav Amit
2021-01-05 19:45                     ` Andrea Arcangeli
2021-01-05 20:06                       ` Nadav Amit
2021-01-05 21:06                         ` Andrea Arcangeli
2021-01-05 21:43                           ` Peter Xu
2021-01-05  8:13       ` Peter Zijlstra
2021-01-05  8:52         ` Nadav Amit
2021-01-05 14:26           ` Peter Zijlstra
2021-01-05  8:58       ` Peter Zijlstra
2021-01-05  9:22         ` Nadav Amit
2021-01-05 17:58         ` Andrea Arcangeli [this message]
2021-01-05 15:08   ` Peter Xu
2021-01-05 18:08     ` Andrea Arcangeli
2021-01-05 18:41       ` Peter Xu
2021-01-05 18:55         ` Andrea Arcangeli
2021-01-05 19:07     ` Nadav Amit
2021-01-05 19:43       ` Peter Xu
2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
2021-01-05 15:08   ` Will Deacon
2021-01-05 18:20   ` Andrea Arcangeli
2021-01-05 19:26     ` Nadav Amit
2021-01-05 20:39       ` Andrea Arcangeli
2021-01-05 21:20         ` Yu Zhao
2021-01-05 21:22         ` Nadav Amit
2021-01-05 22:16           ` Will Deacon
2021-01-06  0:29             ` Andrea Arcangeli
2021-01-06  0:02           ` Andrea Arcangeli
2021-01-07 20:04           ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17               ` Linus Torvalds
2021-01-07 20:17                 ` Linus Torvalds
2021-01-07 20:25                 ` Linus Torvalds
2021-01-07 20:25                   ` Linus Torvalds
2021-01-07 20:58                 ` Andrea Arcangeli
2021-01-07 21:29                   ` Linus Torvalds
2021-01-07 21:29                     ` Linus Torvalds
2021-01-07 21:53                     ` John Hubbard
2021-01-07 22:00                       ` Linus Torvalds
2021-01-07 22:00                         ` Linus Torvalds
2021-01-07 22:14                         ` John Hubbard
2021-01-07 22:20                           ` Linus Torvalds
2021-01-07 22:20                             ` Linus Torvalds
2021-01-07 22:24                             ` Linus Torvalds
2021-01-07 22:24                               ` Linus Torvalds
2021-01-07 22:37                               ` John Hubbard
2021-01-15 11:27                       ` Jan Kara
2021-01-07 22:31                     ` Andrea Arcangeli
2021-01-07 22:42                       ` Linus Torvalds
2021-01-07 22:42                         ` Linus Torvalds
2021-01-07 22:51                         ` Linus Torvalds
2021-01-07 22:51                           ` Linus Torvalds
2021-01-07 23:48                           ` Andrea Arcangeli
2021-01-08  0:25                             ` Linus Torvalds
2021-01-08  0:25                               ` Linus Torvalds
2021-01-08 12:48                               ` Will Deacon
2021-01-08 16:14                                 ` Andrea Arcangeli
2021-01-08 17:39                                   ` Linus Torvalds
2021-01-08 17:39                                     ` Linus Torvalds
2021-01-08 17:53                                     ` Andrea Arcangeli
2021-01-08 19:25                                       ` Linus Torvalds
2021-01-08 19:25                                         ` Linus Torvalds
2021-01-09  0:12                                         ` Andrea Arcangeli
2021-01-08 17:30                                 ` Linus Torvalds
2021-01-08 17:30                                   ` Linus Torvalds
2021-01-07 23:28                         ` Andrea Arcangeli
2021-01-07 21:36               ` kernel test robot
2021-01-07 21:36                 ` kernel test robot
2021-01-07 20:25             ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32               ` Linus Torvalds
2021-01-07 20:32                 ` Linus Torvalds
2021-01-07 21:05                 ` Linus Torvalds
2021-01-07 21:05                   ` Linus Torvalds
2021-01-07 22:02                   ` Andrea Arcangeli
2021-01-07 22:17                     ` Linus Torvalds
2021-01-07 22:17                       ` Linus Torvalds
2021-01-07 22:56                       ` Andrea Arcangeli
2021-01-09 19:32                   ` Matthew Wilcox
2021-01-09 19:46                     ` Linus Torvalds
2021-01-09 19:46                       ` Linus Torvalds
2021-01-15 14:30                       ` Jan Kara
2021-01-07 21:54                 ` Andrea Arcangeli
2021-01-07 21:45               ` Andrea Arcangeli
2021-01-08 13:36                 ` Jason Gunthorpe
2021-01-08 17:00                   ` Andrea Arcangeli
2021-01-08 18:19                     ` Jason Gunthorpe
2021-01-08 18:31                       ` Andy Lutomirski
2021-01-08 18:31                         ` Andy Lutomirski
2021-01-08 18:38                         ` Linus Torvalds
2021-01-08 18:38                           ` Linus Torvalds
2021-01-08 23:34                         ` Andrea Arcangeli
2021-01-09 19:03                           ` Andy Lutomirski
2021-01-09 19:03                             ` Andy Lutomirski
2021-01-09 19:15                             ` Linus Torvalds
2021-01-09 19:15                               ` Linus Torvalds
2021-01-08 18:59                       ` Linus Torvalds
2021-01-08 18:59                         ` Linus Torvalds
2021-01-08 22:43                       ` Andrea Arcangeli
2021-01-09  0:42                         ` Jason Gunthorpe
2021-01-09  2:50                           ` Andrea Arcangeli
2021-01-11 14:30                             ` Jason Gunthorpe
2021-01-13 21:56                           ` Jerome Glisse
2021-01-13 23:39                             ` Jason Gunthorpe
2021-01-14  2:35                               ` Jerome Glisse
2021-01-09  3:49                       ` Hillf Danton
2021-01-11 14:39                         ` Jason Gunthorpe
2021-01-05 21:55         ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
2021-03-02 22:14   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X/So39qVhYYzDZ+8@redhat.com \
    --to=aarcange@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mgorman@suse.de \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=nadav.amit@gmail.com \
    --cc=namit@vmware.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=will@kernel.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.