linux-mm.kvack.org archive mirror
From: Andrea Arcangeli <aarcange@redhat.com>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Yu Zhao <yuzhao@google.com>, Andy Lutomirski <luto@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Xu <peterx@redhat.com>, Nadav Amit <nadav.amit@gmail.com>,
	linux-mm <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	stable <stable@vger.kernel.org>, Minchan Kim <minchan@kernel.org>,
	Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Wed, 23 Dec 2020 22:31:59 -0500
Message-ID: <X+QLr1WmGXMs33Ld@redhat.com>
In-Reply-To: <X+P2OnR+ipY8d2qL@redhat.com>

On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote:
> One other near zero cost improvement easy to add if this would be "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made

The next worry is that if the UFFDIO_WRITEPROTECT range is very large,
there would be a flood of wrprotect faults, and they'd all end up
issuing a tlb flush while the UFFDIO_WRITEPROTECT itself is still
running, which again is a performance concern for certain uffd-wp use
cases.

Those use cases would be for example redis and postcopy live
snapshotting, which would use uffd-wp for async snapshots. In the case
of redis it could even run unprivileged, if it temporarily uses bounce
buffers for the syscall I/O for the duration of the snapshot.
Hypervisor tuned profiles need to manually set
unprivileged_userfaultfd to 1, unless their jailer leaves one
capability in the snapshot thread.

Moving the check after userfaultfd_pte_wp would solve
userfaultfd_writeprotect(mode_wp=true), but that still wouldn't avoid
a flood of tlb flushes during userfaultfd_writeprotect(mode_wp=false)
because change_protection doesn't restore the pte_write:

			} else if (uffd_wp_resolve) {
				/*
				 * Leave the write bit to be handled
				 * by PF interrupt handler, then
				 * things like COW could be properly
				 * handled.
				 */
				ptent = pte_clear_uffd_wp(ptent);
			}

When the snapshot is complete, userfaultfd_writeprotect(mode_wp=false)
needs to run on the whole range again, which can also be very big.

Orthogonally, I think we should also look into restoring the pte_write
above in uffd-wp, so it'll get yet another boost compared to the
current redis fork() snapshotting, which cannot restore all pte_write
after the snapshot child quits and so forces a flood of spurious
wrprotect faults (uffd-wp can solve that too).
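
To make the cost concrete, here is a toy userspace model, not kernel
code. The TOY_PTE_* bits and the toy_* helpers are made up for
illustration and ignore COW entirely, which is the reason the real
change_protection leaves the write bit to the fault handler. It only
counts how many writes fault after a protect/resolve cycle, with and
without restoring the write bit.

```c
#include <assert.h>
#include <stddef.h>

/* Toy PTE bits; hypothetical values, not the kernel's encoding. */
#define TOY_PTE_WRITE   0x1u
#define TOY_PTE_UFFD_WP 0x2u

/* Model userfaultfd_writeprotect(mode_wp=true): drop write, set uffd-wp. */
static void toy_wp_protect(unsigned int *ptes, size_t n)
{
	for (size_t i = 0; i < n; i++)
		ptes[i] = (ptes[i] & ~TOY_PTE_WRITE) | TOY_PTE_UFFD_WP;
}

/*
 * Model userfaultfd_writeprotect(mode_wp=false). With restore_write == 0
 * only the uffd-wp bit is cleared, as in the change_protection excerpt;
 * restore_write != 0 models the improvement discussed in the text.
 */
static void toy_wp_resolve(unsigned int *ptes, size_t n, int restore_write)
{
	for (size_t i = 0; i < n; i++) {
		ptes[i] &= ~TOY_PTE_UFFD_WP;
		if (restore_write)
			ptes[i] |= TOY_PTE_WRITE;
	}
}

/* Write once to every page; return how many of those writes faulted. */
static size_t toy_write_all(unsigned int *ptes, size_t n)
{
	size_t faults = 0;

	assert(ptes != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!(ptes[i] & TOY_PTE_WRITE)) {
			faults++;                 /* spurious wrprotect fault */
			ptes[i] |= TOY_PTE_WRITE; /* handler makes it writable */
		}
	}
	return faults;
}
```

In this model, calling toy_wp_protect() and then toy_wp_resolve() with
restore_write == 0 makes toy_write_all() fault once per page, while
restore_write != 0 brings that down to zero.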

However, even if uffd-wp restored the pte_write, things would remain
suboptimal for a terabyte process under clear_refs, since softdirty
wrprotect faults that start happening while softdirty is still running
on the mm won't be caught in userfaultfd_pte_wp.

Something like below, if cleaned up, abstracted properly and
documented well in the two places involved, will have a better chance
to perform optimally for softdirty too.

And on a side note the CONFIG_MEM_SOFT_DIRTY compile time check is
compulsory because VM_SOFTDIRTY is defined to zero if softdirty is not
built in, so the negated test below would always be true without it.
(For VM_UFFD_WP the CONFIG_HAVE_ARCH_USERFAULTFD_WP check can be
removed and it won't make any measurable difference even when
USERFAULTFD=n.)
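
A small userspace sketch of why the negated test needs the compile
time guard. The TOY_* constants and toy_in_softdirty() are stand-ins
invented for this example; the nonzero value is only illustrative and
the only property that matters is that the compiled-out definition is
zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; only "nonzero" vs "zero" matters here. */
#define TOY_VM_SOFTDIRTY_BUILTIN      0x08000000UL
#define TOY_VM_SOFTDIRTY_COMPILED_OUT 0UL

/* Same shape as the RFC's test: tracking is assumed when the bit is clear. */
static bool toy_in_softdirty(unsigned long vm_flags, unsigned long softdirty_bit)
{
	return !(vm_flags & softdirty_bit);
}

/* Walks through the cases; the last two are the problematic ones. */
static void toy_softdirty_demo(void)
{
	/* Built in: the result follows the vma flag. */
	assert(!toy_in_softdirty(TOY_VM_SOFTDIRTY_BUILTIN,
				 TOY_VM_SOFTDIRTY_BUILTIN));
	assert(toy_in_softdirty(0UL, TOY_VM_SOFTDIRTY_BUILTIN));

	/* Compiled out: masking with zero makes the test always true. */
	assert(toy_in_softdirty(TOY_VM_SOFTDIRTY_BUILTIN,
				TOY_VM_SOFTDIRTY_COMPILED_OUT));
	assert(toy_in_softdirty(0UL, TOY_VM_SOFTDIRTY_COMPILED_OUT));
}
```

With a zero flag definition every vma would look like it is under
softdirty tracking, which is what the #ifdef in the patch avoids by
forcing in_softdirty to false.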

RFC untested below. It's supposed to fix the softdirty testcase too,
even without the incremental fix, since the softdirty path already
does tlb_gather_mmu before walk_page_range and tlb_finish_mmu after
it, and that appears enough to define the inc/dec_tlb_flush_pending.

diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..66fd6d070c47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2844,11 +2844,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 	} else {
+		bool in_uffd_wp, in_softdirty;
 		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 				vmf->address);
 		if (!new_page)
 			goto oom;
 
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+		in_uffd_wp = !!(vma->vm_flags & VM_UFFD_WP);
+#else
+		in_uffd_wp = false;
+#endif
+#ifdef CONFIG_MEM_SOFT_DIRTY
+		in_softdirty = !(vma->vm_flags & VM_SOFTDIRTY);
+#else
+		in_softdirty = false;
+#endif
+		if ((in_uffd_wp || in_softdirty) &&
+		    mm_tlb_flush_pending(mm))
+			flush_tlb_page(vma, vmf->address);
+
 		if (!cow_user_page(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,
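
The condition added by the hunk above can be modelled in userspace as
follows. This is only a sketch under assumptions: struct toy_mm and
the toy_* functions are hypothetical stand-ins for the mm's pending
flush counter, for tlb_gather_mmu/tlb_finish_mmu, and for the new
check in wp_page_copy; there is no locking or real TLB involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the mm's tlb_flush_pending counter (hypothetical). */
struct toy_mm {
	int tlb_flush_pending;
};

/* Model tlb_gather_mmu(): a batched flush is now in flight. */
static void toy_tlb_gather(struct toy_mm *mm)
{
	mm->tlb_flush_pending++;
}

/* Model tlb_finish_mmu(): the batched flush has completed. */
static void toy_tlb_finish(struct toy_mm *mm)
{
	assert(mm->tlb_flush_pending > 0);
	mm->tlb_flush_pending--;
}

/* Same shape as the condition the RFC adds before cow_user_page(). */
static bool toy_cow_needs_flush(const struct toy_mm *mm,
				bool in_uffd_wp, bool in_softdirty)
{
	return (in_uffd_wp || in_softdirty) && mm->tlb_flush_pending > 0;
}
```

In this model a COW fault only pays for the extra flush when it lands
between toy_tlb_gather() and toy_tlb_finish() on a vma that is under
uffd-wp or softdirty tracking; every other COW fault is unaffected.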



