linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Vinayak Menon <vinmenon@codeaurora.org>
To: Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>, Peter Xu <peterx@redhat.com>,
	Nadav Amit <nadav.amit@gmail.com>, Yu Zhao <yuzhao@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	linux-mm <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	stable <stable@vger.kernel.org>, Minchan Kim <minchan@kernel.org>,
	Will Deacon <will@kernel.org>,
	ldufour@linux.vnet.ibm.com, surenb@google.com
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Tue, 12 Jan 2021 17:13:57 +0530	[thread overview]
Message-ID: <bfb1cbe6-a705-469d-c95a-776624817e33@codeaurora.org> (raw)
In-Reply-To: <20210105153727.GK3040@hirez.programming.kicks-ass.net>

On 1/5/2021 9:07 PM, Peter Zijlstra wrote:
> On Mon, Dec 21, 2020 at 08:16:11PM -0800, Linus Torvalds wrote:
>
>> So I think the basic rule is that "if you hold mmap_sem for writing,
>> you're always safe". And that really should be considered the
>> "default" locking.
>>
>> ANY time you make a modification to the VM layer, you should basically
>> always treat it as a write operation, and get the mmap_sem for
>> writing.
>>
>> Yeah, yeah, that's a bit simplified, and it ignores various special
>> cases (and the hardware page table walkers that obviously take no
>> locks at all), but if you hold the mmap_sem for writing you won't
>> really race with anything else - not page faults, and not other
>> "modify this VM".
>> To a first approximation, everybody that changes the VM should take
>> the mmap_sem for writing, and the readers should just be just about
>> page fault handling (and I count GUP as "page fault handling" too -
>> it's kind of the same "look up page" rather than "modify vm" kind of
>> operation).
>>
>> And there are just a _lot_ more page faults than there are things that
>> modify the page tables and the vma's.
>>
>> So having that mental model of "lookup of pages in a VM take mmap_semn
>> for reading, any modification of the VM uses it for writing" makes
>> sense both from a performance angle and a logical standpoint. It's the
>> correct model.
>> And it's worth noting that COW is still "lookup of pages", even though
>> it might modify the page tables in the process. The same way lookup
>> can modify the page tables to mark things accessed or dirty.
>>
>> So COW is still a lookup operation, in ways that "change the
>> writabiility of this range" very much is not. COW is "lookup for
>> write", and the magic we do to copy to make that write valid is still
>> all about the lookup of the page.
> (your other email clarified this point; the COW needs to copy while
> holding the PTL and we need TLBI under PTL if we're to change this)
>
>> Which brings up another mental mistake I saw earlier in this thread:
>> you should not think "mmap_sem is for vma, and the page table lock is
>> for the page table changes".
>>
>> mmap_sem is the primary lock for any modifications to the VM layout,
>> whether it be in the vma's or in the page tables.
>>
>> Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
>> it is partly because
>>
>>   (a) we have things that historically walked the page tables _without_
>> walking the vma's (notably the virtual memory scanning)
>>
>>   (b) we do allow concurrent page faults, so we then need a lower-level
>> lock to serialize the parallelism we _do_ have.
> And I'm thinking the speculative page fault series steps right into all
> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
>
> Which opens it up to exactly these races explored here.
>
> The range lock approach does not suffer this, but I'm still worried
> about the actual performance of that thing.


Some thoughts on why there may not be an issue with speculative page fault.
Adding Laurent as well.

Possibility of race against other PTE modifiers

1) Fork - We have seen a case of SPF racing with fork marking PTEs RO 
and that
is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
2) mprotect - change_protection in mprotect which does the deferred flush is
marked under vm_write_begin/vm_write_end, thus SPF bails out on faults 
on those VMAs.
3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
But SPF does not take UFFD faults.
4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
(2) above.
5) Concurrent faults - SPF does not handle all faults. Only anon page 
faults.
Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
transitions without tlb flush. And I hope do_wp_page with RO->RW is fine 
as well.
I could not see a case where speculative path cannot see a PTE update 
done via
a fault on another CPU.

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation



  parent reply	other threads:[~2021-01-12 11:44 UTC|newest]

Thread overview: 120+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-19  4:30 [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2020-12-19 19:15 ` Andrea Arcangeli
     [not found]   ` <EDC00345-B46E-4396-8379-98E943723809@gmail.com>
2020-12-19 22:06     ` Nadav Amit
2020-12-20  2:20       ` Andrea Arcangeli
2020-12-21  4:36         ` Nadav Amit
2020-12-21  5:12           ` Yu Zhao
2020-12-21  5:25             ` Nadav Amit
2020-12-21  5:39               ` Nadav Amit
2020-12-21  7:29                 ` Yu Zhao
2020-12-22 20:34       ` Andy Lutomirski
2020-12-22 20:58         ` Nadav Amit
2020-12-22 21:34           ` Andrea Arcangeli
2020-12-20  2:01     ` Andy Lutomirski
2020-12-20  2:49       ` Andrea Arcangeli
2020-12-20  5:08         ` Andy Lutomirski
2020-12-21 18:03           ` Andrea Arcangeli
2020-12-21 18:22             ` Andy Lutomirski
2020-12-20  6:05     ` Yu Zhao
2020-12-20  8:06       ` Nadav Amit
2020-12-20  9:54         ` Yu Zhao
2020-12-21  3:33           ` Nadav Amit
2020-12-21  4:44             ` Yu Zhao
2020-12-21 17:27         ` Peter Xu
2020-12-21 18:31           ` Nadav Amit
2020-12-21 19:16             ` Yu Zhao
2020-12-21 19:55               ` Linus Torvalds
2020-12-21 20:21                 ` Yu Zhao
2020-12-21 20:25                   ` Linus Torvalds
2020-12-21 20:23                 ` Nadav Amit
2020-12-21 20:26                   ` Linus Torvalds
2020-12-21 21:24                     ` Yu Zhao
2020-12-21 21:49                       ` Nadav Amit
2020-12-21 22:30                         ` Peter Xu
2020-12-21 22:55                           ` Nadav Amit
2020-12-21 23:30                             ` Linus Torvalds
2020-12-21 23:46                               ` Nadav Amit
2020-12-22 19:44                             ` Andrea Arcangeli
2020-12-22 20:19                               ` Nadav Amit
2020-12-22 21:17                                 ` Andrea Arcangeli
2020-12-21 23:12                           ` Yu Zhao
2020-12-21 23:33                             ` Linus Torvalds
2020-12-22  0:00                               ` Yu Zhao
2020-12-22  0:11                                 ` Linus Torvalds
2020-12-22  0:24                                   ` Yu Zhao
2020-12-21 23:22                           ` Linus Torvalds
2020-12-22  3:19                             ` Andy Lutomirski
2020-12-22  4:16                               ` Linus Torvalds
2020-12-22 20:19                                 ` Andy Lutomirski
2021-01-05 15:37                                 ` Peter Zijlstra
2021-01-05 18:03                                   ` Andrea Arcangeli
2021-01-12 16:20                                     ` Peter Zijlstra
2021-01-12 11:43                                   ` Vinayak Menon [this message]
2021-01-12 15:47                                     ` Laurent Dufour
2021-01-12 16:57                                       ` Peter Zijlstra
2021-01-12 19:02                                         ` Laurent Dufour
2021-01-12 19:15                                           ` Nadav Amit
2021-01-12 19:56                                             ` Yu Zhao
2021-01-12 20:38                                               ` Nadav Amit
2021-01-12 20:49                                                 ` Yu Zhao
2021-01-12 21:43                                                 ` Will Deacon
2021-01-12 22:29                                                   ` Nadav Amit
2021-01-12 22:46                                                     ` Will Deacon
2021-01-13  0:31                                                     ` Andy Lutomirski
2021-01-17  4:41                                                   ` Yu Zhao
2021-01-17  7:32                                                     ` Nadav Amit
2021-01-17  9:16                                                       ` Yu Zhao
2021-01-17 10:13                                                         ` Nadav Amit
2021-01-17 19:25                                                           ` Yu Zhao
2021-01-18  2:49                                                             ` Nadav Amit
2020-12-22  9:38                               ` Nadav Amit
2020-12-22 19:31                               ` Andrea Arcangeli
2020-12-22 20:15                                 ` Matthew Wilcox
2020-12-22 20:26                                   ` Andrea Arcangeli
2020-12-22 21:14                                 ` Yu Zhao
2020-12-22 22:02                                   ` Andrea Arcangeli
2020-12-22 23:39                                     ` Yu Zhao
2020-12-22 23:50                                       ` Linus Torvalds
2020-12-23  0:01                                         ` Linus Torvalds
2020-12-23  0:23                                           ` Yu Zhao
2020-12-23  2:17                                             ` Andrea Arcangeli
2020-12-23  9:44                                           ` Linus Torvalds
2020-12-23 10:06                                             ` Yu Zhao
2020-12-23 16:24                                               ` Peter Xu
2020-12-23 18:51                                                 ` Andrea Arcangeli
2020-12-23 18:55                                                   ` Andrea Arcangeli
2020-12-23 19:12                                                 ` Yu Zhao
2020-12-23 19:32                                                   ` Peter Xu
2020-12-23  0:20                                         ` Linus Torvalds
2020-12-23  2:56                                       ` Andrea Arcangeli
2020-12-23  3:36                                         ` Yu Zhao
2020-12-23 15:52                                           ` Peter Xu
2020-12-23 21:07                                             ` Andrea Arcangeli
2020-12-23 21:39                                           ` Andrea Arcangeli
2020-12-23 22:29                                             ` Yu Zhao
2020-12-23 23:04                                               ` Andrea Arcangeli
2020-12-24  1:21                                               ` Andy Lutomirski
2020-12-24  2:00                                                 ` Andrea Arcangeli
2020-12-24  3:09                                                   ` Nadav Amit
2020-12-24  3:30                                                     ` Nadav Amit
2020-12-24  3:34                                                     ` Yu Zhao
2020-12-24  4:01                                                       ` Andrea Arcangeli
2020-12-24  5:18                                                         ` Nadav Amit
2020-12-24 18:49                                                           ` Andrea Arcangeli
2020-12-24 19:16                                                             ` Andrea Arcangeli
2020-12-24  4:37                                                       ` Nadav Amit
2020-12-24  3:31                                                   ` Andrea Arcangeli
2020-12-23 23:39                                             ` Linus Torvalds
2020-12-24  1:01                                               ` Andrea Arcangeli
2020-12-22 21:14                                 ` Nadav Amit
2020-12-22 12:40                       ` Nadav Amit
2020-12-22 18:30                         ` Yu Zhao
2020-12-22 19:20                           ` Nadav Amit
2020-12-23 16:23                             ` Will Deacon
2020-12-23 19:04                               ` Nadav Amit
2020-12-23 22:05                         ` Andrea Arcangeli
2020-12-23 22:45                           ` Nadav Amit
2020-12-23 23:55                             ` Andrea Arcangeli
2020-12-21 21:55                   ` Peter Xu
2020-12-21 23:13                     ` Linus Torvalds
2020-12-21 19:53             ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bfb1cbe6-a705-469d-c95a-776624817e33@codeaurora.org \
    --to=vinmenon@codeaurora.org \
    --cc=aarcange@redhat.com \
    --cc=ldufour@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=nadav.amit@gmail.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=stable@vger.kernel.org \
    --cc=surenb@google.com \
    --cc=torvalds@linux-foundation.org \
    --cc=will@kernel.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).