linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Nadav Amit <namit@vmware.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	David Rientjes <rientjes@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Yang Shi <shy828301@gmail.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Michal Hocko <mhocko@kernel.org>, Rik van Riel <riel@surriel.com>,
	Roman Gushchin <guro@fb.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Peter Xu <peterx@redhat.com>, Donald Dutile <ddutile@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Oleg Nesterov <oleg@redhat.com>,
	Jan Kara <jack@suse.cz>, Linux-MM <linux-mm@kvack.org>,
	"open list:KERNEL SELFTEST FRAMEWORK" 
	<linux-kselftest@vger.kernel.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
Date: Tue, 21 Dec 2021 09:58:32 +0100	[thread overview]
Message-ID: <fd7e3195-4f36-3804-1793-d453d5bd3e9f@redhat.com> (raw)
In-Reply-To: <20211221010312.GC1432915@nvidia.com>

On 21.12.21 02:03, Jason Gunthorpe wrote:
> On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
> 
>> handler (COW or unshare). Outside of COW/Unshare code, the bit can only
>> be set if page_count() == 1 and we sync against fork(). (and that's the
>> problem for gup-fast-only that I'm investigating right now, because it
>> would then always have to fallback to the slow variant if the bit isn't
>> already set)
> 

[in the meantime I figured out which pageflag we can reuse for anon
pages, which is at least one step into the right direction]

> I'm having a hard time imagining how gup_fast can maintain any sort of
> bit - it lacks all forms of locks so how can we do an atomic test and
> set between two pieces of data?

And exactly that is to be figured out.

Note that I am trying to make also any kind of R/O pins on an anonymous
page work as expected as well, to fix any kind of GUP after fork() and
GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
similarly has to make sure that the page is exclusive -- even if it's
mapped R/O (!).

In the pagefault handler we can then always reuse a PageAnonExclusive()
page, because we know it's exclusive and it will stay exclusive because
concurrent fork() isn't possible.

> 
> I think the point of Linus's plan really is that the new bit is
> derived from page_count, we get the set the new bit when we observe
> page_count==1 in various situations and we clear the new bit whenever
> we write protect with the intent to copy.

Here are is one problems I'm fighting with:


Assume we set the bit whenever we create a new anon page (either due to
COW, ordinary fault, unsharing request, ..., even if it's mapped R/O
first). We know the page is exclusive at that point because we created
it and fork() could not happen yet.

fork() is the only code that can share the page between processes and
turn it non-exclusive.

We can only clear the bit during fork() -- to turn the share page
exclusive and map it R/O into both processes -- when we are sure that
*nobody* concurrently takes a reference on the page the would be
problematic (-> GUP).

So to clear the bit during fork, we have to
(1) Check against page_count == 1
(2) Synchronize against GUP

(2) is easy using the mmap_lock and the mm->write_protect_seq


BUT, it would mean that whenever we fork() and there is one additional
reference on a page (even if it's from the swapcache), we would slow
down fork() even if there was never any GUP. This would apply to any
process out there that does a fork() ...


So the idea is to mark a page only exclusive as soon as someone needs
the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN
or selected FOLL_GET like O_DIRECT). This can happen in my current
approach using two ways:

(1) Set the bit when we know we are the only users

We can set PageAnonExclusive() in case *we sync against fork* and the
page cannot get unmapped (pt lock) when:
* The page is mapped writable
* The page is mapped readable and page_count == 1

This should work during ordinary GUP in many cases.

If we cannot set the page exclusive, we have to trigger a page fault.

(2) During pagefaults when FOLL_FAULT_UNSHARE is set.

GUP will set FOLL_FAULT_UNSHARE for a pagefault when required (again
e.g., FOLL_PIN or selected FOLL_GET users), and manual setting of the
bit failed. The page fault will then try once again to set the bit if
there is a page mapped, and if that fails, do the COW/unshare and set
the bit.


The above should work fairly reliable with GUP. But indeed,
gup-fast-only is the problem. I'm still investigating what kind of
lightweight synchronization we could do against fork() such that we
wouldn't try setting a page PageAnonExclusive() while fork()
concurrently shares the page.

We could eventually use the page lock and do a try_lock(), both in
fork() and in gup-fast-only. fork() would only clear the bit if the
try_lock() succeeded. gup-fast-only would only be able to set the bit
and not fallback to the slow path if try_lock() succeeded.


But I'm still investigating if there are better alternatives ...

> 
> GUP doesn't interact with this bit. A writable page would still be the
> second way you can say "you clearly have full ownership of the page',
> so GUP just checks writability and bumps the refcount. The challenge
> of GUP reamins to sanely sequence it with things that are doing WP.
> 
> The elevated GUP refcount prevents the page from getting the bit set
> again, regardless of what happens to it.
> 
> Then on the WP sides.. Obviously we clear the bit when applying a WP
> for copy. So all the bad GUP cases are halted now, as with a cleared
> bit and a != 1 refcount COW must happen.
> 
> Then, it is also the case that most often a read-only page will have
> this bit cleared, UFFWD WP being the exception.
> 
> UFFWD WP works fine as it will have the bit set in the cases we care
> about and COW will not happen.
> 
> If the bit is not set then everything works as is today and you get
> extra COWs. We still have to fix the things that are missing the extra
> COWs to check the page ref to fix this.
> 
> It seems this new bit is acting as a 'COW disable', so, the accuracy
> of COW vs GUP&speculative pagerefs now relies on setting the bit as
> aggressively as possible when it is safe and cheap to do so.

But we really want to avoid degrading fork() for everybody that doesn't
do heavy GUP ...

> 
> If I got it right this is why it is not just mapcount reduced to 1
> bit. It is quite different, even though "this VM owns the page
> outright" sure sounds like "mapcount == 1"..
> 
> It seems like an interesting direction - the security properties seem
> good as we only have to take care of sites applying WP to decide what
> to do with the extra bit, and all places that set the bit to 1 do so
> after testing refcount under various locks preventing PTE WP.
> 
> That just leave the THP splitting.. I suppose we get the PTL, then
> compute the current value of the new bit based on refcount and diffuse
> it to all tail pages, then update the PMD and release the PTL. Safe
> against concurrent WP - don't need DoubleMap horrors because it isn't
> a counter.
> 
>> Not set it stone, just an idea what I'm playing with right now ... and I
>> have to tripple-check if
>> * page is PTE mapped in the page table I'm walking
>> * page_count() == 1
>> Really means that "this is the only reference.". I do strongly believe
>> so .. :)
> 
> AFAIK the only places that can break this are places putting struct
> page memory into special PTEs. Which is horrific and is just bugs, but
> I think I've seen it from time to time :(

As we only care about anon pages, I think that doesn't apply. At least
that's what I hope.


-- 
Thanks,

David / dhildenb


  parent reply	other threads:[~2021-12-21  8:58 UTC|newest]

Thread overview: 137+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-17 11:30 [PATCH v1 00/11] mm: COW fixes part 1: fix the COW security issue for THP and hugetlb David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant David Hildenbrand
2021-12-17 17:02   ` Nadav Amit
2021-12-17 17:29     ` David Hildenbrand
2021-12-17 17:49       ` David Hildenbrand
2021-12-17 18:01         ` Nadav Amit
2021-12-17 21:28   ` Thomas Gleixner
2021-12-17 22:02     ` David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 02/11] mm: thp: consolidate mapcount logic on THP split David Hildenbrand
2021-12-17 19:06   ` Yang Shi
2021-12-18 14:24   ` Kirill A. Shutemov
2021-12-17 11:30 ` [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount() David Hildenbrand
2021-12-17 17:16   ` Nadav Amit
2021-12-17 17:30     ` David Hildenbrand
2021-12-17 18:06   ` Mike Kravetz
2021-12-17 18:11     ` David Hildenbrand
2021-12-17 19:07   ` Yang Shi
2021-12-18 14:31   ` Kirill A. Shutemov
2021-12-17 11:30 ` [PATCH v1 04/11] mm: thp: simlify total_mapcount() David Hildenbrand
2021-12-17 19:12   ` Yang Shi
2021-12-18 14:35   ` Kirill A. Shutemov
2021-12-17 11:30 ` [PATCH v1 05/11] mm: thp: allow for reading the THP mapcount atomically via a raw_seqlock_t David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb) David Hildenbrand
2021-12-17 19:04   ` Linus Torvalds
2021-12-17 19:22     ` Linus Torvalds
2021-12-17 20:17       ` David Hildenbrand
2021-12-17 20:36         ` Linus Torvalds
2021-12-17 20:39           ` Linus Torvalds
2021-12-17 20:43             ` Linus Torvalds
2021-12-17 20:42           ` David Hildenbrand
2021-12-17 20:45             ` Linus Torvalds
2021-12-18 22:52               ` Kirill A. Shutemov
2021-12-18 23:05                 ` Linus Torvalds
2021-12-17 20:47           ` Jason Gunthorpe
2021-12-17 20:56             ` Linus Torvalds
2021-12-17 21:17               ` David Hildenbrand
2021-12-17 21:04             ` David Hildenbrand
2021-12-18  0:50               ` Jason Gunthorpe
2021-12-17 21:15             ` Nadav Amit
2021-12-17 21:20               ` David Hildenbrand
2021-12-18  0:50               ` Jason Gunthorpe
2021-12-18  1:53               ` Linus Torvalds
2021-12-18  2:17                 ` Linus Torvalds
2021-12-18  2:42                   ` Linus Torvalds
2021-12-18  3:36                     ` Linus Torvalds
2021-12-18 10:06                     ` David Hildenbrand
2021-12-18  3:05                 ` Jason Gunthorpe
2021-12-18  3:30                   ` Nadav Amit
2021-12-18  3:38                     ` Linus Torvalds
2021-12-18 18:42                       ` Jason Gunthorpe
2021-12-18 18:49                         ` David Hildenbrand
2021-12-18 21:48                         ` Nadav Amit
2021-12-18 22:53                           ` Linus Torvalds
2021-12-19  0:19                             ` Nadav Amit
2021-12-19  0:35                               ` Linus Torvalds
2021-12-19  6:02                                 ` Nadav Amit
2021-12-19  8:01                                   ` John Hubbard
2021-12-19 11:30                                     ` Matthew Wilcox
2021-12-19 17:27                                   ` Linus Torvalds
2021-12-19 17:44                                     ` David Hildenbrand
2021-12-19 17:44                                     ` Linus Torvalds
2021-12-19 17:59                                       ` David Hildenbrand
2021-12-19 21:12                                         ` Matthew Wilcox
2021-12-19 21:27                                           ` Linus Torvalds
2021-12-19 21:47                                             ` Matthew Wilcox
2021-12-19 21:53                                               ` Linus Torvalds
2021-12-19 22:02                                                 ` Matthew Wilcox
2021-12-19 22:12                                                   ` Linus Torvalds
2021-12-19 22:26                                                     ` Matthew Wilcox
2021-12-20 18:37                                           ` Matthew Wilcox
2021-12-20 18:52                                             ` Matthew Wilcox
2021-12-20 19:38                                               ` Linus Torvalds
2021-12-20 19:15                                             ` Linus Torvalds
2021-12-20 21:02                                               ` Matthew Wilcox
2021-12-20 21:27                                                 ` Linus Torvalds
2021-12-21  1:03                                         ` Jason Gunthorpe
2021-12-21  3:29                                           ` Matthew Wilcox
2021-12-21  8:58                                           ` David Hildenbrand [this message]
2021-12-21 14:28                                             ` Jason Gunthorpe
2021-12-21 15:19                                               ` David Hildenbrand
2021-12-21 23:54                                                 ` Jason Gunthorpe
2021-12-21 17:05                                             ` Linus Torvalds
2021-12-21 17:40                                               ` David Hildenbrand
2021-12-21 18:00                                                 ` Linus Torvalds
2021-12-21 18:28                                                   ` David Hildenbrand
2021-12-21 21:11                                                     ` John Hubbard
2021-12-21 18:07                                                 ` Jan Kara
2021-12-21 18:30                                                   ` Linus Torvalds
2021-12-21 18:51                                                     ` David Hildenbrand
2021-12-21 18:58                                                       ` Linus Torvalds
2021-12-21 21:16                                                     ` John Hubbard
2021-12-21 19:07                                                 ` Jason Gunthorpe
2021-12-22  8:51                                                   ` David Hildenbrand
2021-12-22  9:58                                                     ` David Hildenbrand
2021-12-22 12:41                                                       ` Jan Kara
2021-12-22 13:09                                                         ` David Hildenbrand
2021-12-22 14:42                                                           ` Jan Kara
2021-12-22 14:48                                                             ` David Hildenbrand
2021-12-22 16:08                                                               ` Jan Kara
2021-12-22 16:44                                                                 ` Matthew Wilcox
2021-12-22 18:40                                                                 ` Linus Torvalds
2021-12-23 12:54                                                                   ` Jan Kara
2021-12-23 17:18                                                                     ` Linus Torvalds
2021-12-23  0:21                                                           ` Matthew Wilcox
2021-12-24  2:53                                                             ` Jason Gunthorpe
2021-12-24  4:53                                                               ` Matthew Wilcox
2022-01-04  0:33                                                                 ` Jason Gunthorpe
2021-12-21 23:59                                                 ` Jason Gunthorpe
2021-12-22  8:30                                                   ` David Hildenbrand
2021-12-22 12:44                                                   ` Jan Kara
2021-12-17 20:45     ` David Hildenbrand
2021-12-17 20:51       ` Linus Torvalds
2021-12-17 20:55         ` David Hildenbrand
2021-12-17 21:36           ` Linus Torvalds
2021-12-17 21:47             ` David Hildenbrand
2021-12-17 21:50               ` Linus Torvalds
2021-12-17 22:29                 ` David Hildenbrand
2021-12-17 22:58                   ` Linus Torvalds
2021-12-17 23:29                     ` David Hildenbrand
2021-12-17 23:53                       ` Nadav Amit
2021-12-18  4:02                         ` Linus Torvalds
2021-12-18  4:52                           ` Nadav Amit
2021-12-18  5:03                             ` Matthew Wilcox
2021-12-18  5:23                               ` Nadav Amit
2021-12-18 18:37                               ` Linus Torvalds
2021-12-17 22:18               ` Linus Torvalds
2021-12-17 22:43                 ` David Hildenbrand
2021-12-17 23:20                   ` Linus Torvalds
2021-12-18  9:57                     ` David Hildenbrand
2021-12-18 19:21                       ` Linus Torvalds
2021-12-18 19:52                         ` Linus Torvalds
2021-12-19  8:43                           ` David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 07/11] mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (!hugetlb) David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 08/11] mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 09/11] mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (hugetlb) David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 10/11] mm: thp: introduce and use page_trans_huge_anon_shared() David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 11/11] selftests/vm: add tests for the known COW security issues David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fd7e3195-4f36-3804-1793-d453d5bd3e9f@redhat.com \
    --to=david@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=ddutile@redhat.com \
    --cc=guro@fb.com \
    --cc=hch@lst.de \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=namit@vmware.com \
    --cc=oleg@redhat.com \
    --cc=peterx@redhat.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=rppt@linux.ibm.com \
    --cc=shakeelb@google.com \
    --cc=shy828301@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).