linux-kernel.vger.kernel.org archive mirror
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Linux-MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Yu Zhao <yuzhao@google.com>, Andy Lutomirski <luto@kernel.org>,
	Peter Xu <peterx@redhat.com>, Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Hugh Dickins <hughd@google.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Kees Cook <keescook@chromium.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>, Jan Kara <jack@suse.cz>,
	Kirill Tkhai <ktkhai@virtuozzo.com>
Subject: Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
Date: Fri, 8 Jan 2021 10:59:35 -0800
Message-ID: <CAHk-=wirQT8Sc8ZJwLqUfet1GTokyc0L0Vt+Y_b0mS++KbX36g@mail.gmail.com>
In-Reply-To: <20210108181945.GF504133@ziepe.ca>

On Fri, Jan 8, 2021 at 10:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> Sorry, I missed something, how does mmaping a fresh new page in the
> child impact the parent?
>
> I guess the issue is not to mmap but to GUP a shared page

No.

It has nothing to do with a shared page.

The problem with the COW in the child is that the parent now BELIEVES
that it has a private copy (because page_mapcount() was 1), but it
doesn't really.

But because the parent *thought* it had a private copy of the page,
when the _parent_ did a write, it would cause the page COW logic to go
"you have exclusive access to the page, so I'll just make it
writable".

The parent then writes whatever private data to that page.

That page is still in the system as a vmsplice'd page, and the child
can now read that private data that was _supposed_ to be exclusive to
the parent, but wasn't.
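
The sequence above can be sketched as a toy user-space model in Python.
This is not kernel code: the Page class and every helper name below are
made up for illustration, and the model simply assumes that the old
reuse decision looked only at a mapcount of 1, with the vmsplice()-style
reference counted as a reference but not as a mapping.

```python
class Page:
    """Toy stand-in for a physical page; not the kernel's struct page."""
    def __init__(self, data):
        self.data = data
        self.mapcount = 0   # how many page tables map this page
        self.refcount = 0   # all references, including GUP-style ones


def map_page(page):
    page.mapcount += 1
    page.refcount += 1


def unmap_page(page):
    page.mapcount -= 1
    page.refcount -= 1


def gup_reference(page):
    """Model a vmsplice()-style reference: counted, but not mapped."""
    page.refcount += 1
    return page


def write_fault_mapcount_rule(page, new_data):
    """Assumed old-style rule: reuse the page if only one mapping exists."""
    if page.mapcount == 1:
        page.data = new_data          # reuse in place
        return page
    copy = Page(new_data)             # otherwise the writer gets a copy
    map_page(copy)
    unmap_page(page)
    return copy


def demo_leak():
    shared = Page("inherited")
    map_page(shared)                  # parent's mapping
    map_page(shared)                  # child's mapping after fork()

    child_view = gup_reference(shared)    # child keeps a sneaky reference
    # Child write: COW hands the child a copy and unmaps the old page.
    write_fault_mapcount_rule(shared, "child data")

    # Parent write: mapcount is now 1, so the page is reused in place.
    parent_page = write_fault_mapcount_rule(shared, "parent secret")
    return parent_page is shared, child_view.data
```

In this model demo_leak() returns (True, "parent secret"): the parent
reused the page in place, and the child's leftover reference sees the
parent's data.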

And the thing is, blaming vmsplice() is entirely wrong. The exact same
thing used to be able to happen with any GUP case; vmsplice() was just
the simplest way to cause that non-mapped page access. But any GUP
could do it, with the child basically fooling the parent into
revealing data.

Note that Zygote itself is in no way special from a technical
standpoint, and this can happen after any random fork().

The only real difference is that in all *traditional* UNIX cases, this
"child can see the parent's data with trickery before execve()"
situation simply doesn't *matter*.  In traditional fork() situations,
the parent and the child are really the same program, and if you don't
trust the child, then you don't trust the parent either.

The Android Zygote case isn't _technically_ any different. But the
difference is that because the whole idea with Zygote is to pre-map
the JIT stuff for the child, you are in this special situation where
the parent doesn't actually trust the child.

See? No _technical_ difference. Exact same scenario as for any random
fork() with GUP and COW going the wrong way.

It just normally doesn't _matter_.

And see above: because this is not really specific to vmsplice()
(apart from that just being the easiest model), the _original_ fix for
this was just "GUP will break COW early" commit:

   17839856fd58 ("gup: document and work around "COW can break either way" issue")

which is very straightforward: if you do a GUP lookup, you force that
GUP to do the COW for you, so that nobody can then fool another
process into thinking that it has a private page that can be re-used
when the page really has a second reference to it. Because whoever took
the "sneaky" GUP reference had to get their _own_ private copy first.

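The "GUP breaks COW early" idea can be sketched with a toy Python model.
This is only an illustration of the idea as described in this mail, not
the code from that commit; Page, Task, break_cow() and
gup_with_early_cow() are hypothetical names invented for the sketch.

```python
class Page:
    """Toy stand-in for a physical page; not the kernel's struct page."""
    def __init__(self, data):
        self.data = data
        self.mapcount = 0
        self.refcount = 0


class Task:
    """Toy process with a single-page address space."""
    def __init__(self, page):
        self.page = page
        page.mapcount += 1
        page.refcount += 1


def break_cow(task):
    """Give the task a private copy if its page is mapped elsewhere too."""
    old = task.page
    if old.mapcount > 1:
        old.mapcount -= 1
        old.refcount -= 1
        task.page = Page(old.data)
        task.page.mapcount = 1
        task.page.refcount = 1
    return task.page


def gup_with_early_cow(task):
    """Copy first, then take the extra reference on the private copy."""
    page = break_cow(task)
    page.refcount += 1
    return page


def demo_early_cow():
    shared = Page("inherited")
    parent = Task(shared)
    child = Task(shared)              # after fork(), both map the same page

    child_view = gup_with_early_cow(child)   # lands on the child's own copy

    # Parent write: it is now the only mapper, so in-place reuse is harmless.
    assert parent.page.mapcount == 1
    parent.page.data = "parent secret"
    return child_view is shared, child_view.data
```

Here demo_early_cow() returns (False, "inherited"): the child's
reference points at its own copy, so the parent's later write is not
visible through it.
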
But while that approach was very simple and very targeted (and I don't
think it's wrong per se), it then caused other problems.

In fact, it caused other problems for pretty much all the same cases
that the current model causes problems for: all the odd special cases
that do weird things to the VM.

And because these problems were so odd, the alternate solution - and
the thing I'm really pushing for - is to make the _core_ VM rules very
simple and straightforward, and then the odd special cases have to
live with those simple and straightforward rules.

And the most core of those rules is that "page_mapcount()"
fundamentally doesn't matter, because there are other references to
pages that are all equally valid. Thinking that a page being "mapped"
makes it special is wrong, as exemplified by any GUP case (but also as
exemplified by the page cache or the swap cache, which were always a
source of _other_ special cases for the COW code).

So if you accept the notion that "page_mapcount()" being meaningless
is a truism (which Andrea obviously doesn't), then the logical extension
of that is the set of rules I outlined in my reply to Andy:

 (a) COW can never happen "too much", and "page_count()" is the
fundamental "somebody has a reference to this page" test

 (b) page pinning and any other "this needs to be coherent" ends up
being a special per-page "shared memory" case
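
Rule (a) can be sketched with a toy Python model. It is not kernel code
and the names are hypothetical: the refcount field merely stands in for
page_count(), and the write-fault helper assumes the simplest possible
reading of the rule, namely "reuse only if ours is the sole reference,
otherwise copy". Rule (b) is not modelled here.

```python
class Page:
    """Toy stand-in for a physical page; not the kernel's struct page."""
    def __init__(self, data):
        self.data = data
        self.refcount = 0   # stands in for page_count(): every reference counts


def write_fault_refcount_rule(page, new_data):
    """Reuse in place only when the writer holds the sole reference."""
    if page.refcount == 1:
        page.data = new_data
        return page
    page.refcount -= 1                # drop our reference to the shared page
    copy = Page(new_data)             # an extra copy is always safe
    copy.refcount = 1
    return copy


def demo_refcount_rule():
    shared = Page("inherited")
    shared.refcount = 2               # parent and child mappings after fork()

    shared.refcount += 1              # child's GUP-style reference
    child_view = shared
    shared.refcount -= 1              # child drops its mapping

    # Parent write: refcount is 2 (its mapping plus the GUP reference),
    # so the parent gets a fresh copy instead of reusing the page.
    parent_page = write_fault_refcount_rule(shared, "parent secret")
    return parent_page is shared, child_view.data
```

In this model demo_refcount_rule() returns (False, "inherited"): the
extra reference forces a copy, so the child's reference never sees what
the parent wrote.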

That special "shared memory page" thing in (b) is then that rule that
when we pin a page, we make sure it's writable, and stays writable, so
that COW never breaks the association.

That's then the thing that causes problems for anybody who wants to
write-protect stuff.

         Linus
