All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux-MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Yu Zhao <yuzhao@google.com>, Andy Lutomirski <luto@kernel.org>,
	Peter Xu <peterx@redhat.com>, Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Hugh Dickins <hughd@google.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Kees Cook <keescook@chromium.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Jason Gunthorpe <jgg@ziepe.ca>, Jan Kara <jack@suse.cz>,
	Kirill Tkhai <ktkhai@virtuozzo.com>
Subject: Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
Date: Thu, 7 Jan 2021 17:31:29 -0500	[thread overview]
Message-ID: <X/eLwQPd5bi620Vt@redhat.com> (raw)
In-Reply-To: <CAHk-=wjs9v-hp_7HV_TrTmisu7pNX=MwZ62ZV82i0evLhPwS1Q@mail.gmail.com>

On Thu, Jan 07, 2021 at 01:29:43PM -0800, Linus Torvalds wrote:
> On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > The problem is it's not even possible to detect reliably if there's
> > really a long term GUP pin because of speculative pagecache lookups.
> 
> So none of the normal code _needs_ that any more these days, which is
> what I think is so nice. Any pinning will do the COW, and then we have
> the logic to make sure it stays writable, and that keeps everything
> nicely coherent and is all fairly simple.
> 
> And yes, it does mean that if somebody then explicitly write-protects
> a page, it may end up being COW'ed after all, but if you first pinned
> it, and then started playing with the protections of that page, why
> should you be surprised?
> 
> So to me, this sounds like a "don't do that then" situation.
> 
> Anybody who does page pinning and wants coherency should NOT TOUCH THE
> MAPPING IT PINNED.
> 
> (And if you do touch it, it's your own fault, and you get to keep both
> of the broken pieces)
> 
> Now, I do agree that from a QoI standpoint, it would be really lovely
> if we actually enforced it. I'm not entirely sure we can, but maybe it
> would be reasonable to use that
> 
>   mm->has_pinned && page_maybe_dma_pinned(page)
> 
> at least as the beginning of a heuristic.
> 
> In fact, I do think that "page_maybe_dma_pinned()" could possibly be
> made stronger than it is. Because at *THAT* point, we might say "we
> know a pinned page always must have a page_mapcount() of 1" - since as
> part of pinning it and doing the GUP_PIN, we forced the COW, and then
> subsequent fork() operations enforce it too.
> 
> So I do think that it might be possible to make that clear_refs code
> notice "this page is pinned, I can't mark it WP without the pinning
> coherency breaking".
> 
> It might not even be hard. But admittedly I'm somewhat handwaving
> here, and I might not have thought of some situation.

I suppose the objective would be to detect when it's a transient pin
(as an O_DIRECT write) and fail clear_refs with an error for all other
cases of real long term pins that need to keep reading at full PCI
bandwidth, without extra GUP invocations after the wp_copy_page run.

Because of the speculative lookups are making the count unstable, it's
not even enough to use mmu notifier and never use FOLL_GET in GUP to
make it safe again (unlike what I mentioned in a previous email).

Random memory corruption will still silently materialize as result of
the speculative lookups in the above scenario.

My whole point here in starting this new thread to suggest page_count
in do_wp_page is an untenable solution is that such commit silently
broke every single long term PIN user that may be used in combination
of clear_refs since 2013.

Silent memory corruption undetected or a detectable error out of
clear_refs, are both different side effects the same technical ABI
break that rendered clear_refs fundamentally incompatible with
clear_refs, so detecting it or not still an ABI break that is.

I felt obliged to report there's something much deeper and
fundamentally incompatible between the page_count in do_wp_page any
wrprotection of exclusive memory in place, as if used in combination
with any RDMA for example. The TLB flushing and the
mmap_read/write_lock were just the tip of the iceberg and they're not
the main concern anymore.

In addition, the inefficiency caused by the fact the page_count effect
is multiplied by 512 times or 262144 while the mapcount remains 4k
granular, makes me think the page_count is unsuitable to be used there
and this specific cure with page_count in do_wp_page, looks worse than
the initial zygote disease.

Thanks,
Andrea


  parent reply	other threads:[~2021-01-07 22:33 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22   ` Peter Zijlstra
2021-01-04 19:24     ` Andrea Arcangeli
2021-01-04 19:35       ` Nadav Amit
2021-01-04 20:19         ` Andrea Arcangeli
2021-01-04 20:39           ` Nadav Amit
2021-01-04 21:01             ` Andrea Arcangeli
2021-01-04 21:26               ` Nadav Amit
2021-01-05 18:45                 ` Andrea Arcangeli
2021-01-05 19:05                   ` Nadav Amit
2021-01-05 19:45                     ` Andrea Arcangeli
2021-01-05 20:06                       ` Nadav Amit
2021-01-05 21:06                         ` Andrea Arcangeli
2021-01-05 21:43                           ` Peter Xu
2021-01-05  8:13       ` Peter Zijlstra
2021-01-05  8:52         ` Nadav Amit
2021-01-05 14:26           ` Peter Zijlstra
2021-01-05  8:58       ` Peter Zijlstra
2021-01-05  9:22         ` Nadav Amit
2021-01-05 17:58         ` Andrea Arcangeli
2021-01-05 15:08   ` Peter Xu
2021-01-05 18:08     ` Andrea Arcangeli
2021-01-05 18:41       ` Peter Xu
2021-01-05 18:55         ` Andrea Arcangeli
2021-01-05 19:07     ` Nadav Amit
2021-01-05 19:43       ` Peter Xu
2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
2021-01-05 15:08   ` Will Deacon
2021-01-05 18:20   ` Andrea Arcangeli
2021-01-05 19:26     ` Nadav Amit
2021-01-05 20:39       ` Andrea Arcangeli
2021-01-05 21:20         ` Yu Zhao
2021-01-05 21:22         ` Nadav Amit
2021-01-05 22:16           ` Will Deacon
2021-01-06  0:29             ` Andrea Arcangeli
2021-01-06  0:02           ` Andrea Arcangeli
2021-01-07 20:04           ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17               ` Linus Torvalds
2021-01-07 20:17                 ` Linus Torvalds
2021-01-07 20:25                 ` Linus Torvalds
2021-01-07 20:25                   ` Linus Torvalds
2021-01-07 20:58                 ` Andrea Arcangeli
2021-01-07 21:29                   ` Linus Torvalds
2021-01-07 21:29                     ` Linus Torvalds
2021-01-07 21:53                     ` John Hubbard
2021-01-07 22:00                       ` Linus Torvalds
2021-01-07 22:00                         ` Linus Torvalds
2021-01-07 22:14                         ` John Hubbard
2021-01-07 22:20                           ` Linus Torvalds
2021-01-07 22:20                             ` Linus Torvalds
2021-01-07 22:24                             ` Linus Torvalds
2021-01-07 22:24                               ` Linus Torvalds
2021-01-07 22:37                               ` John Hubbard
2021-01-15 11:27                       ` Jan Kara
2021-01-07 22:31                     ` Andrea Arcangeli [this message]
2021-01-07 22:42                       ` Linus Torvalds
2021-01-07 22:42                         ` Linus Torvalds
2021-01-07 22:51                         ` Linus Torvalds
2021-01-07 22:51                           ` Linus Torvalds
2021-01-07 23:48                           ` Andrea Arcangeli
2021-01-08  0:25                             ` Linus Torvalds
2021-01-08  0:25                               ` Linus Torvalds
2021-01-08 12:48                               ` Will Deacon
2021-01-08 16:14                                 ` Andrea Arcangeli
2021-01-08 17:39                                   ` Linus Torvalds
2021-01-08 17:39                                     ` Linus Torvalds
2021-01-08 17:53                                     ` Andrea Arcangeli
2021-01-08 19:25                                       ` Linus Torvalds
2021-01-08 19:25                                         ` Linus Torvalds
2021-01-09  0:12                                         ` Andrea Arcangeli
2021-01-08 17:30                                 ` Linus Torvalds
2021-01-08 17:30                                   ` Linus Torvalds
2021-01-07 23:28                         ` Andrea Arcangeli
2021-01-07 21:36               ` kernel test robot
2021-01-07 21:36                 ` kernel test robot
2021-01-07 20:25             ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32               ` Linus Torvalds
2021-01-07 20:32                 ` Linus Torvalds
2021-01-07 21:05                 ` Linus Torvalds
2021-01-07 21:05                   ` Linus Torvalds
2021-01-07 22:02                   ` Andrea Arcangeli
2021-01-07 22:17                     ` Linus Torvalds
2021-01-07 22:17                       ` Linus Torvalds
2021-01-07 22:56                       ` Andrea Arcangeli
2021-01-09 19:32                   ` Matthew Wilcox
2021-01-09 19:46                     ` Linus Torvalds
2021-01-09 19:46                       ` Linus Torvalds
2021-01-15 14:30                       ` Jan Kara
2021-01-07 21:54                 ` Andrea Arcangeli
2021-01-07 21:45               ` Andrea Arcangeli
2021-01-08 13:36                 ` Jason Gunthorpe
2021-01-08 17:00                   ` Andrea Arcangeli
2021-01-08 18:19                     ` Jason Gunthorpe
2021-01-08 18:31                       ` Andy Lutomirski
2021-01-08 18:31                         ` Andy Lutomirski
2021-01-08 18:38                         ` Linus Torvalds
2021-01-08 18:38                           ` Linus Torvalds
2021-01-08 23:34                         ` Andrea Arcangeli
2021-01-09 19:03                           ` Andy Lutomirski
2021-01-09 19:03                             ` Andy Lutomirski
2021-01-09 19:15                             ` Linus Torvalds
2021-01-09 19:15                               ` Linus Torvalds
2021-01-08 18:59                       ` Linus Torvalds
2021-01-08 18:59                         ` Linus Torvalds
2021-01-08 22:43                       ` Andrea Arcangeli
2021-01-09  0:42                         ` Jason Gunthorpe
2021-01-09  2:50                           ` Andrea Arcangeli
2021-01-11 14:30                             ` Jason Gunthorpe
2021-01-13 21:56                           ` Jerome Glisse
2021-01-13 23:39                             ` Jason Gunthorpe
2021-01-14  2:35                               ` Jerome Glisse
2021-01-09  3:49                       ` Hillf Danton
2021-01-11 14:39                         ` Jason Gunthorpe
2021-01-05 21:55         ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
2021-03-02 22:14   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X/eLwQPd5bi620Vt@redhat.com \
    --to=aarcange@redhat.com \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=keescook@chromium.org \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.