All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@ziepe.ca>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Yu Zhao <yuzhao@google.com>, Andy Lutomirski <luto@kernel.org>,
	Peter Xu <peterx@redhat.com>, Pavel Emelyanov <xemul@openvz.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Kees Cook <keescook@chromium.org>,
	John Hubbard <jhubbard@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>, Jan Kara <jack@suse.cz>,
	Kirill Tkhai <ktkhai@virtuozzo.com>
Subject: Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
Date: Mon, 11 Jan 2021 10:30:41 -0400	[thread overview]
Message-ID: <20210111143041.GI504133@ziepe.ca> (raw)
In-Reply-To: <X/kZ4ETE6LR8jpug@redhat.com>

On Fri, Jan 08, 2021 at 09:50:08PM -0500, Andrea Arcangeli wrote:

> For all those that aren't using mmu notifier and that rely solely on
> page pins, they still require privilege, except they do through /dev/
> permissions.

It is normal that the dev nodes are a+rw so it doesn't really require
privilege in any real sense.

> Actually the mmu notifier doesn't strictly require pins, it only
> requires GUP. All users tend to use FOLL_GET just as a safety
> precaution (I already tried to optimize away the two atomics per GUP,
> but we were naked by the KVM maintainer that didn't want to take the
> risk, I would have, but it's a fair point indeed, obviously it's safer
> with the pin plus the mmu notifier, two is safer than one).

I'm not sure what holding the pin will do to reduce risk?

If you get into a situation where you are stuffing a page into the
SMMU that is not in the CPU's MMU then everything is lost. Holding a
pin while carrying a page from the CPU page table to the SMMU just
ensures that page isn't freed until it is installed, but once
installed you are back to being broken.
 
> I'm not sure how any copy-user could obviate a secondary MMU mapping,
> mappings and copies are mutually exclusive. Any copy would be breaking
> memory coherency in this environment.

Because most places need to copy from user to stable kernel memory
before processing data under user control. You can't just cast a user
controlled pointer to a kstruct and use it - that is very likely a
security bug.

Still, the general version is something like kmap:

  map = user_map_setup(user_ptr, length)
  kptr = user_map_enter(map)
  [use kptr]
  user_map_leave(map, kptr)

And inside it could use mmu notifiers, or gup, or whatever.

user_map_setup() would register the notifier and user_map_enter()
would validate the cache'd page pointer and block cached invalidation
until user_map_leave().

> The primary concern with the mmu notifier in io_uring is the
> take_all_locks latency.

Just enabling mmu_notifier takes a performance hit on the entire
process too, it is not such a simple decision.. We'd need benchmarks
against a database or scientific application to see how negative the
notifier actually becomes.

> The problem with the mmu notifier as an universal solution, for
> example is that it can't wait for I/O completion of O_DIRECT since it
> has no clue where the put_page is to wait for it, otherwise we could
> avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be
> completed before paging or anything can unmap the page under I/O from
> the pagetable.

GPU is already doing something like this, waiting in a notifier
invalidate callback for DMA to finish before allowing invalidate to
complete.

It is horrendously complicated and I'm not sure blocking invalidate
for a long time is actually much better for the MM..

> I see the incompatibility you describe as problem we have today, in
> the present, and that will fade with time.
> 
> Reminds me when we had >4G of RAM and 32bit devices doing DMA. How
> many 32bit devices are there now?

I'm not so sure anymore. A few years ago OpenCAPI and PCI PRI seemed
like good things, but now with experience they carry pretty bad
performance hits to use them. Lots of places are skipping them.

CXL offers another chance at this, so we'll see again in another 5
years or so if it works out. It is not any easy problem to solve from
a HW perspective.

> We're not talking here about any random PCI device, we're talking here
> about very special and very advanced devices that need to have "long
> term" GUP pins in order to operate, not the usual nvme/gigabit device
> where GUP pins are never long term.

Beyond RDMA, netdev's XDP uses FOLL_LONGTERM, so do various video
devices, lots of things related to virtualization like vfio, vdpa and
vhost. I think this is a bit defeatist to say it doesn't matter. If
anything as time goes on it seems to be growing, not shrinking
currently.

> The point is that if you do echo ... >/proc/self/clear_refs on your
> pid, that has any FOLL_LONGTERM on its mm, it'll just cause your
> device driver to go out of sync with the mm. It'll see the old pages,
> before the spurious COWs. The CPU will use new pages (the spurious
> COWs).

But if you do that then clear-refs isn't going to work they way it
thought either - this first needs some explanation for how clear_refs
is supposed to work when DMA WRITE is active on the page.

I'd certainly say causing a loss of synchrony is not acceptable, so if
we keep Linus's version of COW then clear_refs has to not write
protect pages under DMA.

> > secondary-mmu drivers using mmu notifier should not trigger this logic
> > and should not restrict write protect.
> 
> That's a great point. I didn't think the mmu notifier will invalidate
> the secondary MMU and ultimately issue a GUP after the wp_copy_page to
> keep it in sync.

It had better, or mmu notifiers are broken, right?

> The funny thing that doesn't make sense is that wp_copy_page will only
> be invoked because the PIN was left by KVM on the page for that extra
> safety I was talking about earlier.

Yes, with the COW change if kvm cares about this inefficiency it
should not have the unnecessary pin.

> You clearly contemplate the existance of a read mode, long term. That
> is also completely compatible with wrprotection. 

We talked about a read mode, but we didn't flesh it out. It is not
unconditionally compatible with wrprotect - most likely you still
can't write protect a page under READ DMA because when you eventually
take the COW there will be ambiguous situations that will break the
synchrony.

Jason

  reply	other threads:[~2021-01-11 14:31 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
2021-01-04 12:22   ` Peter Zijlstra
2021-01-04 19:24     ` Andrea Arcangeli
2021-01-04 19:35       ` Nadav Amit
2021-01-04 20:19         ` Andrea Arcangeli
2021-01-04 20:39           ` Nadav Amit
2021-01-04 21:01             ` Andrea Arcangeli
2021-01-04 21:26               ` Nadav Amit
2021-01-05 18:45                 ` Andrea Arcangeli
2021-01-05 19:05                   ` Nadav Amit
2021-01-05 19:45                     ` Andrea Arcangeli
2021-01-05 20:06                       ` Nadav Amit
2021-01-05 21:06                         ` Andrea Arcangeli
2021-01-05 21:43                           ` Peter Xu
2021-01-05  8:13       ` Peter Zijlstra
2021-01-05  8:52         ` Nadav Amit
2021-01-05 14:26           ` Peter Zijlstra
2021-01-05  8:58       ` Peter Zijlstra
2021-01-05  9:22         ` Nadav Amit
2021-01-05 17:58         ` Andrea Arcangeli
2021-01-05 15:08   ` Peter Xu
2021-01-05 18:08     ` Andrea Arcangeli
2021-01-05 18:41       ` Peter Xu
2021-01-05 18:55         ` Andrea Arcangeli
2021-01-05 19:07     ` Nadav Amit
2021-01-05 19:43       ` Peter Xu
2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
2021-01-05 15:08   ` Will Deacon
2021-01-05 18:20   ` Andrea Arcangeli
2021-01-05 19:26     ` Nadav Amit
2021-01-05 20:39       ` Andrea Arcangeli
2021-01-05 21:20         ` Yu Zhao
2021-01-05 21:22         ` Nadav Amit
2021-01-05 22:16           ` Will Deacon
2021-01-06  0:29             ` Andrea Arcangeli
2021-01-06  0:02           ` Andrea Arcangeli
2021-01-07 20:04           ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli
2021-01-07 20:04             ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli
2021-01-07 20:17               ` Linus Torvalds
2021-01-07 20:17                 ` Linus Torvalds
2021-01-07 20:25                 ` Linus Torvalds
2021-01-07 20:25                   ` Linus Torvalds
2021-01-07 20:58                 ` Andrea Arcangeli
2021-01-07 21:29                   ` Linus Torvalds
2021-01-07 21:29                     ` Linus Torvalds
2021-01-07 21:53                     ` John Hubbard
2021-01-07 22:00                       ` Linus Torvalds
2021-01-07 22:00                         ` Linus Torvalds
2021-01-07 22:14                         ` John Hubbard
2021-01-07 22:20                           ` Linus Torvalds
2021-01-07 22:20                             ` Linus Torvalds
2021-01-07 22:24                             ` Linus Torvalds
2021-01-07 22:24                               ` Linus Torvalds
2021-01-07 22:37                               ` John Hubbard
2021-01-15 11:27                       ` Jan Kara
2021-01-07 22:31                     ` Andrea Arcangeli
2021-01-07 22:42                       ` Linus Torvalds
2021-01-07 22:42                         ` Linus Torvalds
2021-01-07 22:51                         ` Linus Torvalds
2021-01-07 22:51                           ` Linus Torvalds
2021-01-07 23:48                           ` Andrea Arcangeli
2021-01-08  0:25                             ` Linus Torvalds
2021-01-08  0:25                               ` Linus Torvalds
2021-01-08 12:48                               ` Will Deacon
2021-01-08 16:14                                 ` Andrea Arcangeli
2021-01-08 17:39                                   ` Linus Torvalds
2021-01-08 17:39                                     ` Linus Torvalds
2021-01-08 17:53                                     ` Andrea Arcangeli
2021-01-08 19:25                                       ` Linus Torvalds
2021-01-08 19:25                                         ` Linus Torvalds
2021-01-09  0:12                                         ` Andrea Arcangeli
2021-01-08 17:30                                 ` Linus Torvalds
2021-01-08 17:30                                   ` Linus Torvalds
2021-01-07 23:28                         ` Andrea Arcangeli
2021-01-07 21:36               ` kernel test robot
2021-01-07 21:36                 ` kernel test robot
2021-01-07 20:25             ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe
2021-01-07 20:32               ` Linus Torvalds
2021-01-07 20:32                 ` Linus Torvalds
2021-01-07 21:05                 ` Linus Torvalds
2021-01-07 21:05                   ` Linus Torvalds
2021-01-07 22:02                   ` Andrea Arcangeli
2021-01-07 22:17                     ` Linus Torvalds
2021-01-07 22:17                       ` Linus Torvalds
2021-01-07 22:56                       ` Andrea Arcangeli
2021-01-09 19:32                   ` Matthew Wilcox
2021-01-09 19:46                     ` Linus Torvalds
2021-01-09 19:46                       ` Linus Torvalds
2021-01-15 14:30                       ` Jan Kara
2021-01-07 21:54                 ` Andrea Arcangeli
2021-01-07 21:45               ` Andrea Arcangeli
2021-01-08 13:36                 ` Jason Gunthorpe
2021-01-08 17:00                   ` Andrea Arcangeli
2021-01-08 18:19                     ` Jason Gunthorpe
2021-01-08 18:31                       ` Andy Lutomirski
2021-01-08 18:31                         ` Andy Lutomirski
2021-01-08 18:38                         ` Linus Torvalds
2021-01-08 18:38                           ` Linus Torvalds
2021-01-08 23:34                         ` Andrea Arcangeli
2021-01-09 19:03                           ` Andy Lutomirski
2021-01-09 19:03                             ` Andy Lutomirski
2021-01-09 19:15                             ` Linus Torvalds
2021-01-09 19:15                               ` Linus Torvalds
2021-01-08 18:59                       ` Linus Torvalds
2021-01-08 18:59                         ` Linus Torvalds
2021-01-08 22:43                       ` Andrea Arcangeli
2021-01-09  0:42                         ` Jason Gunthorpe
2021-01-09  2:50                           ` Andrea Arcangeli
2021-01-11 14:30                             ` Jason Gunthorpe [this message]
2021-01-13 21:56                           ` Jerome Glisse
2021-01-13 23:39                             ` Jason Gunthorpe
2021-01-14  2:35                               ` Jerome Glisse
2021-01-09  3:49                       ` Hillf Danton
2021-01-11 14:39                         ` Jason Gunthorpe
2021-01-05 21:55         ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu
2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
2021-03-02 22:14   ` Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210111143041.GI504133@ziepe.ca \
    --to=jgg@ziepe.ca \
    --cc=aarcange@redhat.com \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jhubbard@nvidia.com \
    --cc=keescook@chromium.org \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.