From: Mike Rapoport <rppt@kernel.org>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Xu <peterx@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
Date: Thu, 17 Feb 2022 23:15:46 +0200	[thread overview]
Message-ID: <Yg67AoSBMNM4JVvP@kernel.org> (raw)
In-Reply-To: <F195F8B6-05C4-45BC-BA10-632CA3699941@gmail.com>

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
> 
> 
> > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > Thanks for explaining.
> > 
> > I also dug out the discussion thread between you and Mike, and that's a good
> > one too, summarizing the problems:
> > 
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> > 
> > Scenario 4 is kind of special imho among all of those, because it's the only
> > one that can be worked around by the user application by copying pages one by
> > one.  I know you were even leveraging io_uring in your local tree, so that's
> > probably not a solution at all for you.  But I'm just trying to start thinking
> > without that scenario for now.
> > 
> > Per my understanding, a major issue with the rest of the scenarios is that the
> > ordering of uffd messages may not match the order in which things actually
> > happen.  This actually contains two problems.
> > 
> > First of all, mmap_sem is mostly held for read for all page faults and for
> > most of the mm changes except e.g. fork, so we can never serialize them.  Not
> > to mention that uffd events release mmap_sem between prep and completion.
> > Let's call it problem 1.
> > 
> > The other problem, problem 2, is that we can never serialize faults against
> > events.
> > 
> > For problem 1, I do sense that mmap_sem is just not suitable for the uffd
> > scenario.  Say, we grant concurrency with most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency.  I'm wondering whether that means uffd may need its own semaphore
> > to achieve this.  So for all the events that uffd cares about we take a write
> > lock on a new uffd_sem after mmap_sem, and we don't release that uffd_sem
> > after the prep of events, not until completion (when the message is read).
> > It'll slow down uffd-tracked systems but guarantees ordering.
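
A purely hypothetical sketch of that idea (none of these names exist in the
kernel today and the locking details are glossed over): events that must stay
ordered take a new per-context rw_semaphore for write from prep until the
message is read, while the fault path takes it for read.

	/*
	 * Hypothetical sketch only; lock ordering would be mmap_lock first,
	 * then the new per-context semaphore, and the write side stays held
	 * from event prep until the message is consumed by read().
	 */
	#include <linux/rwsem.h>

	struct uffd_ctx_sketch {
		struct rw_semaphore event_sem;	/* the proposed uffd_sem */
	};

	static void uffd_event_prep(struct uffd_ctx_sketch *ctx)
	{
		down_write(&ctx->event_sem);	/* taken after mmap_lock */
		/* ... queue the event message; mmap_lock may be dropped ... */
	}

	static void uffd_event_read(struct uffd_ctx_sketch *ctx)
	{
		/* userspace has consumed the event message via read() */
		up_write(&ctx->event_sem);
	}

	static void uffd_fault_handle(struct uffd_ctx_sketch *ctx)
	{
		down_read(&ctx->event_sem);	/* order faults against events */
		/* ... queue the page-fault message ... */
		up_read(&ctx->event_sem);
	}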
> 
> Peter,
> 
> Thanks for finding the time and looking into the issues that I encountered.
> 
> Your approach sounds possible, but it seems to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).
> 
> > 
> > In the meantime, I'm wildly wondering whether we can tackle the other problem
> > by merging the page fault queue with the event queue, aka event_wqh and
> > fault_pending_wqh.  Obviously we'll need to identify the messages at read()
> > time and conditionally move them into fault_wqh only if they come from page
> > faults, but that seems doable?
> 
> This, I guess, is necessary in addition to your aforementioned proposal of a
> protecting semaphore; together they can do the trick.
> 
> While I have your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)
> 
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
> 
> I think it is not too hard to address by changing the API. For instance, if
> the uffd-ctx had a uffd-generation that increased on each event, the user
> could provide an ioctl-generation as part of the copy/zero/etc ioctls, and the
> kernel would let the operation succeed only if the uffd-generation is lower
> than or equal to the one provided by the user.

Do you mean that if there were page faults with generations 1 and 3 and,
say, MADV_DONTNEED with generation 2, then even if the uffd copy that resolves
page fault 1 races with MADV_DONTNEED it will go through and the copy for
page fault 3 will fail?

But how would you order zapping the pages and copying into them internally?
Or maybe my understanding of your idea is completely off?

As for the technicalities of adding a generation to uffd_msg and to
uffdio_{copy,zero,etc}, we can use the __u32 reserved field in the former and
32 bits from mode in the latter, with a bit of care for wraparound.
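
To make that concrete, here is a minimal sketch of what it could look like;
the generation itself is hypothetical and is not part of the current uffd ABI:

	/*
	 * Hypothetical sketch only: reuse the __u32 "reserved" field of
	 * struct uffd_msg to report the context generation, and the upper
	 * 32 bits of uffdio_copy.mode to pass it back with the ioctl.
	 */
	#include <linux/userfaultfd.h>

	#define UFFDIO_GEN_SHIFT	32

	/* userspace side: echo back the generation from the last message */
	static inline void uffdio_copy_set_gen(struct uffdio_copy *copy,
					       __u32 gen)
	{
		copy->mode |= (__u64)gen << UFFDIO_GEN_SHIFT;
	}

	/*
	 * kernel side: succeed only if no newer event was generated, using
	 * a wraparound-safe comparison in the spirit of time_before()
	 */
	static inline bool uffd_gen_allows(__u32 ctx_gen, __u32 ioctl_gen)
	{
		return (__s32)(ioctl_gen - ctx_gen) >= 0;
	}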
 
> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let’s put
> aside for a second why). Tracking these events can be done with ptrace or
> perf_event_open() but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.

This sounds like opening Pandora's box ;-)

I think it's possible to trace userfaultfd events to some extent with probes
at the entry of userfaultfd_event_wait_completion() and at handle_userfault().
The "interesting" information is passed to these functions as parameters,
and I believe all the data can be extracted with tools like bpftrace.
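
For example, a rough and untested kprobe-module sketch along those lines
(bpftrace could attach to the same symbols; argument access here assumes an
architecture that provides regs_get_kernel_argument()):

	#include <linux/module.h>
	#include <linux/kprobes.h>
	#include <linux/ptrace.h>
	#include <linux/sched.h>

	static int uffd_fault_pre(struct kprobe *p, struct pt_regs *regs)
	{
		/* handle_userfault(struct vm_fault *vmf, unsigned long reason) */
		pr_info("uffd fault: comm=%s pid=%d reason=%#lx\n",
			current->comm, current->pid,
			regs_get_kernel_argument(regs, 1));
		return 0;
	}

	static struct kprobe uffd_fault_kp = {
		.symbol_name	= "handle_userfault",
		.pre_handler	= uffd_fault_pre,
	};

	static int __init uffd_trace_init(void)
	{
		/* userfaultfd_event_wait_completion() could be probed likewise */
		return register_kprobe(&uffd_fault_kp);
	}

	static void __exit uffd_trace_exit(void)
	{
		unregister_kprobe(&uffd_fault_kp);
	}

	module_init(uffd_trace_init);
	module_exit(uffd_trace_exit);
	MODULE_LICENSE("GPL");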
 
> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems like a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support in mind, as adding it at a later stage
> can be tricky.

It's not too far-fetched to have nested userfaultfd contexts even now. If
CRIU needs to post-copy restore a process that itself uses userfaultfd, it
will have to deal with nested uffds.
 
> Thanks again,
> Nadav

-- 
Sincerely yours,
Mike.

