Re: userfaultfd: usability issue due to lack of UFFD events ordering

From: Peter Xu <peterx@redhat.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
Date: Wed, 16 Feb 2022 16:27:07 +0800
Message-ID: <Ygy1Ww7HMAlxP7ea@xz-m1.local>
In-Reply-To: <F195F8B6-05C4-45BC-BA10-632CA3699941@gmail.com>

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
> 
> 
> > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > Thanks for explaining.
> > 
> > I also dug out the discussion threads between you and Mike, and that's a
> > good one too, summarizing the problems:
> > 
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> > 
> > Scenario 4 is kind of special imho among all those, because that's the only
> > one that the user application can work around, by copying pages one by one.
> > I know you were even leveraging iouring in your local tree, so that's probably
> > not a solution at all for you. But I'm just trying to start thinking without
> > that scenario for now.
> > 
> > Per my understanding, a major issue regarding the rest of the scenarios is
> > ordering of uffd messages may not match with how things are happening.  This
> > actually contains two problems.
> > 
> > First of all, mmap_sem is mostly held read for all page faults and most of the
> > mm changes except e.g. fork, then we can never serialize them.  Not to mention
> > uffd events releases mmap_sem within prep and completion.  Let's call it
> > problem 1.
> > 
> > The other problem 2 is we can never serialize faults against events.
> > 
> > For problem 1, I do sense that mmap_sem is just not suitable for the uffd
> > scenario. Say, we grant concurrency to most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency.  I'm wondering whether it means uffd may need its own semaphore to
> > achieve this.  So for all events that uffd cares we take write lock on a new
> > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> > events, not until completion (the message is read).  It'll slow down uffd
> > tracked systems but guarantees ordering.
> 
> Peter,
> 
> Thanks for finding the time and looking into the issues that I encountered.
> 
> Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).

Right, it's unsafe if taken after mmap_sem.  To do this, IIUC, we need to take
it before mmap_sem, so that we can release mmap_sem while still holding it.

In my mind that could be a feature bit UFFD_FEATURE_STRICT_ORDERING, when it's
set then the mm bound to the userfaultfd file will have a flag set within the
mm->flags, let's say MMF_UFFD_STRICT_ORDER.

Then for uffd-related syscalls like fork(), mremap() and so on we conditionally
take that uffd_sem, and we need to do that before mmap_sem.  We take it for
write in all the uffd event contexts, and for read in all the uffd page faults.

But even if the above would work in principle, I have little confidence that
it'll work in reality.  Firstly, it already looks odd that a uffd lock needs to
be taken before the whole mm's lock, which starts to affect common workloads
that don't even use uffd (even the flag lookup could touch an extra cacheline,
I think, though I'm not sure how much slower it would be).  Not to mention that
it should greatly slow down the tracee process.  It definitely needs more
thought anyway.

> 
> > 
> > In the meantime, I'm wildly thinking whether we can tackle the other
> > problem by merging the page fault queue with the event queue, aka, event_wqh
> > and fault_pending_wqh.  Obviously we'll need to identify the messages when
> > read() and conditionally move them into fault_wqh only if they come from page
> > faults, but that seems doable?
> 
> This, I guess, in addition to your aforementioned proposal to have some
> protecting semaphore, can do the trick.
> 
> While I got your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)

Heh.

> 
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
> 
> I think it is not too hard to address by changing the API. For instance, if
> uffd-ctx had a uffd-generation that would increase on each event, the user
> could have provided an ioctl-generation as part of copy/zero/etc ioctls, and
> the ioctl copy/zero/etc operation would succeed only if the uffd-generation
> is lower than or equal to the one provided by the user.

Assuming that gen_id is copied over from the uffd message, and that the counter
only increases, I don't understand how it can be lower than the one the user
provided.

I don't quite get how that solves your problem either, since -EAGAIN can still
trigger.  I must have missed something.
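
For reference, this is the check as I read it from your description.  It is a
sketch only: struct uffd_ctx_model, its generation field and both helpers are
hypothetical, and I'm assuming the user takes the generation from the last
message it read.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical: neither the struct nor the field exists in the uffd API. */
struct uffd_ctx_model {
	uint64_t generation;	/* bumped once for every queued event */
};

/*
 * Called when an event (fork, mremap, ...) is queued.  Returns the
 * generation that would be carried to userspace in the event message.
 */
uint64_t uffd_queue_event(struct uffd_ctx_model *ctx)
{
	return ++ctx->generation;
}

/*
 * Check done by copy/zero/etc: succeed only if the context generation is
 * lower than or equal to the one the user passed in.
 */
int uffd_ioctl_check(const struct uffd_ctx_model *ctx, uint64_t user_gen)
{
	return ctx->generation <= user_gen ? 0 : -EAGAIN;
}
```

If user_gen always comes from a message, the context generation can never be
lower than it, so the check reduces to equality, and any event queued after
the user's last read still gets -EAGAIN.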

> 
> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let’s put
> aside for a second why). Tracking these events can be done with ptrace or
> perf_event_open() but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.
> 
> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems as a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support as adding it in a later stage can be
> tricky.

2) and 3) definitely need more thought.

PS: I think I first read your name from a paper on the nested virt. :-) But I
forgot the details.

> 
> 4. Missing state. It would be useful to provide the TID of the faulting
> thread. I will send a patch for this one once I get the necessary
> internal approvals.

Before I fully digest your reply and the problems, I want to make sure you are
aware of UFFD_FEATURE_THREAD_ID.  I don't know how you missed it, but it does
sound like what you want.

Thanks,

-- 
Peter Xu