linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linux-MM <linux-mm@kvack.org>
Subject: Re: userfaultfd: usability issue due to lack of UFFD events ordering
Date: Thu, 10 Feb 2022 10:42:43 -0800	[thread overview]
Message-ID: <BC9D9187-1777-4336-AFA4-CD34208DF31E@gmail.com> (raw)
In-Reply-To: <YgTDTjrhoiyH4ZTr@xz-m1.local>



> On Feb 9, 2022, at 11:48 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> Hi, Nadav & all,
> 
> On Mon, Jan 31, 2022 at 02:39:01PM -0800, Nadav Amit wrote:
>> There are use-cases in which you do need to know the order between
>> user-initiated MADV_DONTNEED and page-faults. For instance, if you
>> build a userspace paging mechanism, you need to know whether the
>> page content is zero or whatever is held in the disk.
> 
> When there's no uffd monitor, concurrent page faults with MADV_DONTNEED can
> already result in undefined behavior, right? Assuming the page fault is a write
> with non-zero data, then the page can either contain zero or non-zero data
> at last, IIUC.
> 
> If above is true, I'm wondering whether it's already impossible to do it right
> when there is an uffd monitor?

I think that the MADV_DONTNEED/PF-resolution “race" only affects usage-models
that handle the page-fault concurrently with UFFD-monitoring (using multiple
monitor threads or IO-uring as I try to do). At least for use-cases such as
live-migration.

I think the scenario you have in mind is the following, which is resolved
with mmap_changing that Mike introduced some time ago:

  UFFD monitor		App thread #0		App thread #1
  ------------		-------------		-------------
  						#PF
  UFFD Read
   [#PF]						
			MADV_DONTNEED
			 mmap_changing = 1

			 userfaultfd_event_wait_completion()
			  [queue event, wait]  
  UFFD-copy
   -EAGAIN since mmmap_changing > 0


mmap_changing will keep being elevated, and UFFD-copy not served (fail) until
the monitor reads the UFFD event. The monitor, in this scenario, is single
threaded and therefore orders UFFD-read and UFFD-copy, preventing them from
racing.

Assuming the monitor is smart enough to reevaluate the course of action after
MADV_DONTNEED is handled, it should be safe. Personally, I do not like the
burden that this scheme puts on the monitor, the fact it needs to retry or
even the return value [I think it should be EBUSY since immediate retry would
fail. With IO-uring, EAGAIN triggers an immediate retry, which is useless.]
Yet, concurrent UFFD-event/#PF can be handled properly by a smart monitor.

*However*, userfaultfd events seem as very hard to use (to say the least) in
the following cases:

1. The UFFD-copy is issued by one thread and the UFFD-read is performed by
   another. For me this is the most painful even if you may consider it
   as “unorthodox”. It is very useful for performance, especially if the
   UFFD-copy is large.

2. If the race is between 2 userfaultfd *events*. The events might not be
   properly ordered (i.e., the order in which they are read does not reflect
   the order in which they occurred) despite the use of *_userfaultfd_prep(),
   since they are only queued (to be reported and trigger wake) by
   userfaultfd_event_wait_completion(), after the VMA and PTEs were updated
   and more importantly after mmap-lock was dropped.

   This means that if you have fork and MADV_DONTNEED, the monitor might see
   their order inverted, and won’t be able to know whether the child has the
   pages zapped or not.

   Other races are possible too, for instance between mremap() and munmap().
   In most cases the monitor might be able (with quite some work) to
   figure out that the order of the events it received does not make sense
   and the events must have been reordered. Yet, implementing something like
   that is far from trivial and there are some cases that are probably
   impossible to resolve just based on the UFFD read events.

I personally addressed this issue with seccomp+ptrace to trap on
entry/exit to relevant syscalls (e.g., munmap, mmap, fork), and
prevent concurrent calls to obtain correct order of the events. It is far
from trivial and introduces some overheads, but I did not find a better
solution.

Let me know if I am missing anything.

Thanks,
Nadav

  reply	other threads:[~2022-02-10 18:42 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-30  6:23 userfaultfd: usability issue due to lack of UFFD events ordering Nadav Amit
2022-01-31 10:42 ` Mike Rapoport
2022-01-31 10:48   ` David Hildenbrand
2022-01-31 14:05     ` Mike Rapoport
2022-01-31 14:12       ` David Hildenbrand
2022-01-31 14:28         ` Mike Rapoport
2022-01-31 14:41           ` David Hildenbrand
2022-01-31 18:47             ` Mike Rapoport
2022-01-31 22:39               ` Nadav Amit
2022-02-01  9:10                 ` Mike Rapoport
2022-02-10  7:48                 ` Peter Xu
2022-02-10 18:42                   ` Nadav Amit [this message]
2022-02-14  4:02                     ` Peter Xu
2022-02-15 22:35                       ` Nadav Amit
2022-02-16  8:27                         ` Peter Xu
2022-02-17 21:15                         ` Mike Rapoport
2022-01-31 17:23   ` Nadav Amit
2022-01-31 17:28     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BC9D9187-1777-4336-AFA4-CD34208DF31E@gmail.com \
    --to=nadav.amit@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=david@redhat.com \
    --cc=linux-mm@kvack.org \
    --cc=peterx@redhat.com \
    --cc=rppt@kernel.org \
    --cc=rppt@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).