Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races

From: Nadav Amit <nadav.amit@gmail.com>
To: Mike Rapoport <rppt@linux.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Pavel Emelyanov <xemul@virtuozzo.com>,
	Andrei Vagin <avagin@virtuozzo.com>
Subject: Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races
Date: Sun, 6 Dec 2020 20:31:39 -0800	[thread overview]
Message-ID: <5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com> (raw)
In-Reply-To: <20201206093703.GY123287@linux.ibm.com>

Thanks for the detailed answer, Mike. Things are clearer in regard to your
intention.

> On Dec 6, 2020, at 1:37 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> The uffd monotor should *know* what is the state of child's memory and
> without this patch it could only guess.

I see - so mmap_changing is not just about graceful handling of copy-ioctl’s
(which I think monitors could handle before mmap_changing was introduced)
but to allow the monitor to know which pages are mapped in each process.
Makes sense, but I have strong doubts it really works (see below).

>> 2. How is memory ordering supposed to work here? IIUC, mmap_changing is not
>> protected by any lock and there are no memory barriers that are associated
>> with the assignment. Indeed, the code calls WRITE_ONCE()/READ_ONCE(), but
>> AFAIK this does not guarantee ordering with non-volatile reads/writes.
> 
> There is also mmap_lock involved, so I don't see how copy can start in
> parallel with fork processing. Fork sets mmap_chaning to true while
> holding mmap_lock, so copy cannot start in parallel. When mmap_lock is
> realeased, mmap_chaning remains true until fork event is pushed to
> userspace and when this is done there is no issue with
> userfaultfd_copy.

Whenever I run into a non-standard and non-trivial synchronization algorithm
in the kernel (and elsewhere), I become very confused and concerned. I
raised my question since I wanted to modify the code and could not figure
out how to properly do so. Based on your input that the monitor is expected
to know the child mappings according to userfaultfd events, I now think that
the kernel does not provide this ability and the locking scheme is broken.

Here are some scenarios that I think are broken - please correct me if I am
wrong:

* Scenario 1: MADV_DONTNEED racing with userfaultfd page-faults

userfaultfd_remove() only holds the mmap_lock for read, so these events
cannot be ordered with userfaultfd page-faults.

* Scenario 2: MADV_DONTNEED racing with fork()

As userfaultfd_remove() releases mmap_lock after the user notification and
before the actual unmapping, concurrent fork() might happen before or after
the actual unmapping in MADV_DONTNEED and the user therefore has no way of
knowing whether the actual unmapping took place before or after the fork().

* Scenario 3: Concurrent MADV_DONTNEED can cause userfaultfd_remove() to
clear mmap_changing cleared before all the notifications are completed.

As mmap_lock is only taken for read, the first thread the completed
userfaultfd_remove() would clear the indication that was set by the other
one.

* Scenario 4: Fork starts and ends between copying of two pages.

As mmap_lock might be released during ioctl_copy() (inside
__mcopy_atomic()), some pages might be mapped in the child and others not:

CPU0				CPU1
----				----
ioctl_copy():
 __mcopy_atomic()
  mmap_read_lock()
  !mmap_changing [ok]
  mfill_atomic_pte() == 0 [page0 copied]
  mfill_atomic_pte() == -ENOENT [page1 will be retried]
  mmap_read_unlock()
  goto retry

				fork():
				 dup_userfaultfd()
				 -> mmap_changing=true
				 userfaultfd_event_wait_completion()
				 -> mmap_changing=false

  mmap_read_lock()
  !mmap_changing [ok]
  mfill_atomic_pte() == 0 [page1 copied]
  mmap_read_unlock()

 return: 2 pages were mapped, while the first is present in the child and
 the second one is non-present.

Bottom-line: it seems to me that mmap_changing should be a counter (not
boolean) that is protected by mmap_lock. This counter should be kept
elevated throughout the entire operation (in regard to MADV_DONTNEED).
Perhaps mmap_lock does not have to be taken to decrease the counter, but
then an smp_wmb() would be needed before the counter is decreased.

Let me know whether I am completely off or missing something.

Thanks,
Nadav