All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nadav Amit <nadav.amit@gmail.com>
To: Mike Rapoport <rppt@linux.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Pavel Emelyanov <xemul@virtuozzo.com>,
	Andrei Vagin <avagin@virtuozzo.com>
Subject: Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races
Date: Tue, 8 Dec 2020 00:57:30 -0800	[thread overview]
Message-ID: <83A6D439-F732-4112-BD1F-00195EBFCE4C@gmail.com> (raw)
In-Reply-To: <20201208083434.GA1164013@linux.ibm.com>

> On Dec 8, 2020, at 12:34 AM, Mike Rapoport <rppt@linux.ibm.com> wrote:
> 
> On Sun, Dec 06, 2020 at 08:31:39PM -0800, Nadav Amit wrote:
>> Whenever I run into a non-standard and non-trivial synchronization algorithm
>> in the kernel (and elsewhere), I become very confused and concerned. I
>> raised my question since I wanted to modify the code and could not figure
>> out how to properly do so. Based on your input that the monitor is expected
>> to know the child mappings according to userfaultfd events, I now think that
>> the kernel does not provide this ability and the locking scheme is broken.
>> 
>> Here are some scenarios that I think are broken - please correct me if I am
>> wrong:
>> 
>> * Scenario 1: MADV_DONTNEED racing with userfaultfd page-faults
>> 
>> userfaultfd_remove() only holds the mmap_lock for read, so these events
>> cannot be ordered with userfaultfd page-faults.
>> 
>> * Scenario 2: MADV_DONTNEED racing with fork()
>> 
>> As userfaultfd_remove() releases mmap_lock after the user notification and
>> before the actual unmapping, concurrent fork() might happen before or after
>> the actual unmapping in MADV_DONTNEED and the user therefore has no way of
>> knowing whether the actual unmapping took place before or after the fork().
>> 
>> * Scenario 3: Concurrent MADV_DONTNEED can cause userfaultfd_remove() to
>> clear mmap_changing cleared before all the notifications are completed.
>> 
>> As mmap_lock is only taken for read, the first thread the completed
>> userfaultfd_remove() would clear the indication that was set by the other
>> one.
>> 
>> * Scenario 4: Fork starts and ends between copying of two pages.
>> 
>> As mmap_lock might be released during ioctl_copy() (inside
>> __mcopy_atomic()), some pages might be mapped in the child and others not:
>> 
>> 
>> CPU0				CPU1
>> ----				----
>> ioctl_copy():
>> __mcopy_atomic()
>>  mmap_read_lock()
>>  !mmap_changing [ok]
>>  mfill_atomic_pte() == 0 [page0 copied]
>>  mfill_atomic_pte() == -ENOENT [page1 will be retried]
>>  mmap_read_unlock()
>>  goto retry
>> 
>> 				fork():
>> 				 dup_userfaultfd()
>> 				 -> mmap_changing=true
>> 				 userfaultfd_event_wait_completion()
>> 				 -> mmap_changing=false
>> 
>>  mmap_read_lock()
>>  !mmap_changing [ok]
>>  mfill_atomic_pte() == 0 [page1 copied]
>>  mmap_read_unlock()
>> 
>> return: 2 pages were mapped, while the first is present in the child and
>> the second one is non-present.
>> 
>> Bottom-line: it seems to me that mmap_changing should be a counter (not
>> boolean) that is protected by mmap_lock. This counter should be kept
>> elevated throughout the entire operation (in regard to MADV_DONTNEED).
>> Perhaps mmap_lock does not have to be taken to decrease the counter, but
>> then an smp_wmb() would be needed before the counter is decreased.
>> 
>> Let me know whether I am completely off or missing something.
> 
> I tried to remember what's going on there and wrap my head around your
> examples. I'm not sure if userspace cannot workaround some of those, but
> I can't say I can propose it right now.
> 
> There is for sure userspace is helpless in Scenario 4, but I think it is
> very unlikely that fork() will be fast enough to grab and release
> mmap_lock while uffd_copy() waits for CPU to retry.
> 
> I agree that a making mmap_changing a counter would be more robust
> anyway.

Thanks for confirming my suspicion.

On a second thought, I think that a sequence lock would be required. I will
work on a patch to resolve it in the next RFC of the related patch series I
am working on.

As for the race window size, as there are lock optimizations to prevent
writers' starvation, I do not think the last scenario is completely
far-fetched.

Thanks again,
Nadav

      reply	other threads:[~2020-12-08  8:58 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-23  7:42 [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races Mike Rapoport
2018-05-24 11:24 ` Pavel Emelyanov
2018-05-24 11:56   ` Mike Rapoport
2018-05-24 16:40     ` Pavel Emelyanov
2018-05-24 19:06       ` Mike Rapoport
2018-05-25 14:05         ` Pavel Emelyanov
2020-12-03 19:57 ` Nadav Amit
2020-12-06  9:37   ` Mike Rapoport
2020-12-07  4:31     ` Nadav Amit
2020-12-08  8:34       ` Mike Rapoport
2020-12-08  8:57         ` Nadav Amit [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=83A6D439-F732-4112-BD1F-00195EBFCE4C@gmail.com \
    --to=nadav.amit@gmail.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=avagin@virtuozzo.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=rppt@linux.ibm.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=xemul@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.