userfaultfd: usability issue due to lack of UFFD events ordering

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* userfaultfd: usability issue due to lack of UFFD events ordering
@ 2022-01-30  6:23 Nadav Amit
  2022-01-31 10:42 ` Mike Rapoport
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2022-01-30  6:23 UTC (permalink / raw)
  To: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu; +Cc: Linux-MM

Using userfautlfd and looking at the kernel code, I encountered a usability
issue that complicates userspace UFFD-monitor implementation. I obviosuly
might be wrong, so I would appreciate a (polite?) feedback. I do have a
userspace workaround, but I thought it is worthy to share and to hear your
opinion, as well as feedback from other UFFD users.

The issue I encountered regards the ordering of UFFD events tbat might not
reflect the actual order in which events took place.

In more detail, UFFD events (e.g., unmap, fork) are not ordered against
themselves [*]. The mm-lock is dropped before notifying the userspace
UFFD-monitor, and therefore there is no guarantee as to whether the order of
the events actually reflects the order in which the events took place. This
can prevent a UFFD-monitor from using the events to track which ranges are
mapped. Specifically, UFFD_EVENT_FORK message and a UFFD_EVENT_UNMAP message
(which reflects unmap in the parent process) can be reordered, if the events
are triggered by two different threads. In this case the UFFD-monitor cannot
figure from the events whether the child process has the unmapped memory
range still mapped (because fork happened first) or not.

Obviously, it does not make sense to keep holding mm-lock while notifying the
user, as it can even lead to deadlocks. Userspace UFFD-monitors can
workaround this issue by using seccomp+ptrace instead of UFFD-events to
obtain order of the events or examine /proc/[pid]/smaps. Yet, this introduces
overheads, is complicated, and I doubt anyone does so. I wonder if the API is
reasonable, or whether I am missing something.

Thanks,
Nadav

[*] Note that I do not discuss UFFD-monitor issued ioctl's, but the order
    between UFFD-events.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-30  6:23 userfaultfd: usability issue due to lack of UFFD events ordering Nadav Amit
@ 2022-01-31 10:42 ` Mike Rapoport
  2022-01-31 10:48   ` David Hildenbrand
  2022-01-31 17:23   ` Nadav Amit
  0 siblings, 2 replies; 18+ messages in thread
From: Mike Rapoport @ 2022-01-31 10:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

Hi Nadav,

On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
> Using userfautlfd and looking at the kernel code, I encountered a usability
> issue that complicates userspace UFFD-monitor implementation. I obviosuly
> might be wrong, so I would appreciate a (polite?) feedback. I do have a
> userspace workaround, but I thought it is worthy to share and to hear your
> opinion, as well as feedback from other UFFD users.
> 
> The issue I encountered regards the ordering of UFFD events tbat might not
> reflect the actual order in which events took place.
> 
> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
> themselves [*]. The mm-lock is dropped before notifying the userspace
> UFFD-monitor, and therefore there is no guarantee as to whether the order of
> the events actually reflects the order in which the events took place.
> This can prevent a UFFD-monitor from using the events to track which
> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
> be reordered, if the events are triggered by two different threads. In
> this case the UFFD-monitor cannot figure from the events whether the
> child process has the unmapped memory range still mapped (because fork
> happened first) or not.

Yeah, it seems that something like this is possible:


fork()					munmap()
	mmap_write_unlock();
						mmap_write_lock_killable();
						do_things();
						mmap_{read,write}_unlock();
						userfaultfd_unmap_complete();
	dup_userfaultfd_complete();

A solution could be to split uffd_*_complete() to two parts: one that
queues up the event message and the second one that waits for it to be read
by the monitor. The first part then can run befor mm-lock is released.

If you can think of something nicer, it'll be really great!

> Obviously, it does not make sense to keep holding mm-lock while notifying the
> user, as it can even lead to deadlocks. Userspace UFFD-monitors can
> workaround this issue by using seccomp+ptrace instead of UFFD-events to
> obtain order of the events or examine /proc/[pid]/smaps. Yet, this introduces
> overheads, is complicated, and I doubt anyone does so. I wonder if the API is
> reasonable, or whether I am missing something.
> 
> Thanks,
> Nadav
> 
> [*] Note that I do not discuss UFFD-monitor issued ioctl's, but the order
>     between UFFD-events.
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 10:42 ` Mike Rapoport
@ 2022-01-31 10:48   ` David Hildenbrand
  2022-01-31 14:05     ` Mike Rapoport
  2022-01-31 17:23   ` Nadav Amit
  1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2022-01-31 10:48 UTC (permalink / raw)
  To: Mike Rapoport, Nadav Amit
  Cc: Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

On 31.01.22 11:42, Mike Rapoport wrote:
> Hi Nadav,
> 
> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
>> Using userfautlfd and looking at the kernel code, I encountered a usability
>> issue that complicates userspace UFFD-monitor implementation. I obviosuly
>> might be wrong, so I would appreciate a (polite?) feedback. I do have a
>> userspace workaround, but I thought it is worthy to share and to hear your
>> opinion, as well as feedback from other UFFD users.
>>
>> The issue I encountered regards the ordering of UFFD events tbat might not
>> reflect the actual order in which events took place.
>>
>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
>> themselves [*]. The mm-lock is dropped before notifying the userspace
>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
>> the events actually reflects the order in which the events took place.
>> This can prevent a UFFD-monitor from using the events to track which
>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
>> be reordered, if the events are triggered by two different threads. In
>> this case the UFFD-monitor cannot figure from the events whether the
>> child process has the unmapped memory range still mapped (because fork
>> happened first) or not.
> 
> Yeah, it seems that something like this is possible:
> 
> 
> fork()					munmap()
> 	mmap_write_unlock();
> 						mmap_write_lock_killable();
> 						do_things();
> 						mmap_{read,write}_unlock();
> 						userfaultfd_unmap_complete();
> 	dup_userfaultfd_complete();
> 

I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE
racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in
read mode. But not sure if they apply. The fork() vs. munmap() is
somewhat "obviously problematic" :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 10:48   ` David Hildenbrand
@ 2022-01-31 14:05     ` Mike Rapoport
  2022-01-31 14:12       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Rapoport @ 2022-01-31 14:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote:
> On 31.01.22 11:42, Mike Rapoport wrote:
> > Hi Nadav,
> > 
> > On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
> >> Using userfautlfd and looking at the kernel code, I encountered a usability
> >> issue that complicates userspace UFFD-monitor implementation. I obviosuly
> >> might be wrong, so I would appreciate a (polite?) feedback. I do have a
> >> userspace workaround, but I thought it is worthy to share and to hear your
> >> opinion, as well as feedback from other UFFD users.
> >>
> >> The issue I encountered regards the ordering of UFFD events tbat might not
> >> reflect the actual order in which events took place.
> >>
> >> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
> >> themselves [*]. The mm-lock is dropped before notifying the userspace
> >> UFFD-monitor, and therefore there is no guarantee as to whether the order of
> >> the events actually reflects the order in which the events took place.
> >> This can prevent a UFFD-monitor from using the events to track which
> >> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
> >> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
> >> be reordered, if the events are triggered by two different threads. In
> >> this case the UFFD-monitor cannot figure from the events whether the
> >> child process has the unmapped memory range still mapped (because fork
> >> happened first) or not.
> > 
> > Yeah, it seems that something like this is possible:
> > 
> > 
> > fork()					munmap()
> > 	mmap_write_unlock();
> > 						mmap_write_lock_killable();
> > 						do_things();
> > 						mmap_{read,write}_unlock();
> > 						userfaultfd_unmap_complete();
> > 	dup_userfaultfd_complete();
> > 
> 
> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE
> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in
> read mode. But not sure if they apply.

The userspace can live with these, at least for uffd missing page faults.
If the monitor will try to resolve a page fault for a removed area, the
errno from UFFDIO_COPY/ZERO can be used to detect such races.

> The fork() vs. munmap() is somewhat "obviously problematic" :)

Nothing funny about it ;-)

> -- 
> Thanks,
> 
> David / dhildenb
> 
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 14:05     ` Mike Rapoport
@ 2022-01-31 14:12       ` David Hildenbrand
  2022-01-31 14:28         ` Mike Rapoport
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2022-01-31 14:12 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

On 31.01.22 15:05, Mike Rapoport wrote:
> On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote:
>> On 31.01.22 11:42, Mike Rapoport wrote:
>>> Hi Nadav,
>>>
>>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
>>>> Using userfautlfd and looking at the kernel code, I encountered a usability
>>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly
>>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a
>>>> userspace workaround, but I thought it is worthy to share and to hear your
>>>> opinion, as well as feedback from other UFFD users.
>>>>
>>>> The issue I encountered regards the ordering of UFFD events tbat might not
>>>> reflect the actual order in which events took place.
>>>>
>>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
>>>> themselves [*]. The mm-lock is dropped before notifying the userspace
>>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
>>>> the events actually reflects the order in which the events took place.
>>>> This can prevent a UFFD-monitor from using the events to track which
>>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
>>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
>>>> be reordered, if the events are triggered by two different threads. In
>>>> this case the UFFD-monitor cannot figure from the events whether the
>>>> child process has the unmapped memory range still mapped (because fork
>>>> happened first) or not.
>>>
>>> Yeah, it seems that something like this is possible:
>>>
>>>
>>> fork()					munmap()
>>> 	mmap_write_unlock();
>>> 						mmap_write_lock_killable();
>>> 						do_things();
>>> 						mmap_{read,write}_unlock();
>>> 						userfaultfd_unmap_complete();
>>> 	dup_userfaultfd_complete();
>>>
>>
>> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE
>> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in
>> read mode. But not sure if they apply.
> 
> The userspace can live with these, at least for uffd missing page faults.
> If the monitor will try to resolve a page fault for a removed area, the
> errno from UFFDIO_COPY/ZERO can be used to detect such races.

I was wondering if the monitor could get confused if he just resolved a
page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 14:12       ` David Hildenbrand
@ 2022-01-31 14:28         ` Mike Rapoport
  2022-01-31 14:41           ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Rapoport @ 2022-01-31 14:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote:
> On 31.01.22 15:05, Mike Rapoport wrote:
> > On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote:
> >> On 31.01.22 11:42, Mike Rapoport wrote:
> >>> Hi Nadav,
> >>>
> >>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
> >>>> Using userfautlfd and looking at the kernel code, I encountered a usability
> >>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly
> >>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a
> >>>> userspace workaround, but I thought it is worthy to share and to hear your
> >>>> opinion, as well as feedback from other UFFD users.
> >>>>
> >>>> The issue I encountered regards the ordering of UFFD events tbat might not
> >>>> reflect the actual order in which events took place.
> >>>>
> >>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
> >>>> themselves [*]. The mm-lock is dropped before notifying the userspace
> >>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
> >>>> the events actually reflects the order in which the events took place.
> >>>> This can prevent a UFFD-monitor from using the events to track which
> >>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
> >>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
> >>>> be reordered, if the events are triggered by two different threads. In
> >>>> this case the UFFD-monitor cannot figure from the events whether the
> >>>> child process has the unmapped memory range still mapped (because fork
> >>>> happened first) or not.
> >>>
> >>> Yeah, it seems that something like this is possible:
> >>>
> >>>
> >>> fork()					munmap()
> >>> 	mmap_write_unlock();
> >>> 						mmap_write_lock_killable();
> >>> 						do_things();
> >>> 						mmap_{read,write}_unlock();
> >>> 						userfaultfd_unmap_complete();
> >>> 	dup_userfaultfd_complete();
> >>>
> >>
> >> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE
> >> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in
> >> read mode. But not sure if they apply.
> > 
> > The userspace can live with these, at least for uffd missing page faults.
> > If the monitor will try to resolve a page fault for a removed area, the
> > errno from UFFDIO_COPY/ZERO can be used to detect such races.
> 
> I was wondering if the monitor could get confused if he just resolved a
> page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event.

And why would it be confused?
 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 14:28         ` Mike Rapoport
@ 2022-01-31 14:41           ` David Hildenbrand
  2022-01-31 18:47             ` Mike Rapoport
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2022-01-31 14:41 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

On 31.01.22 15:28, Mike Rapoport wrote:
> On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote:
>> On 31.01.22 15:05, Mike Rapoport wrote:
>>> On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote:
>>>> On 31.01.22 11:42, Mike Rapoport wrote:
>>>>> Hi Nadav,
>>>>>
>>>>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
>>>>>> Using userfautlfd and looking at the kernel code, I encountered a usability
>>>>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly
>>>>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a
>>>>>> userspace workaround, but I thought it is worthy to share and to hear your
>>>>>> opinion, as well as feedback from other UFFD users.
>>>>>>
>>>>>> The issue I encountered regards the ordering of UFFD events tbat might not
>>>>>> reflect the actual order in which events took place.
>>>>>>
>>>>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
>>>>>> themselves [*]. The mm-lock is dropped before notifying the userspace
>>>>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
>>>>>> the events actually reflects the order in which the events took place.
>>>>>> This can prevent a UFFD-monitor from using the events to track which
>>>>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
>>>>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
>>>>>> be reordered, if the events are triggered by two different threads. In
>>>>>> this case the UFFD-monitor cannot figure from the events whether the
>>>>>> child process has the unmapped memory range still mapped (because fork
>>>>>> happened first) or not.
>>>>>
>>>>> Yeah, it seems that something like this is possible:
>>>>>
>>>>>
>>>>> fork()					munmap()
>>>>> 	mmap_write_unlock();
>>>>> 						mmap_write_lock_killable();
>>>>> 						do_things();
>>>>> 						mmap_{read,write}_unlock();
>>>>> 						userfaultfd_unmap_complete();
>>>>> 	dup_userfaultfd_complete();
>>>>>
>>>>
>>>> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE
>>>> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in
>>>> read mode. But not sure if they apply.
>>>
>>> The userspace can live with these, at least for uffd missing page faults.
>>> If the monitor will try to resolve a page fault for a removed area, the
>>> errno from UFFDIO_COPY/ZERO can be used to detect such races.
>>
>> I was wondering if the monitor could get confused if he just resolved a
>> page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event.
> 
> And why would it be confused?

My thinking was that the monitor might use REMOVE events to track which
pages are actually populated. If you receive REMOVE after
UFFDIO_COPY/ZERO the monitor would conclude that the page is not
populated, just like if we'd get the MADV_DONTNEED/MADV_REMOVE
immediately after placing a page.

Of course, it heavily depends on the target use case in the monitor or I
might just be wrong.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 10:42 ` Mike Rapoport
  2022-01-31 10:48   ` David Hildenbrand
@ 2022-01-31 17:23   ` Nadav Amit
  2022-01-31 17:28     ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2022-01-31 17:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM


> On Jan 31, 2022, at 2:42 AM, Mike Rapoport <rppt@kernel.org> wrote:
> 
> Hi Nadav,
> 
> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
>> Using userfautlfd and looking at the kernel code, I encountered a usability
>> issue that complicates userspace UFFD-monitor implementation. I obviosuly
>> might be wrong, so I would appreciate a (polite?) feedback. I do have a
>> userspace workaround, but I thought it is worthy to share and to hear your
>> opinion, as well as feedback from other UFFD users.
>> 
>> The issue I encountered regards the ordering of UFFD events tbat might not
>> reflect the actual order in which events took place.
>> 
>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
>> themselves [*]. The mm-lock is dropped before notifying the userspace
>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
>> the events actually reflects the order in which the events took place.
>> This can prevent a UFFD-monitor from using the events to track which
>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
>> be reordered, if the events are triggered by two different threads. In
>> this case the UFFD-monitor cannot figure from the events whether the
>> child process has the unmapped memory range still mapped (because fork
>> happened first) or not.
> 
> Yeah, it seems that something like this is possible:
> 
> 
> fork()					munmap()
> 	mmap_write_unlock();
> 						mmap_write_lock_killable();
> 						do_things();
> 						mmap_{read,write}_unlock();
> 						userfaultfd_unmap_complete();
> 	dup_userfaultfd_complete();
> 
> A solution could be to split uffd_*_complete() to two parts: one that
> queues up the event message and the second one that waits for it to be read
> by the monitor. The first part then can run befor mm-lock is released.
> 
> If you can think of something nicer, it'll be really great!

Thanks for the quick response. Your solution is possible, but then the
order between events and page-faults is certainly not kept - as David
mentioned: regardless of mm-lock that is not always taken for write,
events and page-faults are on two separate lists, and queued page-faults
are reported before events.

I am also not sure how simple/performant it is, since it would require
an additional refcount for userfaultfd_wait_queue to prevent it from
disappearing between the time it is enqueued to the time it blocks.

Another option is to associate some “generation” or “sequence number”
with every event and change the PAI to include it. It still leaves the
problem of ordering MADV_DONTNEED and page-faults though.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 17:23   ` Nadav Amit
@ 2022-01-31 17:28     ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2022-01-31 17:28 UTC (permalink / raw)
  To: Nadav Amit, Mike Rapoport
  Cc: Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

On 31.01.22 18:23, Nadav Amit wrote:
> 
>> On Jan 31, 2022, at 2:42 AM, Mike Rapoport <rppt@kernel.org> wrote:
>>
>> Hi Nadav,
>>
>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
>>> Using userfautlfd and looking at the kernel code, I encountered a usability
>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly
>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a
>>> userspace workaround, but I thought it is worthy to share and to hear your
>>> opinion, as well as feedback from other UFFD users.
>>>
>>> The issue I encountered regards the ordering of UFFD events tbat might not
>>> reflect the actual order in which events took place.
>>>
>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
>>> themselves [*]. The mm-lock is dropped before notifying the userspace
>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
>>> the events actually reflects the order in which the events took place.
>>> This can prevent a UFFD-monitor from using the events to track which
>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
>>> be reordered, if the events are triggered by two different threads. In
>>> this case the UFFD-monitor cannot figure from the events whether the
>>> child process has the unmapped memory range still mapped (because fork
>>> happened first) or not.
>>
>> Yeah, it seems that something like this is possible:
>>
>>
>> fork()					munmap()
>> 	mmap_write_unlock();
>> 						mmap_write_lock_killable();
>> 						do_things();
>> 						mmap_{read,write}_unlock();
>> 						userfaultfd_unmap_complete();
>> 	dup_userfaultfd_complete();
>>
>> A solution could be to split uffd_*_complete() to two parts: one that
>> queues up the event message and the second one that waits for it to be read
>> by the monitor. The first part then can run befor mm-lock is released.
>>
>> If you can think of something nicer, it'll be really great!
> 
> Thanks for the quick response. Your solution is possible, but then the
> order between events and page-faults is certainly not kept - as David
> mentioned: regardless of mm-lock that is not always taken for write,
> events and page-faults are on two separate lists, and queued page-faults
> are reported before events.

Of course, for the issue I brought up (if it's a real issue), the
question is if we could "adjust the documentation" to state that there
are no ordering guarantees. IMHO at least the fork()+munmap() needs a
proper fix, because otherwise, we might really end up with an API that's
partially useless -- as you correctly state.

> 
> I am also not sure how simple/performant it is, since it would require
> an additional refcount for userfaultfd_wait_queue to prevent it from
> disappearing between the time it is enqueued to the time it blocks.
> 
> Another option is to associate some “generation” or “sequence number”
> with every event and change the PAI to include it. It still leaves the
> problem of ordering MADV_DONTNEED and page-faults though.
> 

My first thought was to include a timestamp. But requiring user space to
restore the order based on a timestamp might be really ... weird.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 14:41           ` David Hildenbrand
@ 2022-01-31 18:47             ` Mike Rapoport
  2022-01-31 22:39               ` Nadav Amit
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Rapoport @ 2022-01-31 18:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

On Mon, Jan 31, 2022 at 03:41:05PM +0100, David Hildenbrand wrote:
> On 31.01.22 15:28, Mike Rapoport wrote:
> > On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote:
> >> On 31.01.22 15:05, Mike Rapoport wrote:
> >>> On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote:
> >>>> On 31.01.22 11:42, Mike Rapoport wrote:
> >>>>> Hi Nadav,
> >>>>>
> >>>>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
> >>>>>> Using userfautlfd and looking at the kernel code, I encountered a usability
> >>>>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly
> >>>>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a
> >>>>>> userspace workaround, but I thought it is worthy to share and to hear your
> >>>>>> opinion, as well as feedback from other UFFD users.
> >>>>>>
> >>>>>> The issue I encountered regards the ordering of UFFD events tbat might not
> >>>>>> reflect the actual order in which events took place.
> >>>>>>
> >>>>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
> >>>>>> themselves [*]. The mm-lock is dropped before notifying the userspace
> >>>>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
> >>>>>> the events actually reflects the order in which the events took place.
> >>>>>> This can prevent a UFFD-monitor from using the events to track which
> >>>>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
> >>>>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
> >>>>>> be reordered, if the events are triggered by two different threads. In
> >>>>>> this case the UFFD-monitor cannot figure from the events whether the
> >>>>>> child process has the unmapped memory range still mapped (because fork
> >>>>>> happened first) or not.
> >>>>>
> >>>>> Yeah, it seems that something like this is possible:
> >>>>>
> >>>>>
> >>>>> fork()					munmap()
> >>>>> 	mmap_write_unlock();
> >>>>> 						mmap_write_lock_killable();
> >>>>> 						do_things();
> >>>>> 						mmap_{read,write}_unlock();
> >>>>> 						userfaultfd_unmap_complete();
> >>>>> 	dup_userfaultfd_complete();
> >>>>>
> >>>>
> >>>> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE
> >>>> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in
> >>>> read mode. But not sure if they apply.
> >>>
> >>> The userspace can live with these, at least for uffd missing page faults.
> >>> If the monitor will try to resolve a page fault for a removed area, the
> >>> errno from UFFDIO_COPY/ZERO can be used to detect such races.
> >>
> >> I was wondering if the monitor could get confused if he just resolved a
> >> page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event.
> > 
> > And why would it be confused?
> 
> My thinking was that the monitor might use REMOVE events to track which
> pages are actually populated. If you receive REMOVE after
> UFFDIO_COPY/ZERO the monitor would conclude that the page is not
> populated, just like if we'd get the MADV_DONTNEED/MADV_REMOVE
> immediately after placing a page.

I still don't follow your usecase.

In CRIU we simply discard whatever content we had to fill when there is
REMOVE event. If a page fault occurs in that region we use UFFDIO_ZEROPAGE,
just as it would happen in "normal" page fault processing 
(note, CRIU does not support uffd with hugetlb or shmem)
 
> Of course, it heavily depends on the target use case in the monitor or I
> might just be wrong.
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 18:47             ` Mike Rapoport
@ 2022-01-31 22:39               ` Nadav Amit
  2022-02-01  9:10                 ` Mike Rapoport
  2022-02-10  7:48                 ` Peter Xu
  0 siblings, 2 replies; 18+ messages in thread
From: Nadav Amit @ 2022-01-31 22:39 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM



> On Jan 31, 2022, at 10:47 AM, Mike Rapoport <rppt@kernel.org> wrote:
> 
> On Mon, Jan 31, 2022 at 03:41:05PM +0100, David Hildenbrand wrote:
>> On 31.01.22 15:28, Mike Rapoport wrote:
>>> On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote:
>>>> On 31.01.22 15:05, Mike Rapoport wrote:
>>>>> On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote:
>>>>>> On 31.01.22 11:42, Mike Rapoport wrote:
>>>>>>> Hi Nadav,
>>>>>>> 
>>>>>>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote:
>>>>>>>> Using userfautlfd and looking at the kernel code, I encountered a usability
>>>>>>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly
>>>>>>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a
>>>>>>>> userspace workaround, but I thought it is worthy to share and to hear your
>>>>>>>> opinion, as well as feedback from other UFFD users.
>>>>>>>> 
>>>>>>>> The issue I encountered regards the ordering of UFFD events tbat might not
>>>>>>>> reflect the actual order in which events took place.
>>>>>>>> 
>>>>>>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against
>>>>>>>> themselves [*]. The mm-lock is dropped before notifying the userspace
>>>>>>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of
>>>>>>>> the events actually reflects the order in which the events took place.
>>>>>>>> This can prevent a UFFD-monitor from using the events to track which
>>>>>>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a
>>>>>>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can
>>>>>>>> be reordered, if the events are triggered by two different threads. In
>>>>>>>> this case the UFFD-monitor cannot figure from the events whether the
>>>>>>>> child process has the unmapped memory range still mapped (because fork
>>>>>>>> happened first) or not.
>>>>>>> 
>>>>>>> Yeah, it seems that something like this is possible:
>>>>>>> 
>>>>>>> 
>>>>>>> fork()					munmap()
>>>>>>> 	mmap_write_unlock();
>>>>>>> 						mmap_write_lock_killable();
>>>>>>> 						do_things();
>>>>>>> 						mmap_{read,write}_unlock();
>>>>>>> 						userfaultfd_unmap_complete();
>>>>>>> 	dup_userfaultfd_complete();
>>>>>>> 
>>>>>> 
>>>>>> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE
>>>>>> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in
>>>>>> read mode. But not sure if they apply.
>>>>> 
>>>>> The userspace can live with these, at least for uffd missing page faults.
>>>>> If the monitor will try to resolve a page fault for a removed area, the
>>>>> errno from UFFDIO_COPY/ZERO can be used to detect such races.
>>>> 
>>>> I was wondering if the monitor could get confused if he just resolved a
>>>> page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event.
>>> 
>>> And why would it be confused?
>> 
>> My thinking was that the monitor might use REMOVE events to track which
>> pages are actually populated. If you receive REMOVE after
>> UFFDIO_COPY/ZERO the monitor would conclude that the page is not
>> populated, just like if we'd get the MADV_DONTNEED/MADV_REMOVE
>> immediately after placing a page.
> 
> I still don't follow your usecase.
> 
> In CRIU we simply discard whatever content we had to fill when there is
> REMOVE event. If a page fault occurs in that region we use UFFDIO_ZEROPAGE,
> just as it would happen in "normal" page fault processing 
> (note, CRIU does not support uffd with hugetlb or shmem)

I think that the point that David makes is valid.

There are use-cases in which you do need to know the order between
user-initiated MADV_DONTNEED and page-faults. For instance, if you
build a userspace paging mechanism, you need to know whether the
page content is zero or whatever is held in the disk.

I presume mmap_changing was designed for a similar purpose, assuming
that if you had a page-fault that started before MADV_DONTNEED, and
you try to serve it using copy-ioctl, the copy would fail. I think
that this works only if you assume that there is a single UFFD
monitor thread (that reads the uffd and issues appropriate ioctl’s),
and that all operations are performed synchronously (which I am
trying to avoid using io-uring).

Otherwise, a copy ioctl that is initiated before MADV_DONTNEED
(to resolve page-fault) can take place after the userspace was 
already notified of UFFD_EVENT_REMOVE (i.e., mmap_changing==0),
and there is no way to cancel the copy that was initiated. As
a result, following MADV_DONTNEED, the memory would not be zeroed.

As for me, I decided that due to the lack of ordering, I just
cannot use the UFFD events, and I have to rely on ptrace to obtain
order of these events. I might be wrong, but any solution is not
trivial and is likely to require API changes.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 22:39               ` Nadav Amit
@ 2022-02-01  9:10                 ` Mike Rapoport
  2022-02-10  7:48                 ` Peter Xu
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Rapoport @ 2022-02-01  9:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM

On Mon, Jan 31, 2022 at 02:39:01PM -0800, Nadav Amit wrote:
> > On Jan 31, 2022, at 10:47 AM, Mike Rapoport <rppt@kernel.org> wrote:
> > 
> > On Mon, Jan 31, 2022 at 03:41:05PM +0100, David Hildenbrand wrote:
> 
> As for me, I decided that due to the lack of ordering, I just
> cannot use the UFFD events, and I have to rely on ptrace to obtain
> order of these events. I might be wrong, but any solution is not
> trivial and is likely to require API changes.

I think you are right and any solution that will allow userspace to track
the order of all events (page faults and address space changes) will
require API changes. 

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-01-31 22:39               ` Nadav Amit
  2022-02-01  9:10                 ` Mike Rapoport
@ 2022-02-10  7:48                 ` Peter Xu
  2022-02-10 18:42                   ` Nadav Amit
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Xu @ 2022-02-10  7:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport,
	Andrea Arcangeli, Linux-MM

Hi, Nadav & all,

On Mon, Jan 31, 2022 at 02:39:01PM -0800, Nadav Amit wrote:
> There are use-cases in which you do need to know the order between
> user-initiated MADV_DONTNEED and page-faults. For instance, if you
> build a userspace paging mechanism, you need to know whether the
> page content is zero or whatever is held in the disk.

When there's no uffd monitor, concurrent page faults with MADV_DONTNEED can
already result in undefined behavior, right? Assuming the page fault is a write
with non-zero data, then the page can either contain zero or non-zero data
at last, IIUC.

If above is true, I'm wondering whether it's already impossible to do it right
when there is an uffd monitor?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-02-10  7:48                 ` Peter Xu
@ 2022-02-10 18:42                   ` Nadav Amit
  2022-02-14  4:02                     ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Amit @ 2022-02-10 18:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport,
	Andrea Arcangeli, Linux-MM

> On Feb 9, 2022, at 11:48 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> Hi, Nadav & all,
> 
> On Mon, Jan 31, 2022 at 02:39:01PM -0800, Nadav Amit wrote:
>> There are use-cases in which you do need to know the order between
>> user-initiated MADV_DONTNEED and page-faults. For instance, if you
>> build a userspace paging mechanism, you need to know whether the
>> page content is zero or whatever is held in the disk.
> 
> When there's no uffd monitor, concurrent page faults with MADV_DONTNEED can
> already result in undefined behavior, right? Assuming the page fault is a write
> with non-zero data, then the page can either contain zero or non-zero data
> at last, IIUC.
> 
> If above is true, I'm wondering whether it's already impossible to do it right
> when there is an uffd monitor?

I think that the MADV_DONTNEED/PF-resolution “race" only affects usage-models
that handle the page-fault concurrently with UFFD-monitoring (using multiple
monitor threads or IO-uring as I try to do). At least for use-cases such as
live-migration.

I think the scenario you have in mind is the following, which is resolved
with mmap_changing that Mike introduced some time ago:

  UFFD monitor		App thread #0		App thread #1
  ------------		-------------		-------------
  						#PF
  UFFD Read
   [#PF]						
			MADV_DONTNEED
			 mmap_changing = 1

			 userfaultfd_event_wait_completion()
			  [queue event, wait]  
  UFFD-copy
   -EAGAIN since mmmap_changing > 0

mmap_changing will keep being elevated, and UFFD-copy not served (fail) until
the monitor reads the UFFD event. The monitor, in this scenario, is single
threaded and therefore orders UFFD-read and UFFD-copy, preventing them from
racing.

Assuming the monitor is smart enough to reevaluate the course of action after
MADV_DONTNEED is handled, it should be safe. Personally, I do not like the
burden that this scheme puts on the monitor, the fact it needs to retry or
even the return value [I think it should be EBUSY since immediate retry would
fail. With IO-uring, EAGAIN triggers an immediate retry, which is useless.]
Yet, concurrent UFFD-event/#PF can be handled properly by a smart monitor.

*However*, userfaultfd events seem as very hard to use (to say the least) in
the following cases:

1. The UFFD-copy is issued by one thread and the UFFD-read is performed by
   another. For me this is the most painful even if you may consider it
   as “unorthodox”. It is very useful for performance, especially if the
   UFFD-copy is large.

2. If the race is between 2 userfaultfd *events*. The events might not be
   properly ordered (i.e., the order in which they are read does not reflect
   the order in which they occurred) despite the use of *_userfaultfd_prep(),
   since they are only queued (to be reported and trigger wake) by
   userfaultfd_event_wait_completion(), after the VMA and PTEs were updated
   and more importantly after mmap-lock was dropped.

   This means that if you have fork and MADV_DONTNEED, the monitor might see
   their order inverted, and won’t be able to know whether the child has the
   pages zapped or not.

   Other races are possible too, for instance between mremap() and munmap().
   In most cases the monitor might be able (with quite some work) to
   figure out that the order of the events it received does not make sense
   and the events must have been reordered. Yet, implementing something like
   that is far from trivial and there are some cases that are probably
   impossible to resolve just based on the UFFD read events.

I personally addressed this issue with seccomp+ptrace to trap on
entry/exit to relevant syscalls (e.g., munmap, mmap, fork), and
prevent concurrent calls to obtain correct order of the events. It is far
from trivial and introduces some overheads, but I did not find a better
solution.

Let me know if I am missing anything.

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-02-10 18:42                   ` Nadav Amit
@ 2022-02-14  4:02                     ` Peter Xu
  2022-02-15 22:35                       ` Nadav Amit
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2022-02-14  4:02 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport,
	Andrea Arcangeli, Linux-MM

Nadav,

On Thu, Feb 10, 2022 at 10:42:43AM -0800, Nadav Amit wrote:
> I think that the MADV_DONTNEED/PF-resolution “race" only affects usage-models
> that handle the page-fault concurrently with UFFD-monitoring (using multiple
> monitor threads or IO-uring as I try to do). At least for use-cases such as
> live-migration.
> 
> I think the scenario you have in mind is the following, which is resolved
> with mmap_changing that Mike introduced some time ago:
> 
>   UFFD monitor		App thread #0		App thread #1
>   ------------		-------------		-------------
>   						#PF
>   UFFD Read
>    [#PF]						
> 			MADV_DONTNEED
> 			 mmap_changing = 1
> 
> 			 userfaultfd_event_wait_completion()
> 			  [queue event, wait]  
>   UFFD-copy
>    -EAGAIN since mmmap_changing > 0
> 
> 
> mmap_changing will keep being elevated, and UFFD-copy not served (fail) until
> the monitor reads the UFFD event. The monitor, in this scenario, is single
> threaded and therefore orders UFFD-read and UFFD-copy, preventing them from
> racing.
> 
> Assuming the monitor is smart enough to reevaluate the course of action after
> MADV_DONTNEED is handled, it should be safe. Personally, I do not like the
> burden that this scheme puts on the monitor, the fact it needs to retry or
> even the return value [I think it should be EBUSY since immediate retry would
> fail. With IO-uring, EAGAIN triggers an immediate retry, which is useless.]
> Yet, concurrent UFFD-event/#PF can be handled properly by a smart monitor.
> 
> *However*, userfaultfd events seem as very hard to use (to say the least) in
> the following cases:
> 
> 1. The UFFD-copy is issued by one thread and the UFFD-read is performed by
>    another. For me this is the most painful even if you may consider it
>    as “unorthodox”. It is very useful for performance, especially if the
>    UFFD-copy is large.

This is definitely a valid use case for uffd, and IMHO that's a good base model
when the uffd app is performance critical.

> 
> 2. If the race is between 2 userfaultfd *events*. The events might not be
>    properly ordered (i.e., the order in which they are read does not reflect
>    the order in which they occurred) despite the use of *_userfaultfd_prep(),
>    since they are only queued (to be reported and trigger wake) by
>    userfaultfd_event_wait_completion(), after the VMA and PTEs were updated
>    and more importantly after mmap-lock was dropped.
> 
>    This means that if you have fork and MADV_DONTNEED, the monitor might see
>    their order inverted, and won’t be able to know whether the child has the
>    pages zapped or not.
> 
>    Other races are possible too, for instance between mremap() and munmap().
>    In most cases the monitor might be able (with quite some work) to
>    figure out that the order of the events it received does not make sense
>    and the events must have been reordered. Yet, implementing something like
>    that is far from trivial and there are some cases that are probably
>    impossible to resolve just based on the UFFD read events.
> 
> I personally addressed this issue with seccomp+ptrace to trap on
> entry/exit to relevant syscalls (e.g., munmap, mmap, fork), and
> prevent concurrent calls to obtain correct order of the events. It is far
> from trivial and introduces some overheads, but I did not find a better
> solution.

Thanks for explaining.

I also digged out the discussion threads between you and Mike and that's a good
one too summarizing the problems:

https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/

Scenario 4 is kind of special imho along all those, because that's the only one
that can be workarounded by user application by only copying pages one by one.
I know you were even leveraging iouring in your local tree, so that's probably
not a solution at all for you. But I'm just trying to start thinking without
that scenario for now.

Per my understanding, a major issue regarding the rest of the scenarios is
ordering of uffd messages may not match with how things are happening.  This
actually contains two problems.

First of all, mmap_sem is mostly held read for all page faults and most of the
mm changes except e.g. fork, then we can never serialize them.  Not to mention
uffd events releases mmap_sem within prep and completion.  Let's call it
problem 1.

The other problem 2 is we can never serialize faults against events.

For problem 1, I do sense something that mmap_sem is just not suitable for uffd
scenario. Say, we grant concurrent with most of the events like dontneed and
mremap, but when uffd ordering is a concern we may not want to grant that
concurrency.  I'm wondering whether it means uffd may need its own semaphore to
achieve this.  So for all events that uffd cares we take write lock on a new
uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
events, not until completion (the message is read).  It'll slow down uffd
tracked systems but guarantees ordering.

At the meantime, I'm wildly thinking whether we can tackle with the other
problem by merging the page fault queue with the event queue, aka, event_wqh
and fault_pending_wqh.  Obviously we'll need to identify the messages when
read() and conditionally move then into fault_wqh only if they come from page
faults, but that seems doable?

Not sure above makes any sense, as I could have missed something.  Meanwhile I
think even if we order all the messages to match with facts there're still some
other issues that are outliers of this, but let's see how it sounds so far.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-02-14  4:02                     ` Peter Xu
@ 2022-02-15 22:35                       ` Nadav Amit
  2022-02-16  8:27                         ` Peter Xu
  2022-02-17 21:15                         ` Mike Rapoport
  0 siblings, 2 replies; 18+ messages in thread
From: Nadav Amit @ 2022-02-15 22:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport,
	Andrea Arcangeli, Linux-MM

> On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> Thanks for explaining.
> 
> I also digged out the discussion threads between you and Mike and that's a good
> one too summarizing the problems:
> 
> https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> 
> Scenario 4 is kind of special imho along all those, because that's the only one
> that can be workarounded by user application by only copying pages one by one.
> I know you were even leveraging iouring in your local tree, so that's probably
> not a solution at all for you. But I'm just trying to start thinking without
> that scenario for now.
> 
> Per my understanding, a major issue regarding the rest of the scenarios is
> ordering of uffd messages may not match with how things are happening.  This
> actually contains two problems.
> 
> First of all, mmap_sem is mostly held read for all page faults and most of the
> mm changes except e.g. fork, then we can never serialize them.  Not to mention
> uffd events releases mmap_sem within prep and completion.  Let's call it
> problem 1.
> 
> The other problem 2 is we can never serialize faults against events.
> 
> For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> scenario. Say, we grant concurrent with most of the events like dontneed and
> mremap, but when uffd ordering is a concern we may not want to grant that
> concurrency.  I'm wondering whether it means uffd may need its own semaphore to
> achieve this.  So for all events that uffd cares we take write lock on a new
> uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> events, not until completion (the message is read).  It'll slow down uffd
> tracked systems but guarantees ordering.

Peter,

Thanks for finding the time and looking into the issues that I encountered.

Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
to manage its own memory).

> 
> At the meantime, I'm wildly thinking whether we can tackle with the other
> problem by merging the page fault queue with the event queue, aka, event_wqh
> and fault_pending_wqh.  Obviously we'll need to identify the messages when
> read() and conditionally move then into fault_wqh only if they come from page
> faults, but that seems doable?

This, I guess is necessary in addition to your aforementioned proposal to have
some semaphore protecting, can do the trick.

While I got your attention, let me share some other challenges I encountered
using userfaultfd. They might be unrelated, but perhaps you can keep them in
the back of your mind. Nobody should suffer as I did ;-)

1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
it should be, especially when using io-uring as I wish to do.

I think it is not too hard to address by changing the API. For instance, if
uffd-ctx had a uffd-generation that would increase on each event, the user
could have provided an ioctl-generation as part of copy/zero/etc ioctls, and
the kernel would only fail the operation if ioctl copy/zero/etc operation
only succeeds if the uffd-generation is lower/equal than the one provided by
the user. 

2. userfaultfd is separated from other tracing/instrumentation mechanisms in
the kernel. I, for instance, also wanted to track mmap events (let’s put
aside for a second why). Tracking these events can be done with ptrace or
perf_event_open() but then it is hard to correlate these events with
userfaultfd. It would have been easier for users, I think, if userfaultfd
notifications were provided through ptrace/tracepoints mechanisms as well.

3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
concurrently. This seems as a general problem that I believe ptrace suffers
from too. I know it might seem far-fetched to have 2 monitors at the moment,
but I think that any tracking/instrumentation mechanism (e.g., ptrace,
software-dirty, not to mention hardware virtualization) should be designed
from the beginning with such support as adding it in a later stage can be
tricky.

4. Missing state. It would be useful to provide the TID of the faulting
thread. I will send a patch for this one once I get the necessary
internal approvals.

Thanks again,
Nadav

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-02-15 22:35                       ` Nadav Amit
@ 2022-02-16  8:27                         ` Peter Xu
  2022-02-17 21:15                         ` Mike Rapoport
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Xu @ 2022-02-16  8:27 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport,
	Andrea Arcangeli, Linux-MM

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
> 
> 
> > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > Thanks for explaining.
> > 
> > I also digged out the discussion threads between you and Mike and that's a good
> > one too summarizing the problems:
> > 
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> > 
> > Scenario 4 is kind of special imho along all those, because that's the only one
> > that can be workarounded by user application by only copying pages one by one.
> > I know you were even leveraging iouring in your local tree, so that's probably
> > not a solution at all for you. But I'm just trying to start thinking without
> > that scenario for now.
> > 
> > Per my understanding, a major issue regarding the rest of the scenarios is
> > ordering of uffd messages may not match with how things are happening.  This
> > actually contains two problems.
> > 
> > First of all, mmap_sem is mostly held read for all page faults and most of the
> > mm changes except e.g. fork, then we can never serialize them.  Not to mention
> > uffd events releases mmap_sem within prep and completion.  Let's call it
> > problem 1.
> > 
> > The other problem 2 is we can never serialize faults against events.
> > 
> > For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> > scenario. Say, we grant concurrent with most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency.  I'm wondering whether it means uffd may need its own semaphore to
> > achieve this.  So for all events that uffd cares we take write lock on a new
> > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> > events, not until completion (the message is read).  It'll slow down uffd
> > tracked systems but guarantees ordering.
> 
> Peter,
> 
> Thanks for finding the time and looking into the issues that I encountered.
> 
> Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).

Right, it's unsafe if to be taken after mmap_sem.  If to do so IIUC we need to
take it before mmap_sem hence we can release mmap_sem under it.

In my mind that could be a feature bit UFFD_FEATURE_STRICT_ORDERING, when it's
set then the mm bound to the userfaultfd file will have a flag set within the
mm->flags, let's say MMF_UFFD_STRICT_ORDER.

Then for uffd related syscalls like fork(), mremap() and so on we conditionally
take that uffd_sem and we need to do that before mmap_sem.  We take it write
for all the uffd event contexts, and take it read for all the uffd page faults.

But even if above would work again I have little confidence that it'll work in
reality. Firstly it does look odd already that an uffd lock needs to be taken
before the whole mm's, starting to affect common workloads even not using uffd
(even the flag lookup could affect cacheline, I think, but not sure how slower
it would be).  Not to mention that should greatly slow down the tracee process.
It definitely needs more thoughts anyway.

> 
> > 
> > At the meantime, I'm wildly thinking whether we can tackle with the other
> > problem by merging the page fault queue with the event queue, aka, event_wqh
> > and fault_pending_wqh.  Obviously we'll need to identify the messages when
> > read() and conditionally move then into fault_wqh only if they come from page
> > faults, but that seems doable?
> 
> This, I guess is necessary in addition to your aforementioned proposal to have
> some semaphore protecting, can do the trick.
> 
> While I got your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)

Heh.

> 
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
> 
> I think it is not too hard to address by changing the API. For instance, if
> uffd-ctx had a uffd-generation that would increase on each event, the user
> could have provided an ioctl-generation as part of copy/zero/etc ioctls, and
> the kernel would only fail the operation if ioctl copy/zero/etc operation
> only succeeds if the uffd-generation is lower/equal than the one provided by
> the user. 

Assuming that gen_id is copied over from the uffd message, and if that counter
only increases, then I don't understand why it can be lower than the user
provided.

I don't quite get how that solves your problem too, since -EAGAIN can still
trigger.  I must have missed something.

> 
> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let’s put
> aside for a second why). Tracking these events can be done with ptrace or
> perf_event_open() but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.
> 
> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems as a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support as adding it in a later stage can be
> tricky.

2) and 3) definitely need more thoughts..

PS: I think I first read your name from a paper on the nested virt. :-) But I
forgot the details.

> 
> 4. Missing state. It would be useful to provide the TID of the faulting
> thread. I will send a patch for this one once I get the necessary
> internal approvals.

Before I fully digest your reply and the problems, I want to make sure you are
aware of UFFD_FEATURE_THREAD_ID.. I don't know how you missed it, but it does
sound like what you wanted.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: userfaultfd: usability issue due to lack of UFFD events ordering
  2022-02-15 22:35                       ` Nadav Amit
  2022-02-16  8:27                         ` Peter Xu
@ 2022-02-17 21:15                         ` Mike Rapoport
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Rapoport @ 2022-02-17 21:15 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM

On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote:
> 
> 
> > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > Thanks for explaining.
> > 
> > I also digged out the discussion threads between you and Mike and that's a good
> > one too summarizing the problems:
> > 
> > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/
> > 
> > Scenario 4 is kind of special imho along all those, because that's the only one
> > that can be workarounded by user application by only copying pages one by one.
> > I know you were even leveraging iouring in your local tree, so that's probably
> > not a solution at all for you. But I'm just trying to start thinking without
> > that scenario for now.
> > 
> > Per my understanding, a major issue regarding the rest of the scenarios is
> > ordering of uffd messages may not match with how things are happening.  This
> > actually contains two problems.
> > 
> > First of all, mmap_sem is mostly held read for all page faults and most of the
> > mm changes except e.g. fork, then we can never serialize them.  Not to mention
> > uffd events releases mmap_sem within prep and completion.  Let's call it
> > problem 1.
> > 
> > The other problem 2 is we can never serialize faults against events.
> > 
> > For problem 1, I do sense something that mmap_sem is just not suitable for uffd
> > scenario. Say, we grant concurrent with most of the events like dontneed and
> > mremap, but when uffd ordering is a concern we may not want to grant that
> > concurrency.  I'm wondering whether it means uffd may need its own semaphore to
> > achieve this.  So for all events that uffd cares we take write lock on a new
> > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of
> > events, not until completion (the message is read).  It'll slow down uffd
> > tracked systems but guarantees ordering.
> 
> Peter,
> 
> Thanks for finding the time and looking into the issues that I encountered.
> 
> Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem
> after mmap_lock, since it might cause deadlocks (e.g., if a process uses events
> to manage its own memory).
> 
> > 
> > At the meantime, I'm wildly thinking whether we can tackle with the other
> > problem by merging the page fault queue with the event queue, aka, event_wqh
> > and fault_pending_wqh.  Obviously we'll need to identify the messages when
> > read() and conditionally move then into fault_wqh only if they come from page
> > faults, but that seems doable?
> 
> This, I guess is necessary in addition to your aforementioned proposal to have
> some semaphore protecting, can do the trick.
> 
> While I got your attention, let me share some other challenges I encountered
> using userfaultfd. They might be unrelated, but perhaps you can keep them in
> the back of your mind. Nobody should suffer as I did ;-)
> 
> 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than
> it should be, especially when using io-uring as I wish to do.
> 
> I think it is not too hard to address by changing the API. For instance, if
> uffd-ctx had a uffd-generation that would increase on each event, the user
> could have provided an ioctl-generation as part of copy/zero/etc ioctls, and
> the kernel would only fail the operation if ioctl copy/zero/etc operation
> only succeeds if the uffd-generation is lower/equal than the one provided by
> the user. 

Do you mean that if there were page faults with generations 1 and 3 and,
say, MADV_DONTNEED with generation 2, then even if the uffd copy that resolves
page fault 1 races with MADV_DONTNEED it will go through and the copy for
page fault 3 will fail?

But how would you order zapping the pages and copying into them internally?
Or may understanding of your idea was completely off?

As for technicality of adding a generation to uffd_msg and to
uffdio_{copy,zero,etc}, we can use __u32 reserved in the first one and 32
bits from mode in the second with a bit of care for wraparound.
 
> 2. userfaultfd is separated from other tracing/instrumentation mechanisms in
> the kernel. I, for instance, also wanted to track mmap events (let’s put
> aside for a second why). Tracking these events can be done with ptrace or
> perf_event_open() but then it is hard to correlate these events with
> userfaultfd. It would have been easier for users, I think, if userfaultfd
> notifications were provided through ptrace/tracepoints mechanisms as well.

This sounds like opening Pandora box ;-)

I think it's possible to trace userfaultfd events to some extent with a
probe at userfaultfd_event_wait_completion() entry and handle_userfault().
The "interesting" information is passed to these functions as parameters
and I believe all the data can be extracted with tools like bpftrace.
 
> 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd
> concurrently. This seems as a general problem that I believe ptrace suffers
> from too. I know it might seem far-fetched to have 2 monitors at the moment,
> but I think that any tracking/instrumentation mechanism (e.g., ptrace,
> software-dirty, not to mention hardware virtualization) should be designed
> from the beginning with such support as adding it in a later stage can be
> tricky.

It's not too far fetched to have nested userfaultfd contexts even now. If
CRIU would need to post-copy restore a process that uses userfaultfd it
will need to deal with nested uffds.
 
> Thanks again,
> Nadav

-- 
Sincerely yours,
Mike.


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-02-17 21:15 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-30  6:23 userfaultfd: usability issue due to lack of UFFD events ordering Nadav Amit
2022-01-31 10:42 ` Mike Rapoport
2022-01-31 10:48   ` David Hildenbrand
2022-01-31 14:05     ` Mike Rapoport
2022-01-31 14:12       ` David Hildenbrand
2022-01-31 14:28         ` Mike Rapoport
2022-01-31 14:41           ` David Hildenbrand
2022-01-31 18:47             ` Mike Rapoport
2022-01-31 22:39               ` Nadav Amit
2022-02-01  9:10                 ` Mike Rapoport
2022-02-10  7:48                 ` Peter Xu
2022-02-10 18:42                   ` Nadav Amit
2022-02-14  4:02                     ` Peter Xu
2022-02-15 22:35                       ` Nadav Amit
2022-02-16  8:27                         ` Peter Xu
2022-02-17 21:15                         ` Mike Rapoport
2022-01-31 17:23   ` Nadav Amit
2022-01-31 17:28     ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).