* userfaultfd: usability issue due to lack of UFFD events ordering @ 2022-01-30 6:23 Nadav Amit 2022-01-31 10:42 ` Mike Rapoport 0 siblings, 1 reply; 18+ messages in thread From: Nadav Amit @ 2022-01-30 6:23 UTC (permalink / raw) To: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu; +Cc: Linux-MM Using userfautlfd and looking at the kernel code, I encountered a usability issue that complicates userspace UFFD-monitor implementation. I obviosuly might be wrong, so I would appreciate a (polite?) feedback. I do have a userspace workaround, but I thought it is worthy to share and to hear your opinion, as well as feedback from other UFFD users. The issue I encountered regards the ordering of UFFD events tbat might not reflect the actual order in which events took place. In more detail, UFFD events (e.g., unmap, fork) are not ordered against themselves [*]. The mm-lock is dropped before notifying the userspace UFFD-monitor, and therefore there is no guarantee as to whether the order of the events actually reflects the order in which the events took place. This can prevent a UFFD-monitor from using the events to track which ranges are mapped. Specifically, UFFD_EVENT_FORK message and a UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can be reordered, if the events are triggered by two different threads. In this case the UFFD-monitor cannot figure from the events whether the child process has the unmapped memory range still mapped (because fork happened first) or not. Obviously, it does not make sense to keep holding mm-lock while notifying the user, as it can even lead to deadlocks. Userspace UFFD-monitors can workaround this issue by using seccomp+ptrace instead of UFFD-events to obtain order of the events or examine /proc/[pid]/smaps. Yet, this introduces overheads, is complicated, and I doubt anyone does so. I wonder if the API is reasonable, or whether I am missing something. Thanks, Nadav [*] Note that I do not discuss UFFD-monitor issued ioctl's, but the order between UFFD-events. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-30 6:23 userfaultfd: usability issue due to lack of UFFD events ordering Nadav Amit @ 2022-01-31 10:42 ` Mike Rapoport 2022-01-31 10:48 ` David Hildenbrand 2022-01-31 17:23 ` Nadav Amit 0 siblings, 2 replies; 18+ messages in thread From: Mike Rapoport @ 2022-01-31 10:42 UTC (permalink / raw) To: Nadav Amit Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM Hi Nadav, On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: > Using userfautlfd and looking at the kernel code, I encountered a usability > issue that complicates userspace UFFD-monitor implementation. I obviosuly > might be wrong, so I would appreciate a (polite?) feedback. I do have a > userspace workaround, but I thought it is worthy to share and to hear your > opinion, as well as feedback from other UFFD users. > > The issue I encountered regards the ordering of UFFD events tbat might not > reflect the actual order in which events took place. > > In more detail, UFFD events (e.g., unmap, fork) are not ordered against > themselves [*]. The mm-lock is dropped before notifying the userspace > UFFD-monitor, and therefore there is no guarantee as to whether the order of > the events actually reflects the order in which the events took place. > This can prevent a UFFD-monitor from using the events to track which > ranges are mapped. Specifically, UFFD_EVENT_FORK message and a > UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can > be reordered, if the events are triggered by two different threads. In > this case the UFFD-monitor cannot figure from the events whether the > child process has the unmapped memory range still mapped (because fork > happened first) or not. Yeah, it seems that something like this is possible: fork() munmap() mmap_write_unlock(); mmap_write_lock_killable(); do_things(); mmap_{read,write}_unlock(); userfaultfd_unmap_complete(); dup_userfaultfd_complete(); A solution could be to split uffd_*_complete() to two parts: one that queues up the event message and the second one that waits for it to be read by the monitor. The first part then can run befor mm-lock is released. If you can think of something nicer, it'll be really great! > Obviously, it does not make sense to keep holding mm-lock while notifying the > user, as it can even lead to deadlocks. Userspace UFFD-monitors can > workaround this issue by using seccomp+ptrace instead of UFFD-events to > obtain order of the events or examine /proc/[pid]/smaps. Yet, this introduces > overheads, is complicated, and I doubt anyone does so. I wonder if the API is > reasonable, or whether I am missing something. > > Thanks, > Nadav > > [*] Note that I do not discuss UFFD-monitor issued ioctl's, but the order > between UFFD-events. > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 10:42 ` Mike Rapoport @ 2022-01-31 10:48 ` David Hildenbrand 2022-01-31 14:05 ` Mike Rapoport 2022-01-31 17:23 ` Nadav Amit 1 sibling, 1 reply; 18+ messages in thread From: David Hildenbrand @ 2022-01-31 10:48 UTC (permalink / raw) To: Mike Rapoport, Nadav Amit Cc: Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM On 31.01.22 11:42, Mike Rapoport wrote: > Hi Nadav, > > On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: >> Using userfautlfd and looking at the kernel code, I encountered a usability >> issue that complicates userspace UFFD-monitor implementation. I obviosuly >> might be wrong, so I would appreciate a (polite?) feedback. I do have a >> userspace workaround, but I thought it is worthy to share and to hear your >> opinion, as well as feedback from other UFFD users. >> >> The issue I encountered regards the ordering of UFFD events tbat might not >> reflect the actual order in which events took place. >> >> In more detail, UFFD events (e.g., unmap, fork) are not ordered against >> themselves [*]. The mm-lock is dropped before notifying the userspace >> UFFD-monitor, and therefore there is no guarantee as to whether the order of >> the events actually reflects the order in which the events took place. >> This can prevent a UFFD-monitor from using the events to track which >> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a >> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can >> be reordered, if the events are triggered by two different threads. In >> this case the UFFD-monitor cannot figure from the events whether the >> child process has the unmapped memory range still mapped (because fork >> happened first) or not. > > Yeah, it seems that something like this is possible: > > > fork() munmap() > mmap_write_unlock(); > mmap_write_lock_killable(); > do_things(); > mmap_{read,write}_unlock(); > userfaultfd_unmap_complete(); > dup_userfaultfd_complete(); > I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in read mode. But not sure if they apply. The fork() vs. munmap() is somewhat "obviously problematic" :) -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 10:48 ` David Hildenbrand @ 2022-01-31 14:05 ` Mike Rapoport 2022-01-31 14:12 ` David Hildenbrand 0 siblings, 1 reply; 18+ messages in thread From: Mike Rapoport @ 2022-01-31 14:05 UTC (permalink / raw) To: David Hildenbrand Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote: > On 31.01.22 11:42, Mike Rapoport wrote: > > Hi Nadav, > > > > On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: > >> Using userfautlfd and looking at the kernel code, I encountered a usability > >> issue that complicates userspace UFFD-monitor implementation. I obviosuly > >> might be wrong, so I would appreciate a (polite?) feedback. I do have a > >> userspace workaround, but I thought it is worthy to share and to hear your > >> opinion, as well as feedback from other UFFD users. > >> > >> The issue I encountered regards the ordering of UFFD events tbat might not > >> reflect the actual order in which events took place. > >> > >> In more detail, UFFD events (e.g., unmap, fork) are not ordered against > >> themselves [*]. The mm-lock is dropped before notifying the userspace > >> UFFD-monitor, and therefore there is no guarantee as to whether the order of > >> the events actually reflects the order in which the events took place. > >> This can prevent a UFFD-monitor from using the events to track which > >> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a > >> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can > >> be reordered, if the events are triggered by two different threads. In > >> this case the UFFD-monitor cannot figure from the events whether the > >> child process has the unmapped memory range still mapped (because fork > >> happened first) or not. > > > > Yeah, it seems that something like this is possible: > > > > > > fork() munmap() > > mmap_write_unlock(); > > mmap_write_lock_killable(); > > do_things(); > > mmap_{read,write}_unlock(); > > userfaultfd_unmap_complete(); > > dup_userfaultfd_complete(); > > > > I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE > racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in > read mode. But not sure if they apply. The userspace can live with these, at least for uffd missing page faults. If the monitor will try to resolve a page fault for a removed area, the errno from UFFDIO_COPY/ZERO can be used to detect such races. > The fork() vs. munmap() is somewhat "obviously problematic" :) Nothing funny about it ;-) > -- > Thanks, > > David / dhildenb > > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 14:05 ` Mike Rapoport @ 2022-01-31 14:12 ` David Hildenbrand 2022-01-31 14:28 ` Mike Rapoport 0 siblings, 1 reply; 18+ messages in thread From: David Hildenbrand @ 2022-01-31 14:12 UTC (permalink / raw) To: Mike Rapoport Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM On 31.01.22 15:05, Mike Rapoport wrote: > On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote: >> On 31.01.22 11:42, Mike Rapoport wrote: >>> Hi Nadav, >>> >>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: >>>> Using userfautlfd and looking at the kernel code, I encountered a usability >>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly >>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a >>>> userspace workaround, but I thought it is worthy to share and to hear your >>>> opinion, as well as feedback from other UFFD users. >>>> >>>> The issue I encountered regards the ordering of UFFD events tbat might not >>>> reflect the actual order in which events took place. >>>> >>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against >>>> themselves [*]. The mm-lock is dropped before notifying the userspace >>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of >>>> the events actually reflects the order in which the events took place. >>>> This can prevent a UFFD-monitor from using the events to track which >>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a >>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can >>>> be reordered, if the events are triggered by two different threads. In >>>> this case the UFFD-monitor cannot figure from the events whether the >>>> child process has the unmapped memory range still mapped (because fork >>>> happened first) or not. >>> >>> Yeah, it seems that something like this is possible: >>> >>> >>> fork() munmap() >>> mmap_write_unlock(); >>> mmap_write_lock_killable(); >>> do_things(); >>> mmap_{read,write}_unlock(); >>> userfaultfd_unmap_complete(); >>> dup_userfaultfd_complete(); >>> >> >> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE >> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in >> read mode. But not sure if they apply. > > The userspace can live with these, at least for uffd missing page faults. > If the monitor will try to resolve a page fault for a removed area, the > errno from UFFDIO_COPY/ZERO can be used to detect such races. I was wondering if the monitor could get confused if he just resolved a page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 14:12 ` David Hildenbrand @ 2022-01-31 14:28 ` Mike Rapoport 2022-01-31 14:41 ` David Hildenbrand 0 siblings, 1 reply; 18+ messages in thread From: Mike Rapoport @ 2022-01-31 14:28 UTC (permalink / raw) To: David Hildenbrand Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote: > On 31.01.22 15:05, Mike Rapoport wrote: > > On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote: > >> On 31.01.22 11:42, Mike Rapoport wrote: > >>> Hi Nadav, > >>> > >>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: > >>>> Using userfautlfd and looking at the kernel code, I encountered a usability > >>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly > >>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a > >>>> userspace workaround, but I thought it is worthy to share and to hear your > >>>> opinion, as well as feedback from other UFFD users. > >>>> > >>>> The issue I encountered regards the ordering of UFFD events tbat might not > >>>> reflect the actual order in which events took place. > >>>> > >>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against > >>>> themselves [*]. The mm-lock is dropped before notifying the userspace > >>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of > >>>> the events actually reflects the order in which the events took place. > >>>> This can prevent a UFFD-monitor from using the events to track which > >>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a > >>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can > >>>> be reordered, if the events are triggered by two different threads. In > >>>> this case the UFFD-monitor cannot figure from the events whether the > >>>> child process has the unmapped memory range still mapped (because fork > >>>> happened first) or not. > >>> > >>> Yeah, it seems that something like this is possible: > >>> > >>> > >>> fork() munmap() > >>> mmap_write_unlock(); > >>> mmap_write_lock_killable(); > >>> do_things(); > >>> mmap_{read,write}_unlock(); > >>> userfaultfd_unmap_complete(); > >>> dup_userfaultfd_complete(); > >>> > >> > >> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE > >> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in > >> read mode. But not sure if they apply. > > > > The userspace can live with these, at least for uffd missing page faults. > > If the monitor will try to resolve a page fault for a removed area, the > > errno from UFFDIO_COPY/ZERO can be used to detect such races. > > I was wondering if the monitor could get confused if he just resolved a > page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event. And why would it be confused? -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 14:28 ` Mike Rapoport @ 2022-01-31 14:41 ` David Hildenbrand 2022-01-31 18:47 ` Mike Rapoport 0 siblings, 1 reply; 18+ messages in thread From: David Hildenbrand @ 2022-01-31 14:41 UTC (permalink / raw) To: Mike Rapoport Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM On 31.01.22 15:28, Mike Rapoport wrote: > On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote: >> On 31.01.22 15:05, Mike Rapoport wrote: >>> On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote: >>>> On 31.01.22 11:42, Mike Rapoport wrote: >>>>> Hi Nadav, >>>>> >>>>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: >>>>>> Using userfautlfd and looking at the kernel code, I encountered a usability >>>>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly >>>>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a >>>>>> userspace workaround, but I thought it is worthy to share and to hear your >>>>>> opinion, as well as feedback from other UFFD users. >>>>>> >>>>>> The issue I encountered regards the ordering of UFFD events tbat might not >>>>>> reflect the actual order in which events took place. >>>>>> >>>>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against >>>>>> themselves [*]. The mm-lock is dropped before notifying the userspace >>>>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of >>>>>> the events actually reflects the order in which the events took place. >>>>>> This can prevent a UFFD-monitor from using the events to track which >>>>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a >>>>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can >>>>>> be reordered, if the events are triggered by two different threads. In >>>>>> this case the UFFD-monitor cannot figure from the events whether the >>>>>> child process has the unmapped memory range still mapped (because fork >>>>>> happened first) or not. >>>>> >>>>> Yeah, it seems that something like this is possible: >>>>> >>>>> >>>>> fork() munmap() >>>>> mmap_write_unlock(); >>>>> mmap_write_lock_killable(); >>>>> do_things(); >>>>> mmap_{read,write}_unlock(); >>>>> userfaultfd_unmap_complete(); >>>>> dup_userfaultfd_complete(); >>>>> >>>> >>>> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE >>>> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in >>>> read mode. But not sure if they apply. >>> >>> The userspace can live with these, at least for uffd missing page faults. >>> If the monitor will try to resolve a page fault for a removed area, the >>> errno from UFFDIO_COPY/ZERO can be used to detect such races. >> >> I was wondering if the monitor could get confused if he just resolved a >> page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event. > > And why would it be confused? My thinking was that the monitor might use REMOVE events to track which pages are actually populated. If you receive REMOVE after UFFDIO_COPY/ZERO the monitor would conclude that the page is not populated, just like if we'd get the MADV_DONTNEED/MADV_REMOVE immediately after placing a page. Of course, it heavily depends on the target use case in the monitor or I might just be wrong. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 14:41 ` David Hildenbrand @ 2022-01-31 18:47 ` Mike Rapoport 2022-01-31 22:39 ` Nadav Amit 0 siblings, 1 reply; 18+ messages in thread From: Mike Rapoport @ 2022-01-31 18:47 UTC (permalink / raw) To: David Hildenbrand Cc: Nadav Amit, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM On Mon, Jan 31, 2022 at 03:41:05PM +0100, David Hildenbrand wrote: > On 31.01.22 15:28, Mike Rapoport wrote: > > On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote: > >> On 31.01.22 15:05, Mike Rapoport wrote: > >>> On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote: > >>>> On 31.01.22 11:42, Mike Rapoport wrote: > >>>>> Hi Nadav, > >>>>> > >>>>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: > >>>>>> Using userfautlfd and looking at the kernel code, I encountered a usability > >>>>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly > >>>>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a > >>>>>> userspace workaround, but I thought it is worthy to share and to hear your > >>>>>> opinion, as well as feedback from other UFFD users. > >>>>>> > >>>>>> The issue I encountered regards the ordering of UFFD events tbat might not > >>>>>> reflect the actual order in which events took place. > >>>>>> > >>>>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against > >>>>>> themselves [*]. The mm-lock is dropped before notifying the userspace > >>>>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of > >>>>>> the events actually reflects the order in which the events took place. > >>>>>> This can prevent a UFFD-monitor from using the events to track which > >>>>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a > >>>>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can > >>>>>> be reordered, if the events are triggered by two different threads. In > >>>>>> this case the UFFD-monitor cannot figure from the events whether the > >>>>>> child process has the unmapped memory range still mapped (because fork > >>>>>> happened first) or not. > >>>>> > >>>>> Yeah, it seems that something like this is possible: > >>>>> > >>>>> > >>>>> fork() munmap() > >>>>> mmap_write_unlock(); > >>>>> mmap_write_lock_killable(); > >>>>> do_things(); > >>>>> mmap_{read,write}_unlock(); > >>>>> userfaultfd_unmap_complete(); > >>>>> dup_userfaultfd_complete(); > >>>>> > >>>> > >>>> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE > >>>> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in > >>>> read mode. But not sure if they apply. > >>> > >>> The userspace can live with these, at least for uffd missing page faults. > >>> If the monitor will try to resolve a page fault for a removed area, the > >>> errno from UFFDIO_COPY/ZERO can be used to detect such races. > >> > >> I was wondering if the monitor could get confused if he just resolved a > >> page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event. > > > > And why would it be confused? > > My thinking was that the monitor might use REMOVE events to track which > pages are actually populated. If you receive REMOVE after > UFFDIO_COPY/ZERO the monitor would conclude that the page is not > populated, just like if we'd get the MADV_DONTNEED/MADV_REMOVE > immediately after placing a page. I still don't follow your usecase. In CRIU we simply discard whatever content we had to fill when there is REMOVE event. If a page fault occurs in that region we use UFFDIO_ZEROPAGE, just as it would happen in "normal" page fault processing (note, CRIU does not support uffd with hugetlb or shmem) > Of course, it heavily depends on the target use case in the monitor or I > might just be wrong. > > -- > Thanks, > > David / dhildenb > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 18:47 ` Mike Rapoport @ 2022-01-31 22:39 ` Nadav Amit 2022-02-01 9:10 ` Mike Rapoport 2022-02-10 7:48 ` Peter Xu 0 siblings, 2 replies; 18+ messages in thread From: Nadav Amit @ 2022-01-31 22:39 UTC (permalink / raw) To: Mike Rapoport Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM > On Jan 31, 2022, at 10:47 AM, Mike Rapoport <rppt@kernel.org> wrote: > > On Mon, Jan 31, 2022 at 03:41:05PM +0100, David Hildenbrand wrote: >> On 31.01.22 15:28, Mike Rapoport wrote: >>> On Mon, Jan 31, 2022 at 03:12:36PM +0100, David Hildenbrand wrote: >>>> On 31.01.22 15:05, Mike Rapoport wrote: >>>>> On Mon, Jan 31, 2022 at 11:48:27AM +0100, David Hildenbrand wrote: >>>>>> On 31.01.22 11:42, Mike Rapoport wrote: >>>>>>> Hi Nadav, >>>>>>> >>>>>>> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: >>>>>>>> Using userfautlfd and looking at the kernel code, I encountered a usability >>>>>>>> issue that complicates userspace UFFD-monitor implementation. I obviosuly >>>>>>>> might be wrong, so I would appreciate a (polite?) feedback. I do have a >>>>>>>> userspace workaround, but I thought it is worthy to share and to hear your >>>>>>>> opinion, as well as feedback from other UFFD users. >>>>>>>> >>>>>>>> The issue I encountered regards the ordering of UFFD events tbat might not >>>>>>>> reflect the actual order in which events took place. >>>>>>>> >>>>>>>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against >>>>>>>> themselves [*]. The mm-lock is dropped before notifying the userspace >>>>>>>> UFFD-monitor, and therefore there is no guarantee as to whether the order of >>>>>>>> the events actually reflects the order in which the events took place. >>>>>>>> This can prevent a UFFD-monitor from using the events to track which >>>>>>>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a >>>>>>>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can >>>>>>>> be reordered, if the events are triggered by two different threads. In >>>>>>>> this case the UFFD-monitor cannot figure from the events whether the >>>>>>>> child process has the unmapped memory range still mapped (because fork >>>>>>>> happened first) or not. >>>>>>> >>>>>>> Yeah, it seems that something like this is possible: >>>>>>> >>>>>>> >>>>>>> fork() munmap() >>>>>>> mmap_write_unlock(); >>>>>>> mmap_write_lock_killable(); >>>>>>> do_things(); >>>>>>> mmap_{read,write}_unlock(); >>>>>>> userfaultfd_unmap_complete(); >>>>>>> dup_userfaultfd_complete(); >>>>>>> >>>>>> >>>>>> I was thinking about other possible races, e.g., MADV_DONTNEED/MADV_FREE >>>>>> racing with UFFD_EVENT_PAGEFAULT -- where we only hold the mmap_lock in >>>>>> read mode. But not sure if they apply. >>>>> >>>>> The userspace can live with these, at least for uffd missing page faults. >>>>> If the monitor will try to resolve a page fault for a removed area, the >>>>> errno from UFFDIO_COPY/ZERO can be used to detect such races. >>>> >>>> I was wondering if the monitor could get confused if he just resolved a >>>> page fault via UFFDIO_COPY/ZERO and then receives a REMOVE event. >>> >>> And why would it be confused? >> >> My thinking was that the monitor might use REMOVE events to track which >> pages are actually populated. If you receive REMOVE after >> UFFDIO_COPY/ZERO the monitor would conclude that the page is not >> populated, just like if we'd get the MADV_DONTNEED/MADV_REMOVE >> immediately after placing a page. > > I still don't follow your usecase. > > In CRIU we simply discard whatever content we had to fill when there is > REMOVE event. If a page fault occurs in that region we use UFFDIO_ZEROPAGE, > just as it would happen in "normal" page fault processing > (note, CRIU does not support uffd with hugetlb or shmem) I think that the point that David makes is valid. There are use-cases in which you do need to know the order between user-initiated MADV_DONTNEED and page-faults. For instance, if you build a userspace paging mechanism, you need to know whether the page content is zero or whatever is held in the disk. I presume mmap_changing was designed for a similar purpose, assuming that if you had a page-fault that started before MADV_DONTNEED, and you try to serve it using copy-ioctl, the copy would fail. I think that this works only if you assume that there is a single UFFD monitor thread (that reads the uffd and issues appropriate ioctl’s), and that all operations are performed synchronously (which I am trying to avoid using io-uring). Otherwise, a copy ioctl that is initiated before MADV_DONTNEED (to resolve page-fault) can take place after the userspace was already notified of UFFD_EVENT_REMOVE (i.e., mmap_changing==0), and there is no way to cancel the copy that was initiated. As a result, following MADV_DONTNEED, the memory would not be zeroed. As for me, I decided that due to the lack of ordering, I just cannot use the UFFD events, and I have to rely on ptrace to obtain order of these events. I might be wrong, but any solution is not trivial and is likely to require API changes. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 22:39 ` Nadav Amit @ 2022-02-01 9:10 ` Mike Rapoport 2022-02-10 7:48 ` Peter Xu 1 sibling, 0 replies; 18+ messages in thread From: Mike Rapoport @ 2022-02-01 9:10 UTC (permalink / raw) To: Nadav Amit Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM On Mon, Jan 31, 2022 at 02:39:01PM -0800, Nadav Amit wrote: > > On Jan 31, 2022, at 10:47 AM, Mike Rapoport <rppt@kernel.org> wrote: > > > > On Mon, Jan 31, 2022 at 03:41:05PM +0100, David Hildenbrand wrote: > > As for me, I decided that due to the lack of ordering, I just > cannot use the UFFD events, and I have to rely on ptrace to obtain > order of these events. I might be wrong, but any solution is not > trivial and is likely to require API changes. I think you are right and any solution that will allow userspace to track the order of all events (page faults and address space changes) will require API changes. -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 22:39 ` Nadav Amit 2022-02-01 9:10 ` Mike Rapoport @ 2022-02-10 7:48 ` Peter Xu 2022-02-10 18:42 ` Nadav Amit 1 sibling, 1 reply; 18+ messages in thread From: Peter Xu @ 2022-02-10 7:48 UTC (permalink / raw) To: Nadav Amit Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM Hi, Nadav & all, On Mon, Jan 31, 2022 at 02:39:01PM -0800, Nadav Amit wrote: > There are use-cases in which you do need to know the order between > user-initiated MADV_DONTNEED and page-faults. For instance, if you > build a userspace paging mechanism, you need to know whether the > page content is zero or whatever is held in the disk. When there's no uffd monitor, concurrent page faults with MADV_DONTNEED can already result in undefined behavior, right? Assuming the page fault is a write with non-zero data, then the page can either contain zero or non-zero data at last, IIUC. If above is true, I'm wondering whether it's already impossible to do it right when there is an uffd monitor? Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-02-10 7:48 ` Peter Xu @ 2022-02-10 18:42 ` Nadav Amit 2022-02-14 4:02 ` Peter Xu 0 siblings, 1 reply; 18+ messages in thread From: Nadav Amit @ 2022-02-10 18:42 UTC (permalink / raw) To: Peter Xu Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM > On Feb 9, 2022, at 11:48 PM, Peter Xu <peterx@redhat.com> wrote: > > Hi, Nadav & all, > > On Mon, Jan 31, 2022 at 02:39:01PM -0800, Nadav Amit wrote: >> There are use-cases in which you do need to know the order between >> user-initiated MADV_DONTNEED and page-faults. For instance, if you >> build a userspace paging mechanism, you need to know whether the >> page content is zero or whatever is held in the disk. > > When there's no uffd monitor, concurrent page faults with MADV_DONTNEED can > already result in undefined behavior, right? Assuming the page fault is a write > with non-zero data, then the page can either contain zero or non-zero data > at last, IIUC. > > If above is true, I'm wondering whether it's already impossible to do it right > when there is an uffd monitor? I think that the MADV_DONTNEED/PF-resolution “race" only affects usage-models that handle the page-fault concurrently with UFFD-monitoring (using multiple monitor threads or IO-uring as I try to do). At least for use-cases such as live-migration. I think the scenario you have in mind is the following, which is resolved with mmap_changing that Mike introduced some time ago: UFFD monitor App thread #0 App thread #1 ------------ ------------- ------------- #PF UFFD Read [#PF] MADV_DONTNEED mmap_changing = 1 userfaultfd_event_wait_completion() [queue event, wait] UFFD-copy -EAGAIN since mmmap_changing > 0 mmap_changing will keep being elevated, and UFFD-copy not served (fail) until the monitor reads the UFFD event. The monitor, in this scenario, is single threaded and therefore orders UFFD-read and UFFD-copy, preventing them from racing. Assuming the monitor is smart enough to reevaluate the course of action after MADV_DONTNEED is handled, it should be safe. Personally, I do not like the burden that this scheme puts on the monitor, the fact it needs to retry or even the return value [I think it should be EBUSY since immediate retry would fail. With IO-uring, EAGAIN triggers an immediate retry, which is useless.] Yet, concurrent UFFD-event/#PF can be handled properly by a smart monitor. *However*, userfaultfd events seem as very hard to use (to say the least) in the following cases: 1. The UFFD-copy is issued by one thread and the UFFD-read is performed by another. For me this is the most painful even if you may consider it as “unorthodox”. It is very useful for performance, especially if the UFFD-copy is large. 2. If the race is between 2 userfaultfd *events*. The events might not be properly ordered (i.e., the order in which they are read does not reflect the order in which they occurred) despite the use of *_userfaultfd_prep(), since they are only queued (to be reported and trigger wake) by userfaultfd_event_wait_completion(), after the VMA and PTEs were updated and more importantly after mmap-lock was dropped. This means that if you have fork and MADV_DONTNEED, the monitor might see their order inverted, and won’t be able to know whether the child has the pages zapped or not. Other races are possible too, for instance between mremap() and munmap(). In most cases the monitor might be able (with quite some work) to figure out that the order of the events it received does not make sense and the events must have been reordered. Yet, implementing something like that is far from trivial and there are some cases that are probably impossible to resolve just based on the UFFD read events. I personally addressed this issue with seccomp+ptrace to trap on entry/exit to relevant syscalls (e.g., munmap, mmap, fork), and prevent concurrent calls to obtain correct order of the events. It is far from trivial and introduces some overheads, but I did not find a better solution. Let me know if I am missing anything. Thanks, Nadav ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-02-10 18:42 ` Nadav Amit @ 2022-02-14 4:02 ` Peter Xu 2022-02-15 22:35 ` Nadav Amit 0 siblings, 1 reply; 18+ messages in thread From: Peter Xu @ 2022-02-14 4:02 UTC (permalink / raw) To: Nadav Amit Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM Nadav, On Thu, Feb 10, 2022 at 10:42:43AM -0800, Nadav Amit wrote: > I think that the MADV_DONTNEED/PF-resolution “race" only affects usage-models > that handle the page-fault concurrently with UFFD-monitoring (using multiple > monitor threads or IO-uring as I try to do). At least for use-cases such as > live-migration. > > I think the scenario you have in mind is the following, which is resolved > with mmap_changing that Mike introduced some time ago: > > UFFD monitor App thread #0 App thread #1 > ------------ ------------- ------------- > #PF > UFFD Read > [#PF] > MADV_DONTNEED > mmap_changing = 1 > > userfaultfd_event_wait_completion() > [queue event, wait] > UFFD-copy > -EAGAIN since mmmap_changing > 0 > > > mmap_changing will keep being elevated, and UFFD-copy not served (fail) until > the monitor reads the UFFD event. The monitor, in this scenario, is single > threaded and therefore orders UFFD-read and UFFD-copy, preventing them from > racing. > > Assuming the monitor is smart enough to reevaluate the course of action after > MADV_DONTNEED is handled, it should be safe. Personally, I do not like the > burden that this scheme puts on the monitor, the fact it needs to retry or > even the return value [I think it should be EBUSY since immediate retry would > fail. With IO-uring, EAGAIN triggers an immediate retry, which is useless.] > Yet, concurrent UFFD-event/#PF can be handled properly by a smart monitor. > > *However*, userfaultfd events seem as very hard to use (to say the least) in > the following cases: > > 1. The UFFD-copy is issued by one thread and the UFFD-read is performed by > another. For me this is the most painful even if you may consider it > as “unorthodox”. It is very useful for performance, especially if the > UFFD-copy is large. This is definitely a valid use case for uffd, and IMHO that's a good base model when the uffd app is performance critical. > > 2. If the race is between 2 userfaultfd *events*. The events might not be > properly ordered (i.e., the order in which they are read does not reflect > the order in which they occurred) despite the use of *_userfaultfd_prep(), > since they are only queued (to be reported and trigger wake) by > userfaultfd_event_wait_completion(), after the VMA and PTEs were updated > and more importantly after mmap-lock was dropped. > > This means that if you have fork and MADV_DONTNEED, the monitor might see > their order inverted, and won’t be able to know whether the child has the > pages zapped or not. > > Other races are possible too, for instance between mremap() and munmap(). > In most cases the monitor might be able (with quite some work) to > figure out that the order of the events it received does not make sense > and the events must have been reordered. Yet, implementing something like > that is far from trivial and there are some cases that are probably > impossible to resolve just based on the UFFD read events. > > I personally addressed this issue with seccomp+ptrace to trap on > entry/exit to relevant syscalls (e.g., munmap, mmap, fork), and > prevent concurrent calls to obtain correct order of the events. It is far > from trivial and introduces some overheads, but I did not find a better > solution. Thanks for explaining. I also digged out the discussion threads between you and Mike and that's a good one too summarizing the problems: https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/ Scenario 4 is kind of special imho along all those, because that's the only one that can be workarounded by user application by only copying pages one by one. I know you were even leveraging iouring in your local tree, so that's probably not a solution at all for you. But I'm just trying to start thinking without that scenario for now. Per my understanding, a major issue regarding the rest of the scenarios is ordering of uffd messages may not match with how things are happening. This actually contains two problems. First of all, mmap_sem is mostly held read for all page faults and most of the mm changes except e.g. fork, then we can never serialize them. Not to mention uffd events releases mmap_sem within prep and completion. Let's call it problem 1. The other problem 2 is we can never serialize faults against events. For problem 1, I do sense something that mmap_sem is just not suitable for uffd scenario. Say, we grant concurrent with most of the events like dontneed and mremap, but when uffd ordering is a concern we may not want to grant that concurrency. I'm wondering whether it means uffd may need its own semaphore to achieve this. So for all events that uffd cares we take write lock on a new uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of events, not until completion (the message is read). It'll slow down uffd tracked systems but guarantees ordering. At the meantime, I'm wildly thinking whether we can tackle with the other problem by merging the page fault queue with the event queue, aka, event_wqh and fault_pending_wqh. Obviously we'll need to identify the messages when read() and conditionally move then into fault_wqh only if they come from page faults, but that seems doable? Not sure above makes any sense, as I could have missed something. Meanwhile I think even if we order all the messages to match with facts there're still some other issues that are outliers of this, but let's see how it sounds so far. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-02-14 4:02 ` Peter Xu @ 2022-02-15 22:35 ` Nadav Amit 2022-02-16 8:27 ` Peter Xu 2022-02-17 21:15 ` Mike Rapoport 0 siblings, 2 replies; 18+ messages in thread From: Nadav Amit @ 2022-02-15 22:35 UTC (permalink / raw) To: Peter Xu Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote: > > Thanks for explaining. > > I also digged out the discussion threads between you and Mike and that's a good > one too summarizing the problems: > > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/ > > Scenario 4 is kind of special imho along all those, because that's the only one > that can be workarounded by user application by only copying pages one by one. > I know you were even leveraging iouring in your local tree, so that's probably > not a solution at all for you. But I'm just trying to start thinking without > that scenario for now. > > Per my understanding, a major issue regarding the rest of the scenarios is > ordering of uffd messages may not match with how things are happening. This > actually contains two problems. > > First of all, mmap_sem is mostly held read for all page faults and most of the > mm changes except e.g. fork, then we can never serialize them. Not to mention > uffd events releases mmap_sem within prep and completion. Let's call it > problem 1. > > The other problem 2 is we can never serialize faults against events. > > For problem 1, I do sense something that mmap_sem is just not suitable for uffd > scenario. Say, we grant concurrent with most of the events like dontneed and > mremap, but when uffd ordering is a concern we may not want to grant that > concurrency. I'm wondering whether it means uffd may need its own semaphore to > achieve this. So for all events that uffd cares we take write lock on a new > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of > events, not until completion (the message is read). It'll slow down uffd > tracked systems but guarantees ordering. Peter, Thanks for finding the time and looking into the issues that I encountered. Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem after mmap_lock, since it might cause deadlocks (e.g., if a process uses events to manage its own memory). > > At the meantime, I'm wildly thinking whether we can tackle with the other > problem by merging the page fault queue with the event queue, aka, event_wqh > and fault_pending_wqh. Obviously we'll need to identify the messages when > read() and conditionally move then into fault_wqh only if they come from page > faults, but that seems doable? This, I guess is necessary in addition to your aforementioned proposal to have some semaphore protecting, can do the trick. While I got your attention, let me share some other challenges I encountered using userfaultfd. They might be unrelated, but perhaps you can keep them in the back of your mind. Nobody should suffer as I did ;-) 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than it should be, especially when using io-uring as I wish to do. I think it is not too hard to address by changing the API. For instance, if uffd-ctx had a uffd-generation that would increase on each event, the user could have provided an ioctl-generation as part of copy/zero/etc ioctls, and the kernel would only fail the operation if ioctl copy/zero/etc operation only succeeds if the uffd-generation is lower/equal than the one provided by the user. 2. userfaultfd is separated from other tracing/instrumentation mechanisms in the kernel. I, for instance, also wanted to track mmap events (let’s put aside for a second why). Tracking these events can be done with ptrace or perf_event_open() but then it is hard to correlate these events with userfaultfd. It would have been easier for users, I think, if userfaultfd notifications were provided through ptrace/tracepoints mechanisms as well. 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd concurrently. This seems as a general problem that I believe ptrace suffers from too. I know it might seem far-fetched to have 2 monitors at the moment, but I think that any tracking/instrumentation mechanism (e.g., ptrace, software-dirty, not to mention hardware virtualization) should be designed from the beginning with such support as adding it in a later stage can be tricky. 4. Missing state. It would be useful to provide the TID of the faulting thread. I will send a patch for this one once I get the necessary internal approvals. Thanks again, Nadav ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-02-15 22:35 ` Nadav Amit @ 2022-02-16 8:27 ` Peter Xu 2022-02-17 21:15 ` Mike Rapoport 1 sibling, 0 replies; 18+ messages in thread From: Peter Xu @ 2022-02-16 8:27 UTC (permalink / raw) To: Nadav Amit Cc: Mike Rapoport, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote: > > > > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote: > > > > Thanks for explaining. > > > > I also digged out the discussion threads between you and Mike and that's a good > > one too summarizing the problems: > > > > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/ > > > > Scenario 4 is kind of special imho along all those, because that's the only one > > that can be workarounded by user application by only copying pages one by one. > > I know you were even leveraging iouring in your local tree, so that's probably > > not a solution at all for you. But I'm just trying to start thinking without > > that scenario for now. > > > > Per my understanding, a major issue regarding the rest of the scenarios is > > ordering of uffd messages may not match with how things are happening. This > > actually contains two problems. > > > > First of all, mmap_sem is mostly held read for all page faults and most of the > > mm changes except e.g. fork, then we can never serialize them. Not to mention > > uffd events releases mmap_sem within prep and completion. Let's call it > > problem 1. > > > > The other problem 2 is we can never serialize faults against events. > > > > For problem 1, I do sense something that mmap_sem is just not suitable for uffd > > scenario. Say, we grant concurrent with most of the events like dontneed and > > mremap, but when uffd ordering is a concern we may not want to grant that > > concurrency. I'm wondering whether it means uffd may need its own semaphore to > > achieve this. So for all events that uffd cares we take write lock on a new > > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of > > events, not until completion (the message is read). It'll slow down uffd > > tracked systems but guarantees ordering. > > Peter, > > Thanks for finding the time and looking into the issues that I encountered. > > Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem > after mmap_lock, since it might cause deadlocks (e.g., if a process uses events > to manage its own memory). Right, it's unsafe if to be taken after mmap_sem. If to do so IIUC we need to take it before mmap_sem hence we can release mmap_sem under it. In my mind that could be a feature bit UFFD_FEATURE_STRICT_ORDERING, when it's set then the mm bound to the userfaultfd file will have a flag set within the mm->flags, let's say MMF_UFFD_STRICT_ORDER. Then for uffd related syscalls like fork(), mremap() and so on we conditionally take that uffd_sem and we need to do that before mmap_sem. We take it write for all the uffd event contexts, and take it read for all the uffd page faults. But even if above would work again I have little confidence that it'll work in reality. Firstly it does look odd already that an uffd lock needs to be taken before the whole mm's, starting to affect common workloads even not using uffd (even the flag lookup could affect cacheline, I think, but not sure how slower it would be). Not to mention that should greatly slow down the tracee process. It definitely needs more thoughts anyway. > > > > > At the meantime, I'm wildly thinking whether we can tackle with the other > > problem by merging the page fault queue with the event queue, aka, event_wqh > > and fault_pending_wqh. Obviously we'll need to identify the messages when > > read() and conditionally move then into fault_wqh only if they come from page > > faults, but that seems doable? > > This, I guess is necessary in addition to your aforementioned proposal to have > some semaphore protecting, can do the trick. > > While I got your attention, let me share some other challenges I encountered > using userfaultfd. They might be unrelated, but perhaps you can keep them in > the back of your mind. Nobody should suffer as I did ;-) Heh. > > 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than > it should be, especially when using io-uring as I wish to do. > > I think it is not too hard to address by changing the API. For instance, if > uffd-ctx had a uffd-generation that would increase on each event, the user > could have provided an ioctl-generation as part of copy/zero/etc ioctls, and > the kernel would only fail the operation if ioctl copy/zero/etc operation > only succeeds if the uffd-generation is lower/equal than the one provided by > the user. Assuming that gen_id is copied over from the uffd message, and if that counter only increases, then I don't understand why it can be lower than the user provided. I don't quite get how that solves your problem too, since -EAGAIN can still trigger. I must have missed something. > > 2. userfaultfd is separated from other tracing/instrumentation mechanisms in > the kernel. I, for instance, also wanted to track mmap events (let’s put > aside for a second why). Tracking these events can be done with ptrace or > perf_event_open() but then it is hard to correlate these events with > userfaultfd. It would have been easier for users, I think, if userfaultfd > notifications were provided through ptrace/tracepoints mechanisms as well. > > 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd > concurrently. This seems as a general problem that I believe ptrace suffers > from too. I know it might seem far-fetched to have 2 monitors at the moment, > but I think that any tracking/instrumentation mechanism (e.g., ptrace, > software-dirty, not to mention hardware virtualization) should be designed > from the beginning with such support as adding it in a later stage can be > tricky. 2) and 3) definitely need more thoughts.. PS: I think I first read your name from a paper on the nested virt. :-) But I forgot the details. > > 4. Missing state. It would be useful to provide the TID of the faulting > thread. I will send a patch for this one once I get the necessary > internal approvals. Before I fully digest your reply and the problems, I want to make sure you are aware of UFFD_FEATURE_THREAD_ID.. I don't know how you missed it, but it does sound like what you wanted. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-02-15 22:35 ` Nadav Amit 2022-02-16 8:27 ` Peter Xu @ 2022-02-17 21:15 ` Mike Rapoport 1 sibling, 0 replies; 18+ messages in thread From: Mike Rapoport @ 2022-02-17 21:15 UTC (permalink / raw) To: Nadav Amit Cc: Peter Xu, David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Linux-MM On Tue, Feb 15, 2022 at 02:35:09PM -0800, Nadav Amit wrote: > > > > On Feb 13, 2022, at 8:02 PM, Peter Xu <peterx@redhat.com> wrote: > > > > Thanks for explaining. > > > > I also digged out the discussion threads between you and Mike and that's a good > > one too summarizing the problems: > > > > https://lore.kernel.org/all/5921BA80-F263-4F8D-B7E6-316CEB602B51@gmail.com/ > > > > Scenario 4 is kind of special imho along all those, because that's the only one > > that can be workarounded by user application by only copying pages one by one. > > I know you were even leveraging iouring in your local tree, so that's probably > > not a solution at all for you. But I'm just trying to start thinking without > > that scenario for now. > > > > Per my understanding, a major issue regarding the rest of the scenarios is > > ordering of uffd messages may not match with how things are happening. This > > actually contains two problems. > > > > First of all, mmap_sem is mostly held read for all page faults and most of the > > mm changes except e.g. fork, then we can never serialize them. Not to mention > > uffd events releases mmap_sem within prep and completion. Let's call it > > problem 1. > > > > The other problem 2 is we can never serialize faults against events. > > > > For problem 1, I do sense something that mmap_sem is just not suitable for uffd > > scenario. Say, we grant concurrent with most of the events like dontneed and > > mremap, but when uffd ordering is a concern we may not want to grant that > > concurrency. I'm wondering whether it means uffd may need its own semaphore to > > achieve this. So for all events that uffd cares we take write lock on a new > > uffd_sem after mmap_sem, meanwhile we don't release that uffd_sem after prep of > > events, not until completion (the message is read). It'll slow down uffd > > tracked systems but guarantees ordering. > > Peter, > > Thanks for finding the time and looking into the issues that I encountered. > > Your approach sounds possible, but it sounds to me unsafe to acquire uffd_sem > after mmap_lock, since it might cause deadlocks (e.g., if a process uses events > to manage its own memory). > > > > > At the meantime, I'm wildly thinking whether we can tackle with the other > > problem by merging the page fault queue with the event queue, aka, event_wqh > > and fault_pending_wqh. Obviously we'll need to identify the messages when > > read() and conditionally move then into fault_wqh only if they come from page > > faults, but that seems doable? > > This, I guess is necessary in addition to your aforementioned proposal to have > some semaphore protecting, can do the trick. > > While I got your attention, let me share some other challenges I encountered > using userfaultfd. They might be unrelated, but perhaps you can keep them in > the back of your mind. Nobody should suffer as I did ;-) > > 1. mmap_changing (i.e., -EAGAIN on ioctls) makes using userfaultfd harder than > it should be, especially when using io-uring as I wish to do. > > I think it is not too hard to address by changing the API. For instance, if > uffd-ctx had a uffd-generation that would increase on each event, the user > could have provided an ioctl-generation as part of copy/zero/etc ioctls, and > the kernel would only fail the operation if ioctl copy/zero/etc operation > only succeeds if the uffd-generation is lower/equal than the one provided by > the user. Do you mean that if there were page faults with generations 1 and 3 and, say, MADV_DONTNEED with generation 2, then even if the uffd copy that resolves page fault 1 races with MADV_DONTNEED it will go through and the copy for page fault 3 will fail? But how would you order zapping the pages and copying into them internally? Or may understanding of your idea was completely off? As for technicality of adding a generation to uffd_msg and to uffdio_{copy,zero,etc}, we can use __u32 reserved in the first one and 32 bits from mode in the second with a bit of care for wraparound. > 2. userfaultfd is separated from other tracing/instrumentation mechanisms in > the kernel. I, for instance, also wanted to track mmap events (let’s put > aside for a second why). Tracking these events can be done with ptrace or > perf_event_open() but then it is hard to correlate these events with > userfaultfd. It would have been easier for users, I think, if userfaultfd > notifications were provided through ptrace/tracepoints mechanisms as well. This sounds like opening Pandora box ;-) I think it's possible to trace userfaultfd events to some extent with a probe at userfaultfd_event_wait_completion() entry and handle_userfault(). The "interesting" information is passed to these functions as parameters and I believe all the data can be extracted with tools like bpftrace. > 3. Nesting/chaining. It is not easy to allow two monitors to use userfaultfd > concurrently. This seems as a general problem that I believe ptrace suffers > from too. I know it might seem far-fetched to have 2 monitors at the moment, > but I think that any tracking/instrumentation mechanism (e.g., ptrace, > software-dirty, not to mention hardware virtualization) should be designed > from the beginning with such support as adding it in a later stage can be > tricky. It's not too far fetched to have nested userfaultfd contexts even now. If CRIU would need to post-copy restore a process that uses userfaultfd it will need to deal with nested uffds. > Thanks again, > Nadav -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 10:42 ` Mike Rapoport 2022-01-31 10:48 ` David Hildenbrand @ 2022-01-31 17:23 ` Nadav Amit 2022-01-31 17:28 ` David Hildenbrand 1 sibling, 1 reply; 18+ messages in thread From: Nadav Amit @ 2022-01-31 17:23 UTC (permalink / raw) To: Mike Rapoport Cc: David Hildenbrand, Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM > On Jan 31, 2022, at 2:42 AM, Mike Rapoport <rppt@kernel.org> wrote: > > Hi Nadav, > > On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: >> Using userfautlfd and looking at the kernel code, I encountered a usability >> issue that complicates userspace UFFD-monitor implementation. I obviosuly >> might be wrong, so I would appreciate a (polite?) feedback. I do have a >> userspace workaround, but I thought it is worthy to share and to hear your >> opinion, as well as feedback from other UFFD users. >> >> The issue I encountered regards the ordering of UFFD events tbat might not >> reflect the actual order in which events took place. >> >> In more detail, UFFD events (e.g., unmap, fork) are not ordered against >> themselves [*]. The mm-lock is dropped before notifying the userspace >> UFFD-monitor, and therefore there is no guarantee as to whether the order of >> the events actually reflects the order in which the events took place. >> This can prevent a UFFD-monitor from using the events to track which >> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a >> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can >> be reordered, if the events are triggered by two different threads. In >> this case the UFFD-monitor cannot figure from the events whether the >> child process has the unmapped memory range still mapped (because fork >> happened first) or not. > > Yeah, it seems that something like this is possible: > > > fork() munmap() > mmap_write_unlock(); > mmap_write_lock_killable(); > do_things(); > mmap_{read,write}_unlock(); > userfaultfd_unmap_complete(); > dup_userfaultfd_complete(); > > A solution could be to split uffd_*_complete() to two parts: one that > queues up the event message and the second one that waits for it to be read > by the monitor. The first part then can run befor mm-lock is released. > > If you can think of something nicer, it'll be really great! Thanks for the quick response. Your solution is possible, but then the order between events and page-faults is certainly not kept - as David mentioned: regardless of mm-lock that is not always taken for write, events and page-faults are on two separate lists, and queued page-faults are reported before events. I am also not sure how simple/performant it is, since it would require an additional refcount for userfaultfd_wait_queue to prevent it from disappearing between the time it is enqueued to the time it blocks. Another option is to associate some “generation” or “sequence number” with every event and change the PAI to include it. It still leaves the problem of ordering MADV_DONTNEED and page-faults though. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: userfaultfd: usability issue due to lack of UFFD events ordering 2022-01-31 17:23 ` Nadav Amit @ 2022-01-31 17:28 ` David Hildenbrand 0 siblings, 0 replies; 18+ messages in thread From: David Hildenbrand @ 2022-01-31 17:28 UTC (permalink / raw) To: Nadav Amit, Mike Rapoport Cc: Mike Rapoport, Andrea Arcangeli, Peter Xu, Linux-MM On 31.01.22 18:23, Nadav Amit wrote: > >> On Jan 31, 2022, at 2:42 AM, Mike Rapoport <rppt@kernel.org> wrote: >> >> Hi Nadav, >> >> On Sat, Jan 29, 2022 at 10:23:55PM -0800, Nadav Amit wrote: >>> Using userfautlfd and looking at the kernel code, I encountered a usability >>> issue that complicates userspace UFFD-monitor implementation. I obviosuly >>> might be wrong, so I would appreciate a (polite?) feedback. I do have a >>> userspace workaround, but I thought it is worthy to share and to hear your >>> opinion, as well as feedback from other UFFD users. >>> >>> The issue I encountered regards the ordering of UFFD events tbat might not >>> reflect the actual order in which events took place. >>> >>> In more detail, UFFD events (e.g., unmap, fork) are not ordered against >>> themselves [*]. The mm-lock is dropped before notifying the userspace >>> UFFD-monitor, and therefore there is no guarantee as to whether the order of >>> the events actually reflects the order in which the events took place. >>> This can prevent a UFFD-monitor from using the events to track which >>> ranges are mapped. Specifically, UFFD_EVENT_FORK message and a >>> UFFD_EVENT_UNMAP message (which reflects unmap in the parent process) can >>> be reordered, if the events are triggered by two different threads. In >>> this case the UFFD-monitor cannot figure from the events whether the >>> child process has the unmapped memory range still mapped (because fork >>> happened first) or not. >> >> Yeah, it seems that something like this is possible: >> >> >> fork() munmap() >> mmap_write_unlock(); >> mmap_write_lock_killable(); >> do_things(); >> mmap_{read,write}_unlock(); >> userfaultfd_unmap_complete(); >> dup_userfaultfd_complete(); >> >> A solution could be to split uffd_*_complete() to two parts: one that >> queues up the event message and the second one that waits for it to be read >> by the monitor. The first part then can run befor mm-lock is released. >> >> If you can think of something nicer, it'll be really great! > > Thanks for the quick response. Your solution is possible, but then the > order between events and page-faults is certainly not kept - as David > mentioned: regardless of mm-lock that is not always taken for write, > events and page-faults are on two separate lists, and queued page-faults > are reported before events. Of course, for the issue I brought up (if it's a real issue), the question is if we could "adjust the documentation" to state that there are no ordering guarantees. IMHO at least the fork()+munmap() needs a proper fix, because otherwise, we might really end up with an API that's partially useless -- as you correctly state. > > I am also not sure how simple/performant it is, since it would require > an additional refcount for userfaultfd_wait_queue to prevent it from > disappearing between the time it is enqueued to the time it blocks. > > Another option is to associate some “generation” or “sequence number” > with every event and change the PAI to include it. It still leaves the > problem of ordering MADV_DONTNEED and page-faults though. > My first thought was to include a timestamp. But requiring user space to restore the order based on a timestamp might be really ... weird. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2022-02-17 21:15 UTC | newest] Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-01-30 6:23 userfaultfd: usability issue due to lack of UFFD events ordering Nadav Amit 2022-01-31 10:42 ` Mike Rapoport 2022-01-31 10:48 ` David Hildenbrand 2022-01-31 14:05 ` Mike Rapoport 2022-01-31 14:12 ` David Hildenbrand 2022-01-31 14:28 ` Mike Rapoport 2022-01-31 14:41 ` David Hildenbrand 2022-01-31 18:47 ` Mike Rapoport 2022-01-31 22:39 ` Nadav Amit 2022-02-01 9:10 ` Mike Rapoport 2022-02-10 7:48 ` Peter Xu 2022-02-10 18:42 ` Nadav Amit 2022-02-14 4:02 ` Peter Xu 2022-02-15 22:35 ` Nadav Amit 2022-02-16 8:27 ` Peter Xu 2022-02-17 21:15 ` Mike Rapoport 2022-01-31 17:23 ` Nadav Amit 2022-01-31 17:28 ` David Hildenbrand
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).