* [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes
From: Nadav Amit @ 2020-12-25 9:25 UTC
To: linux-mm
Cc: linux-kernel, Nadav Amit, Andrea Arcangeli, Yu Zhao, Andy Lutomirski,
    Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim,
    Will Deacon, Peter Zijlstra

From: Nadav Amit <namit@vmware.com>

This patch-set went from v1 to RFCv2, as there is still an ongoing
discussion regarding the way of solving the recently found races due to
deferred TLB flushes. These patches are only sent for reference for now,
and can be applied later if no better solution is taken.

In a nutshell, write-protecting PTEs with deferred TLB flushes was mostly
performed while holding mmap_lock for write. This prevented concurrent
page-fault handler invocations from mistakenly assuming that a page is
write-protected when in fact, due to the deferred TLB flush, another CPU
could still write to the page. Such a write can cause memory corruption
if it takes place after the page was copied (in cow_user_page()), and
before the PTE was flushed (by wp_page_copy()).

However, the userfaultfd and soft-dirty mechanisms did not take
mmap_lock for write, but only for read, which made such races possible.
Since commit 09854ba94c6a ("mm: do_wp_page() simplification") these
races became more likely to take place as non-COW'd pages are more
likely to be COW'd instead of being reused. Both of the races that
these patches are intended to resolve were produced on v5.10.

To avoid the performance overhead, some alternative solutions that do
not require acquiring mmap_lock for write were proposed, specifically
for userfaultfd. So far no better solution that can be backported was
proposed for the soft-dirty case.

v1->RFCv2:
- Better (i.e., correct) description of the userfaultfd buggy case [Yu]
- Patch for the soft-dirty case

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>

Nadav Amit (2):
  mm/userfaultfd: fix memory corruption due to writeprotect
  fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup

 fs/proc/task_mmu.c | 27 +++++++++++++--------------
 mm/mprotect.c      |  3 ++-
 mm/userfaultfd.c   | 15 +++++++++++++--
 3 files changed, 28 insertions(+), 17 deletions(-)

-- 
2.25.1
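The window described in the cover letter (a write that lands after the copy but before the stale TLB entry is flushed) can be replayed in user space. The following is only a minimal sketch under stated assumptions: it is a single-threaded model, not kernel code, and `model_page`, `MODEL_PAGE_SIZE` and `write_survives()` are hypothetical names invented for illustration. It also assumes that, once the stale entry is gone, the writer faults and its write is retried against the new page.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 8

/* Hypothetical stand-in for a physical page; not a kernel type. */
struct model_page {
	char data[MODEL_PAGE_SIZE];
};

/*
 * Replay the interleaving from the cover letter in program order:
 *   1. the fault handler copies the old page (the role of cow_user_page()),
 *   2. another CPU issues a write,
 *   3. the PTE ends up pointing at the copy.
 *
 * stale_tlb_entry == true:  the writer still has a writable translation,
 *                           so its write hits the OLD page after the copy.
 * stale_tlb_entry == false: the translation was flushed in time, so (in
 *                           this simplified model) the write is retried
 *                           and lands in the NEW page.
 *
 * Returns true if the write is visible in the page that stays mapped.
 */
static bool write_survives(bool stale_tlb_entry)
{
	struct model_page old_page, new_page;

	memset(&old_page, 'a', sizeof(old_page));

	new_page = old_page;                /* step 1: copy the page */

	if (stale_tlb_entry)
		old_page.data[0] = 'W';     /* step 2: write goes to the old page */
	else
		new_page.data[0] = 'W';     /* step 2: write is retried on the copy */

	/* step 3: only new_page is reachable through the PTE from here on */
	return new_page.data[0] == 'W';
}
```

Calling `write_survives(true)` models the corrupting case, where the write is silently lost; `write_survives(false)` models the case in which the flush completed before the fault was handled.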
* [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Nadav Amit @ 2020-12-25 9:25 UTC
To: linux-mm
Cc: linux-kernel, Nadav Amit, Andrea Arcangeli, Yu Zhao, Andy Lutomirski,
    Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim,
    Will Deacon, Peter Zijlstra

From: Nadav Amit <namit@vmware.com>

Userfaultfd self-test fails occasionally, indicating a memory
corruption.

Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range() and defers
flushes, and since there is insufficient consideration of concurrent
deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from
the TLBs in wp_page_copy(), this flush takes place after the copy has
already been performed, and therefore changes of the page are possible
between the time of the copy and the time in which the PTE is flushed.

To make matters worse, memory-unprotection using userfaultfd also poses
a problem. Although memory unprotection is logically a promotion of PTE
permissions, and therefore should not require a TLB flush, the current
userfaultfd code might actually cause a demotion of the architectural
PTE permission: when userfaultfd_writeprotect() unprotects a memory
region, it unintentionally *clears* the RW-bit if it was already set.
Note that unprotecting a PTE that is not write-protected is a valid
use-case: the userfaultfd monitor might ask to unprotect a region that
holds both write-protected and write-unprotected PTEs.

The scenario that happens in selftests/vm/userfaultfd is as follows:

cpu0                            cpu1                    cpu2
----                            ----                    ----
                                                        [ Writable PTE
                                                          cached in TLB ]
userfaultfd_writeprotect()
[ write-*unprotect* ]
mwriteprotect_range()
mmap_read_lock()
change_protection()

change_protection_range()
...
change_pte_range()
[ *clear* “write”-bit ]
[ defer TLB flushes ]
                                [ page-fault ]
                                ...
                                wp_page_copy()
                                 cow_user_page()
                                  [ copy page ]
                                                        [ write to old
                                                          page ]
                                ...
                                 set_pte_at_notify()

A similar scenario can happen:

cpu0                        cpu1                        cpu2                cpu3
----                        ----                        ----                ----
                                                                            [ Writable PTE
                                                                              cached in TLB ]
userfaultfd_writeprotect()
[ write-protect ]
[ deferred TLB flush ]
                            userfaultfd_writeprotect()
                            [ write-unprotect ]
                            [ deferred TLB flush]
                                                        [ page-fault ]
                                                        wp_page_copy()
                                                         cow_user_page()
                                                         [ copy page ]
                                                         ...                [ write to page ]
                                                        set_pte_at_notify()

As Yu Zhao pointed out, these races became more apparent since commit
09854ba94c6a ("mm: do_wp_page() simplification") which made
wp_page_copy() more likely to take place, specifically if
page_count(page) > 1.

Note that one might consider additional potentially dangerous scenarios,
which are not directly related to the deferred TLB flushes. A memory
corruption might in theory occur if after the page is copied by
cow_user_page() and before the PTE is set, the PTE is write-unprotected
(by a concurrent page-fault handler) and then protected again (by
subsequent calls to userfaultfd_writeprotect() to protect and unprotect
the page). In practice, it seems that such scenarios cannot happen.

To resolve the aforementioned races, acquire mmap_lock for write when
write-protecting a userfaultfd region using ioctl's. Keep acquiring
mmap_lock for read when unprotecting memory, but do not change the
write-bit set when performing userfaultfd write-unprotection.

This solution can introduce a performance regression to userfaultfd
write-protection.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c    |  3 ++-
 mm/userfaultfd.c | 15 +++++++++++++--
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..c08c4055b051 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);

 			/*
 			 * Avoid trapping faults against the zero or KSM
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9a3d451402d7..7423808640ef 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -652,7 +652,15 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	/* Does the address range wrap, or is the span zero-sized? */
 	BUG_ON(start + len <= start);

-	mmap_read_lock(dst_mm);
+	/*
+	 * Although we do not change the VMA, we have to ensure deferred TLB
+	 * flushes are performed before page-faults can be handled. Otherwise
+	 * we can get inconsistent TLB state.
+	 */
+	if (enable_wp)
+		mmap_write_lock(dst_mm);
+	else
+		mmap_read_lock(dst_mm);

 	/*
 	 * If memory mappings are changing because of non-cooperative
@@ -686,6 +694,9 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,

 	err = 0;
 out_unlock:
-	mmap_read_unlock(dst_mm);
+	if (enable_wp)
+		mmap_write_unlock(dst_mm);
+	else
+		mmap_read_unlock(dst_mm);
 	return err;
 }
-- 
2.25.1
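The two decisions this patch changes can be written out as plain predicates. This is a hedged user-space sketch, not the kernel implementation: it models only the boolean that the mm/mprotect.c hunk computes and the lock mode that the mm/userfaultfd.c hunk selects. The names `model_lock_mode`, `model_lock_mode_for()`, `preserve_write_before()` and `preserve_write_after()` are hypothetical, and how the kernel later consumes `preserve_write` is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for "mmap_lock taken for read" vs. "for write". */
enum model_lock_mode { MODEL_LOCK_READ, MODEL_LOCK_WRITE };

/*
 * Lock mode chosen by mwriteprotect_range() with the patch applied:
 * write-protecting takes the lock for write, unprotecting keeps
 * taking it for read.
 */
static enum model_lock_mode model_lock_mode_for(bool enable_wp)
{
	return enable_wp ? MODEL_LOCK_WRITE : MODEL_LOCK_READ;
}

/* preserve_write as computed before the patch. */
static bool preserve_write_before(bool prot_numa, bool uffd_wp_resolve,
				  bool pte_was_writable)
{
	(void)uffd_wp_resolve;  /* not consulted before the patch */
	return prot_numa && pte_was_writable;
}

/* preserve_write as computed with the patch applied. */
static bool preserve_write_after(bool prot_numa, bool uffd_wp_resolve,
				 bool pte_was_writable)
{
	return (prot_numa || uffd_wp_resolve) && pte_was_writable;
}
```

The only input for which the two predicates differ is a userfaultfd write-unprotect (`uffd_wp_resolve` set, `prot_numa` clear) applied to a PTE that was already writable, which is the case the commit message describes as an unintended permission demotion.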
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Peter Zijlstra @ 2021-01-04 12:22 UTC
To: Nadav Amit
Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli, Yu Zhao,
    Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport,
    Minchan Kim, Will Deacon, Mel Gorman

On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:

> The scenario that happens in selftests/vm/userfaultfd is as follows:
>
> cpu0                            cpu1                    cpu2
> ----                            ----                    ----
>                                                         [ Writable PTE
>                                                           cached in TLB ]
> userfaultfd_writeprotect()
> [ write-*unprotect* ]
> mwriteprotect_range()
> mmap_read_lock()
> change_protection()
>
> change_protection_range()
> ...
> change_pte_range()
> [ *clear* “write”-bit ]
> [ defer TLB flushes ]
>                                 [ page-fault ]
>                                 ...
>                                 wp_page_copy()
>                                  cow_user_page()
>                                   [ copy page ]
>                                                         [ write to old
>                                                           page ]
>                                 ...
>                                  set_pte_at_notify()

Yuck!

Isn't this all rather similar to the problem that resulted in the
tlb_flush_pending mess?

I still think that's all fundamentally buggered, the much saner solution
(IMO) would've been to make things wait for the pending flush, instead
of doing a local flush and fudging things like we do now.

Then the above could be fixed by having wp_page_copy() wait for the
pending invalidate (although a more fine-grained pending state would be
awesome).

The below probably doesn't compile and will probably cause massive
header fail at the very least, but does show the general idea.

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 07d9acb5b19c..0210547ac424 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
 	 *
 	 * Therefore we must rely on tlb_flush_*() to guarantee order.
 	 */
-	atomic_dec(&mm->tlb_flush_pending);
+	if (atomic_dec_and_test(&mm->tlb_flush_pending))
+		wake_up_var(&mm->tlb_flush_pending);
 }

 static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
@@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }

+static inline void wait_tlb_flush_pending(struct mm_struct *mm)
+{
+	wait_var_event(&mm->tlb_flush_pending,
+		       atomic_read(&mm->tlb_flush_pending) == 0);
+}
+
 struct vm_fault;

 /**
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..3c36bca2972a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3087,6 +3087,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;

+	wait_tlb_flush_pending(vma->vm_mm);
+
 	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
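The counting scheme behind this proposal can be illustrated in user space with C11 atomics. This is a sketch under stated assumptions, not the kernel code: `model_mm`, `model_inc_pending()`, `model_dec_pending()` and `model_must_wait()` are hypothetical names, blocking and wake-up are reduced to return values, and the locking questions that matter in the kernel (for example which locks are held while waiting) are not modeled at all.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the tlb_flush_pending counter in mm_struct. */
struct model_mm {
	atomic_int tlb_flush_pending;
};

/* A flusher announces that it has started deferring a TLB flush. */
static void model_inc_pending(struct model_mm *mm)
{
	atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

/*
 * A flusher reports that its flush is done. Returns true when the counter
 * dropped to zero, i.e. the point at which the proposal would wake waiters
 * (the "dec and test" step).
 */
static bool model_dec_pending(struct model_mm *mm)
{
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	return atomic_fetch_sub(&mm->tlb_flush_pending, 1) == 1;
}

/* The condition a write-protect fault would wait on in this model. */
static bool model_must_wait(struct model_mm *mm)
{
	return atomic_load(&mm->tlb_flush_pending) != 0;
}
```

With two overlapping flushers, only the second `model_dec_pending()` call reports that waiters may proceed, which mirrors why the sketch above wakes waiters only when the counter reaches zero.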
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Andrea Arcangeli @ 2021-01-04 19:24 UTC
To: Peter Zijlstra
Cc: Nadav Amit, linux-mm, linux-kernel, Nadav Amit, Yu Zhao, Andy Lutomirski,
    Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim,
    Will Deacon, Mel Gorman

Hello,

On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>
> > The scenario that happens in selftests/vm/userfaultfd is as follows:
> >
> > cpu0                            cpu1                    cpu2
> > ----                            ----                    ----
> >                                                         [ Writable PTE
> >                                                           cached in TLB ]
> > userfaultfd_writeprotect()
> > [ write-*unprotect* ]
> > mwriteprotect_range()
> > mmap_read_lock()
> > change_protection()
> >
> > change_protection_range()
> > ...
> > change_pte_range()
> > [ *clear* “write”-bit ]
> > [ defer TLB flushes ]
> >                                 [ page-fault ]
> >                                 ...
> >                                 wp_page_copy()
> >                                  cow_user_page()
> >                                   [ copy page ]
> >                                                         [ write to old
> >                                                           page ]
> >                                 ...
> >                                  set_pte_at_notify()
>
> Yuck!
>

Note, the above was posted before we figured out the details so it
wasn't showing the real deferred tlb flush that caused problems (the
one shown on the left causes zero issues).

The problematic one not pictured is the one of the wrprotect that has
to be running in another CPU which also isn't pictured above. More
accurate traces are posted later in the thread.

> Isn't this all rather similar to the problem that resulted in the
> tlb_flush_pending mess?
>
> I still think that's all fundamentally buggered, the much saner solution
> (IMO) would've been to make things wait for the pending flush, instead

How do you intend to wait in PT lock while the writer also has to take
PT lock repeatedly before it can do wake_up_var?

If you release the PT lock before calling wait_tlb_flush_pending it
all falls apart again.

This I guess explains why a local pte/hugepmd smp local invlpg is the
only working solution for this issue, similarly to how it's done in rmap.

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 07d9acb5b19c..0210547ac424 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
>  	 *
>  	 * Therefore we must rely on tlb_flush_*() to guarantee order.
>  	 */
> -	atomic_dec(&mm->tlb_flush_pending);
> +	if (atomic_dec_and_test(&mm->tlb_flush_pending))
> +		wake_up_var(&mm->tlb_flush_pending);
>  }
>
>  static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
> @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>  }
>
> +static inline void wait_tlb_flush_pending(struct mm_struct *mm)
> +{
> +	wait_var_event(&mm->tlb_flush_pending,
> +		       atomic_read(&mm->tlb_flush_pending) == 0);
> +}

I appreciate the effort in not regressing soft dirty and uffd-wp
writeprotect to disk-I/O spindle bandwidth and not using mmap_sem for
writing.

At the same time what was posted so far wasn't clean enough but it
wasn't even tested... if we abstract it in some clean way and we mark
all connected points (soft dirty, uffd-wp, the wrprotect page fault),
then I can be optimistic it will remain understandable when we look at
it again a few years down the road.

Or at the very least it can't get worse than the "tlb_flush_pending
mess" you mentioned above.

flush_tlb_batched_pending() has to be orthogonally re-reviewed for
those things Nadav pointed out. But I'd rather keep that review in a
separate thread since any bug in that code has zero connection to this
issue.

The basic idea is similar but the methods and logic are different and
our flush here will be granular and it's going to be only run if
VM_SOFTDIRTY isn't set and soft dirty is compiled in, or if VM_UFFD_WP
is set. The flush_tlb_batched_pending is mm wide, unconditional etc..
Pretty much all different.

Thanks,
Andrea
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Nadav Amit @ 2021-01-04 19:35 UTC
To: Andrea Arcangeli
Cc: Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu,
    Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon,
    Mel Gorman

> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello,
>
> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>
>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>
>>> cpu0                            cpu1                    cpu2
>>> ----                            ----                    ----
>>>                                                         [ Writable PTE
>>>                                                           cached in TLB ]
>>> userfaultfd_writeprotect()
>>> [ write-*unprotect* ]
>>> mwriteprotect_range()
>>> mmap_read_lock()
>>> change_protection()
>>>
>>> change_protection_range()
>>> ...
>>> change_pte_range()
>>> [ *clear* “write”-bit ]
>>> [ defer TLB flushes ]
>>>                                 [ page-fault ]
>>>                                 ...
>>>                                 wp_page_copy()
>>>                                  cow_user_page()
>>>                                   [ copy page ]
>>>                                                         [ write to old
>>>                                                           page ]
>>>                                 ...
>>>                                  set_pte_at_notify()
>>
>> Yuck!
>
> Note, the above was posted before we figured out the details so it
> wasn't showing the real deferred tlb flush that caused problems (the
> one shown on the left causes zero issues).

Actually it was posted after (note that this is v2). The aforementioned
scenario that Peter refers to is the one that I actually encountered (not
the second scenario that is “theoretical”). This scenario that Peter refers
to is indeed more “stupid” in the sense that we should just not write-protect
the PTE on userfaultfd write-unprotect.

Let me know if I made any mistake in the description.

> The problematic one not pictured is the one of the wrprotect that has
> to be running in another CPU which also isn't pictured above. More
> accurate traces are posted later in the thread.

I think I included this scenario as well in the commit log (of v2). Let me
know if I screwed up and the description is not clear.
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Andrea Arcangeli @ 2021-01-04 20:19 UTC
To: Nadav Amit
Cc: Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu,
    Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon,
    Mel Gorman

On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Hello,
> >
> > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >>
> >>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>>
> >>> cpu0                            cpu1                    cpu2
> >>> ----                            ----                    ----
> >>>                                                         [ Writable PTE
> >>>                                                           cached in TLB ]
> >>> userfaultfd_writeprotect()
> >>> [ write-*unprotect* ]
> >>> mwriteprotect_range()
> >>> mmap_read_lock()
> >>> change_protection()
> >>>
> >>> change_protection_range()
> >>> ...
> >>> change_pte_range()
> >>> [ *clear* “write”-bit ]
> >>> [ defer TLB flushes ]
> >>>                                 [ page-fault ]
> >>>                                 ...
> >>>                                 wp_page_copy()
> >>>                                  cow_user_page()
> >>>                                   [ copy page ]
> >>>                                                         [ write to old
> >>>                                                           page ]
> >>>                                 ...
> >>>                                  set_pte_at_notify()
> >>
> >> Yuck!
> >
> > Note, the above was posted before we figured out the details so it
> > wasn't showing the real deferred tlb flush that caused problems (the
> > one shown on the left causes zero issues).
>
> Actually it was posted after (note that this is v2). The aforementioned
> scenario that Peter refers to is the one that I actually encountered (not
> the second scenario that is “theoretical”). This scenario that Peter refers
> to is indeed more “stupid” in the sense that we should just not write-protect
> the PTE on userfaultfd write-unprotect.
>
> Let me know if I made any mistake in the description.

I didn't say there is a mistake. I said it is not showing the real
deferred tlb flush that causes problems.

The issue here is that we have a "defer tlb flush" that runs after
"write to old page".

If you look at the above, you're induced to think the "defer tlb
flush" that causes issues is the one in cpu0. It's not. That is
totally harmless.

>
> > The problematic one not pictured is the one of the wrprotect that has
> > to be running in another CPU which also isn't pictured above. More
> > accurate traces are posted later in the thread.
>
> I think I included this scenario as well in the commit log (of v2). Let me
> know if I screwed up and the description is not clear.

Instead of not showing the real "defer tlb flush" in the trace and
then fixing it up in the comment, why don't you take the trace showing
the real problematic "defer tlb flush"? No need to reinvent it.

https://lkml.kernel.org/r/X+JJqK91plkBVisG@redhat.com

See here the detail underlined:

deferred tlb flush <- too late
XXXXXXXXXXXXXX BUG RACE window close here

This shows the real deferred tlb flush; your v2 does not include it.
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Nadav Amit @ 2021-01-04 20:39 UTC
To: Andrea Arcangeli
Cc: Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu,
    Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon,
    Mel Gorman

> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>>
>>> Hello,
>>>
>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>>
>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
>>>>>
>>>>> cpu0                            cpu1                    cpu2
>>>>> ----                            ----                    ----
>>>>>                                                         [ Writable PTE
>>>>>                                                           cached in TLB ]
>>>>> userfaultfd_writeprotect()
>>>>> [ write-*unprotect* ]
>>>>> mwriteprotect_range()
>>>>> mmap_read_lock()
>>>>> change_protection()
>>>>>
>>>>> change_protection_range()
>>>>> ...
>>>>> change_pte_range()
>>>>> [ *clear* “write”-bit ]
>>>>> [ defer TLB flushes ]
>>>>>                                 [ page-fault ]
>>>>>                                 ...
>>>>>                                 wp_page_copy()
>>>>>                                  cow_user_page()
>>>>>                                   [ copy page ]
>>>>>                                                         [ write to old
>>>>>                                                           page ]
>>>>>                                 ...
>>>>>                                  set_pte_at_notify()
>>>>
>>>> Yuck!
>>>
>>> Note, the above was posted before we figured out the details so it
>>> wasn't showing the real deferred tlb flush that caused problems (the
>>> one shown on the left causes zero issues).
>>
>> Actually it was posted after (note that this is v2). The aforementioned
>> scenario that Peter refers to is the one that I actually encountered (not
>> the second scenario that is “theoretical”). This scenario that Peter refers
>> to is indeed more “stupid” in the sense that we should just not write-protect
>> the PTE on userfaultfd write-unprotect.
>>
>> Let me know if I made any mistake in the description.
>
> I didn't say there is a mistake. I said it is not showing the real
> deferred tlb flush that causes problems.
>
> The issue here is that we have a "defer tlb flush" that runs after
> "write to old page".
>
> If you look at the above, you're induced to think the "defer tlb
> flush" that causes issues is the one in cpu0. It's not. That is
> totally harmless.

I do not understand what you say. The deferred TLB flush on cpu0 *is* the
one that causes the problem. The PTE is write-protected (although it is
a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
the page-fault (and copy), while cpu2 keeps writing to the source page. If
cpu0 did not defer the TLB flush, this problem would not happen.

>>> The problematic one not pictured is the one of the wrprotect that has
>>> to be running in another CPU which also isn't pictured above. More
>>> accurate traces are posted later in the thread.
>>
>> I think I included this scenario as well in the commit log (of v2). Let me
>> know if I screwed up and the description is not clear.
>
> Instead of not showing the real "defer tlb flush" in the trace and
> then fixing it up in the comment, why don't you take the trace showing
> the real problematic "defer tlb flush"? No need to reinvent it.

The scenario you mention is indeed identical to the second scenario I
mention in the commit log. I think the version I included is clearer since
it shows the write that triggers the corruption instead of discussing
“windows”, which might be less clear. Running copy_user_page() with stale
TLB is by itself not a bug if you detect it later (e.g., using pte_same()).

Note that my second scenario is also consistent in style with the first
scenario.

I am not married to my description and if you (and others) insist I would
copy-paste your version.

> This shows the real deferred tlb flush; your v2 does not include it.

Are you talking about the first scenario (write-unprotect), the second one
(write-protect followed by write-unprotect), both? It seems to me that all
the deferred TLB flushes are mentioned at the point they are deferred. I can
add the point in which they are performed.
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
From: Andrea Arcangeli @ 2021-01-04 21:01 UTC
To: Nadav Amit
Cc: Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu,
    Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon,
    Mel Gorman

On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
> > On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
> >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
> >>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
> >>>>
> >>>>> The scenario that happens in selftests/vm/userfaultfd is as follows:
> >>>>>
> >>>>> cpu0                            cpu1                    cpu2
> >>>>> ----                            ----                    ----
> >>>>>                                                         [ Writable PTE
> >>>>>                                                           cached in TLB ]
> >>>>> userfaultfd_writeprotect()
> >>>>> [ write-*unprotect* ]
> >>>>> mwriteprotect_range()
> >>>>> mmap_read_lock()
> >>>>> change_protection()
> >>>>>
> >>>>> change_protection_range()
> >>>>> ...
> >>>>> change_pte_range()
> >>>>> [ *clear* “write”-bit ]
> >>>>> [ defer TLB flushes ]
> >>>>>                                 [ page-fault ]
> >>>>>                                 ...
> >>>>>                                 wp_page_copy()
> >>>>>                                  cow_user_page()
> >>>>>                                   [ copy page ]
> >>>>>                                                         [ write to old
> >>>>>                                                           page ]
> >>>>>                                 ...
> >>>>>                                  set_pte_at_notify()
> >>>>
> >>>> Yuck!
> >>>
> >>> Note, the above was posted before we figured out the details so it
> >>> wasn't showing the real deferred tlb flush that caused problems (the
> >>> one shown on the left causes zero issues).
> >>
> >> Actually it was posted after (note that this is v2). The aforementioned
> >> scenario that Peter refers to is the one that I actually encountered (not
> >> the second scenario that is “theoretical”). This scenario that Peter refers
> >> to is indeed more “stupid” in the sense that we should just not write-protect
> >> the PTE on userfaultfd write-unprotect.
> >>
> >> Let me know if I made any mistake in the description.
> >
> > I didn't say there is a mistake. I said it is not showing the real
> > deferred tlb flush that causes problems.
> >
> > The issue here is that we have a "defer tlb flush" that runs after
> > "write to old page".
> >
> > If you look at the above, you're induced to think the "defer tlb
> > flush" that causes issues is the one in cpu0. It's not. That is
> > totally harmless.
>
> I do not understand what you say. The deferred TLB flush on cpu0 *is* the
> one that causes the problem. The PTE is write-protected (although it is
> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle
> the page-fault (and copy), while cpu2 keeps writing to the source page. If
> cpu0 did not defer the TLB flush, this problem would not happen.

Your argument "If cpu0 did not defer the TLB flush, this problem would
not happen" is identical to "if the cpu0 has a small TLB size and the
tlb entry is recycled, the problem would not happen".

There are a multitude of factors that are unrelated to the real
problematic deferred tlb flush that may happen and still won't cause
the issue, including a parallel IPI.

The point is that we don't need to worry about the "defer TLB flushes"
of the un-wrprotect, because you said earlier that deferring tlb
flushes when you're doing "permission promotions" does not cause
problems.

The only "deferred tlb flush" we need to worry about, not in the
picture, is the one following the actual permission removal (the
wrprotection).

> it shows the write that triggers the corruption instead of discussing
> “windows”, which might be less clear. Running copy_user_page() with stale

I think showing exactly where the race window opens is key to
understanding the issue, but then that's the way I work and feel free
to think of it in any other way that may sound simpler.

I was just worried people would think the deferred tlb flush in your v2
trace is the one that causes the problem when obviously it's not since
it follows a permission promotion. Once that is clear, feel free to
reject my trace.

All I care about is that performance doesn't regress from CPU-speed to
disk I/O spindle speed, for soft dirty and uffd-wp.

> I am not married to my description and if you (and others) insist I would
> copy-paste your version.

I definitely don't insist, I only wanted to clarify in case it may not
have been clear that the problematic deferred tlb flush wasn't part of
your trace.

> Are you talking about the first scenario (write-unprotect), the second one
> (write-protect followed by write-unprotect), both? It seems to me that all
> the deferred TLB flushes are mentioned at the point they are deferred. I can
> add the point in which they are performed.

The only case that has an issue for uffd-wp is in my trace and only
ever happens if there's a wrprotect in flight, the deferred tlb flush
of the wrprotect is deferred (and that's the problematic one that
closes the window when it finally runs) and un-wrprotect runs. The
window opens when the un-wrprotect unlocks the PT lock.

The deferred tlb flush of un-wrprotect is as relevant for this race,
as random tlb flushes from IPI or the TLB being small or none.

Thanks,
Andrea
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-04 21:01 ` Andrea Arcangeli @ 2021-01-04 21:26 ` Nadav Amit 2021-01-05 18:45 ` Andrea Arcangeli 0 siblings, 1 reply; 96+ messages in thread From: Nadav Amit @ 2021-01-04 21:26 UTC (permalink / raw) To: Andrea Arcangeli Cc: Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman > On Jan 4, 2021, at 1:01 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote: >>> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: >>> >>> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote: >>>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: >>>>> >>>>> Hello, >>>>> >>>>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >>>>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >>>>>> >>>>>>> The scenario that happens in selftests/vm/userfaultfd is as follows: >>>>>>> >>>>>>> cpu0 cpu1 cpu2 >>>>>>> ---- ---- ---- >>>>>>> [ Writable PTE >>>>>>> cached in TLB ] >>>>>>> userfaultfd_writeprotect() >>>>>>> [ write-*unprotect* ] >>>>>>> mwriteprotect_range() >>>>>>> mmap_read_lock() >>>>>>> change_protection() >>>>>>> >>>>>>> change_protection_range() >>>>>>> ... >>>>>>> change_pte_range() >>>>>>> [ *clear* “write”-bit ] >>>>>>> [ defer TLB flushes ] >>>>>>> [ page-fault ] >>>>>>> ... >>>>>>> wp_page_copy() >>>>>>> cow_user_page() >>>>>>> [ copy page ] >>>>>>> [ write to old >>>>>>> page ] >>>>>>> ... >>>>>>> set_pte_at_notify() >>>>>> >>>>>> Yuck! >>>>> >>>>> Note, the above was posted before we figured out the details so it >>>>> wasn't showing the real deferred tlb flush that caused problems (the >>>>> one showed on the left causes zero issues). >>>> >>>> Actually it was posted after (note that this is v2). 
The aforementioned >>>> scenario that Peter regards to is the one that I actually encountered (not >>>> the second scenario that is “theoretical”). This scenario that Peter regards >>>> is indeed more “stupid” in the sense that we should just not write-protect >>>> the PTE on userfaultfd write-unprotect. >>>> >>>> Let me know if I made any mistake in the description. >>> >>> I didn't say there is a mistake. I said it is not showing the real >>> deferred tlb flush that cause problems. >>> >>> The issue here is that we have a "defer tlb flush" that runs after >>> "write to old page". >>> >>> If you look at the above, you're induced to think the "defer tlb >>> flush" that causes issues is the one in cpu0. It's not. That is >>> totally harmless. >> >> I do not understand what you say. The deferred TLB flush on cpu0 *is* the >> the one that causes the problem. The PTE is write-protected (although it is >> a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle >> the page-fault (and copy), while cpu2 keeps writing to the source page. If >> cpu0 did not defer the TLB flush, this problem would not happen. > > Your argument "If cpu0 did not defer the TLB flush, this problem would > not happen" is identical to "if the cpu0 has a small TLB size and the > tlb entry is recycled, the problem would not happen". > > There are a multitude of factors that are unrelated to the real > problematic deferred tlb flush that may happen and still won't cause > the issue, including a parallel IPI. > > The point is that we don't need to worry about the "defer TLB flushes" > of the un-wrprotect, because you said earlier that deferring tlb > flushes when you're doing "permission promotions" does not cause > problems. > > The only "deferred tlb flush" we need to worry about, not in the > picture, is the one following the actual permission removal (the > wrprotection). I think you are missing the point of this scenario, which is different than the second scenario. 
In this scenario, change_pte_range(), when called to do userfaultfd’s
*unprotect* operation, did not preserve the write-bit if it was already set.
Instead change_pte_range() *cleared* the write-bit. So upon a logical
permission promotion operation - userfaultfd *unprotect* - you got a
physical permission demotion, turning RW PTEs into RO.

This problem is fully resolved by this part of the patch:

--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || uffd_wp_resolve) &&
+					      pte_write(oldpte);

You can argue that this is not directly related to the deferred TLB flush, as
once this chunk is added, a TLB flush would not be needed at all for
userfaultfd-unprotect. But I consider it a part of the problem, especially
since this is what triggered the userfaultfd self-tests to fail.

>> it shows the write that triggers the corruption instead of discussing
>> “windows”, which might be less clear. Running copy_user_page() with stale
>
> I think showing exactly where the race window opens is key to
> understand the issue, but then that's the way I work and feel free to
> think it in any other way that may sound simpler.
>
> I just worried people thinks the deferred tlb flush in your v2 trace
> is the one that causes problem when obviously it's not since it
> follows a permission promotion. Once that is clear, feel free to
> reject my trace.
>
> All I care about is that performance don't regress from CPU-speed to
> disk I/O spindle speed, for soft dirty and uffd-wp.

I would feel more comfortable if you provide patches for uffd-wp. If you
want, I will do it, but I restate that I do not feel comfortable with this
solution (worried as it seems a bit ad-hoc and might leave out a scenario
we all missed or cause a TLB shootdown storm).

As for soft-dirty, I thought that you said that you do not see a better
(backportable) solution for soft-dirty. Correct me if I am wrong.

Anyhow, I will add your comments regarding the stale TLB window to make the
description clearer.

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-04 21:26 ` Nadav Amit @ 2021-01-05 18:45 ` Andrea Arcangeli 2021-01-05 19:05 ` Nadav Amit 0 siblings, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-05 18:45 UTC (permalink / raw) To: Nadav Amit Cc: Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote: > I would feel more comfortable if you provide patches for uffd-wp. If you > want, I will do it, but I restate that I do not feel comfortable with this > solution (worried as it seems a bit ad-hoc and might leave out a scenario > we all missed or cause a TLB shootdown storm). > > As for soft-dirty, I thought that you said that you do not see a better > (backportable) solution for soft-dirty. Correct me if I am wrong. I think they should use the same technique, since they deal with the exact same challenge. I will try to cleanup the patch in the meantime. I can also try to do the additional cleanups to clear_refs to eliminate the tlb_gather completely since it doesn't gather any page and it has no point in using it. > Anyhow, I will add your comments regarding the stale TLB window to make the > description clearer. Having the mmap_write_lock solution as backup won't hurt, but I think it's only for planB if planA doesn't work and the only stable tree that will have to apply this is v5.9.x. All previous don't need any change in this respect. So there's no worry of rejects. It worked by luck until Aug 2020, but it did so reliably or somebody would have noticed already. And it's not exploitable either, it just works stable, but it was prone to break if the kernel changed in some other way, and it eventually changed in Aug 2020 when an unrelated patch happened to the reuse logic. 
If you want to maintain the mmap_write_lock patch if you could drop the preserved_write and adjust the Fixes to target Aug 2020 it'd be ideal. The uffd-wp needs a different optimization that maybe Peter is already working on or I can include in the patchset for this, but definitely in a separate commit because it's orthogonal. It's great you noticed the W->RO transition of un-wprotect so we can optimize that too (it will have a positive runtime effect, it's not just theoretical since it's normal to unwrprotect a huge range once the postcopy snapshotting of the virtual machine is complete), I was thinking at the previous case discussed in the other thread. I just don't like to slow down a feature required in the future for implementing postcopy live snapshotting or other snapshots to userland processes (for the non-KVM case, also unprivileged by default if using bounce buffers to feed the syscalls) that can be used by open source hypervisors to beat proprietary hypervisors like vmware. The security concern of uffd-wp that allows to enlarge the window of use-after-free kernel bugs, is not as a concern as it is for regular processes. First the jailer model can obtain the uffd before dropping all caps and before firing up seccomp in the child, so it won't even require to lift the unprivileged_userfaultfd in the superior and cleaner monolithic jailer model. If the uffd and uffd-wp can only run in rust-vmm and qemu, that userland is system software to be trusted as the kernel from the guest point of view. It's similar to fuse, if somebody gets into the fuse process it can also stop the kernel initiated faults. From that respect fuse is also system software despite it runs in userland. In other words I think if there's a vm-escape that takes control of rust-vmm userland, the last worry is the fact it can stop kernel initiated page faults because the jailer took an uffd before drop privs. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 18:45 ` Andrea Arcangeli @ 2021-01-05 19:05 ` Nadav Amit 2021-01-05 19:45 ` Andrea Arcangeli 0 siblings, 1 reply; 96+ messages in thread From: Nadav Amit @ 2021-01-05 19:05 UTC (permalink / raw) To: Andrea Arcangeli, Peter Xu Cc: Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman > On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote: >> I would feel more comfortable if you provide patches for uffd-wp. If you >> want, I will do it, but I restate that I do not feel comfortable with this >> solution (worried as it seems a bit ad-hoc and might leave out a scenario >> we all missed or cause a TLB shootdown storm). >> >> As for soft-dirty, I thought that you said that you do not see a better >> (backportable) solution for soft-dirty. Correct me if I am wrong. > > I think they should use the same technique, since they deal with the > exact same challenge. I will try to cleanup the patch in the meantime. > > I can also try to do the additional cleanups to clear_refs to > eliminate the tlb_gather completely since it doesn't gather any page > and it has no point in using it. > >> Anyhow, I will add your comments regarding the stale TLB window to make the >> description clearer. > > Having the mmap_write_lock solution as backup won't hurt, but I think > it's only for planB if planA doesn't work and the only stable tree > that will have to apply this is v5.9.x. All previous don't need any > change in this respect. So there's no worry of rejects. > > It worked by luck until Aug 2020, but it did so reliably or somebody > would have noticed already. 
And it's not exploitable either, it just > works stable, but it was prone to break if the kernel changed in some > other way, and it eventually changed in Aug 2020 when an unrelated > patch happened to the reuse logic. > > If you want to maintain the mmap_write_lock patch if you could drop > the preserved_write and adjust the Fixes to target Aug 2020 it'd be > ideal. The uffd-wp needs a different optimization that maybe Peter is > already working on or I can include in the patchset for this, but > definitely in a separate commit because it's orthogonal. > > It's great you noticed the W->RO transition of un-wprotect so we can > optimize that too (it will have a positive runtime effect, it's not > just theoretical since it's normal to unwrprotect a huge range once > the postcopy snapshotting of the virtual machine is complete), I was > thinking at the previous case discussed in the other thread. Understood. I will separate it to a different patch and use your version. I am sorry that I missed Peter Xu feedback for that. As I understand that this will not be backported, I will see if I can get rid of the TLB flush and the inc_tlb_flush_pending() for write-unprotect case as well (which I think I mentioned before). > > I just don't like to slow down a feature required in the future for > implementing postcopy live snapshotting or other snapshots to userland > processes (for the non-KVM case, also unprivileged by default if using > bounce buffers to feed the syscalls) that can be used by open source > hypervisors to beat proprietary hypervisors like vmware. Ouch, that’s uncalled for. I am sure that you understand that I have no hidden agenda and we all have the same goal. Regards, Nadav ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 19:05 ` Nadav Amit @ 2021-01-05 19:45 ` Andrea Arcangeli 2021-01-05 20:06 ` Nadav Amit 0 siblings, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-05 19:45 UTC (permalink / raw) To: Nadav Amit Cc: Peter Xu, Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote: > > On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > I just don't like to slow down a feature required in the future for > > implementing postcopy live snapshotting or other snapshots to userland > > processes (for the non-KVM case, also unprivileged by default if using > > bounce buffers to feed the syscalls) that can be used by open source > > hypervisors to beat proprietary hypervisors like vmware. > > Ouch, that’s uncalled for. I am sure that you understand that I have no > hidden agenda and we all have the same goal. Ehm I never said you had an hidden agenda, so I'm not exactly why you're accusing me of something I never said. I merely pointed out one relevant justification for increasing kernel complexity here by not slowing down clear_refs longstanding O(N) clear_refs/softdirty feature and the recent uffd-wp O(1) feature, is to be more competitive with proprietary software solutions, since at least for uffd-wp, postcopy live snapshotting that the #1 use case. I never questioned your contribution or your preference, to be even more explicit, it never crossed my mind that you have an hidden agenda. 
However since everyone already acked your patches and I'm not ok with your patches because they will give a hit to KVM postcopy live snapshotting and other container clear_refs users, I have to justify why I NAK your patches and remaining competitive with proprietary hypervisors is one of them, so I don't see what is wrong to state a tangible end goal here. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 19:45 ` Andrea Arcangeli @ 2021-01-05 20:06 ` Nadav Amit 2021-01-05 21:06 ` Andrea Arcangeli 0 siblings, 1 reply; 96+ messages in thread From: Nadav Amit @ 2021-01-05 20:06 UTC (permalink / raw) To: Andrea Arcangeli Cc: Peter Xu, Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman > On Jan 5, 2021, at 11:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote: >>> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: >>> I just don't like to slow down a feature required in the future for >>> implementing postcopy live snapshotting or other snapshots to userland >>> processes (for the non-KVM case, also unprivileged by default if using >>> bounce buffers to feed the syscalls) that can be used by open source >>> hypervisors to beat proprietary hypervisors like vmware. >> >> Ouch, that’s uncalled for. I am sure that you understand that I have no >> hidden agenda and we all have the same goal. > > Ehm I never said you had an hidden agenda, so I'm not exactly why > you're accusing me of something I never said. > > I merely pointed out one relevant justification for increasing kernel > complexity here by not slowing down clear_refs longstanding O(N) > clear_refs/softdirty feature and the recent uffd-wp O(1) feature, is > to be more competitive with proprietary software solutions, since > at least for uffd-wp, postcopy live snapshotting that the #1 use > case. > > I never questioned your contribution or your preference, to be even > more explicit, it never crossed my mind that you have an hidden > agenda. 
> > However since everyone already acked your patches and I'm not ok with > your patches because they will give a hit to KVM postcopy live > snapshotting and other container clear_refs users, I have to justify > why I NAK your patches and remaining competitive with proprietary > hypervisors is one of them, so I don't see what is wrong to state a > tangible end goal here. I fully understand your objection to my patches and it is a valid objection, which I will address. I just thought that there might be some insinuation, as you mentioned VMware by name. My response was half-jokingly and should have had a smiley to prevent you from wasting your time on the explanation. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 20:06 ` Nadav Amit @ 2021-01-05 21:06 ` Andrea Arcangeli 2021-01-05 21:43 ` Peter Xu 0 siblings, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-05 21:06 UTC (permalink / raw) To: Nadav Amit Cc: Peter Xu, Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman On Tue, Jan 05, 2021 at 08:06:22PM +0000, Nadav Amit wrote: > I just thought that there might be some insinuation, as you mentioned VMware > by name. My response was half-jokingly and should have had a smiley to > prevent you from wasting your time on the explanation. No problem, actually I appreciate you pointed out to give me the extra opportunity to further clarify I wasn't implying anything like that, sorry again for any confusion I may have generated. I mentioned vmware because I'd be shocked if for the whole duration of the wrprotect on the guest physical memory it'd have to halt all minor faults and all memory freeing like it would happen to rust-vmm/qemu if we take the mmap_write_lock, that's all. Or am I wrong about this? For uffd-wp avoiding the mmap_write_lock isn't an immediate concern (obviously so in the rust-vmm case which won't even do postcopy live migration), but the above concern applies for the long term and maybe mid term for qemu. The postcopy live snapshoitting was the #1 use case so it's hard not to mention it, but there's still other interesting userland use cases of uffd-wp with various users already testing it in their apps, that may ultimately become more prevalent, who knows. The point is that those that will experiment with uffd-wp will run a benchmark, post a blog, others will see the blog, they will test too in their app and post their blog. It needs to deliver the full acceleration immediately, otherwise the evaluation may show it as a fail or not worth it. 
In theory we could just say, we'll optimize it later if significant userbase emerge, but in my view it's bit of a chicken egg problem and I'm afraid that such theory may not work well in practice. Still, for the initial fix, avoiding the mmap_write_lock seems more important actually for clear_refs than for uffd-wp. uffd-wp is somewhat lucky and will just share any solution to keep clear_refs scalable, since the issue is identical. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
  2021-01-05 21:06 ` Andrea Arcangeli
@ 2021-01-05 21:43 ` Peter Xu
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Xu @ 2021-01-05 21:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nadav Amit, Peter Zijlstra, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman

On Tue, Jan 05, 2021 at 04:06:27PM -0500, Andrea Arcangeli wrote:
> The postcopy live snapshoitting was the #1 use case so it's hard not
> to mention it, but there's still other interesting userland use cases
> of uffd-wp with various users already testing it in their apps, that
> may ultimately become more prevalent, who knows.

That's true. AFAIU umap [1] already uses uffd-wp for its computations. I
didn't really measure how far it can go, but currently the library is highly
concurrent; for example, there are quite a few macros that can tune the
parallelism of the library [2]:

  UMAP_PAGE_FILLERS

    This is the number of worker threads that will perform read operations
    from the backing store (including read-ahead) for a specific umap region.

  UMAP_PAGE_EVICTORS

    This is the number of worker threads that will perform evictions of
    pages. Eviction includes writing to the backing store if the page is
    dirty and telling the operating system that the page is no longer needed.

The write lock means at least all the evictor threads will be serialized,
which immediately makes UMAP_PAGE_EVICTORS meaningless... not to mention all
the rest of the read lock takers (filler threads, worker threads, etc.). So
if it happens, I bet LLNL will suddenly observe a drastic drop after
upgrading the kernel..

I don't know why umap didn't hit the tlb issue already. It seems to me that
the issue may only trigger with COW right after a stale tlb entry, so COW is
the only one affected (or, is it?), while umap may happen not to use COW that
much. But I could be completely wrong on that.

[1] https://github.com/LLNL/umap
[2] https://llnl-umap.readthedocs.io/en/develop/environment_variables.html

--
Peter Xu

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-04 19:24 ` Andrea Arcangeli 2021-01-04 19:35 ` Nadav Amit @ 2021-01-05 8:13 ` Peter Zijlstra 2021-01-05 8:52 ` Nadav Amit 2021-01-05 8:58 ` Peter Zijlstra 2 siblings, 1 reply; 96+ messages in thread From: Peter Zijlstra @ 2021-01-05 8:13 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nadav Amit, linux-mm, linux-kernel, Nadav Amit, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: > The problematic one not pictured is the one of the wrprotect that has > to be running in another CPU which is also isn't picture above. More > accurate traces are posted later in the thread. What thread? I don't seem to have discovered it yet, and the cover letter from Nadav doesn't seem to have a msgid linking it either. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 8:13 ` Peter Zijlstra @ 2021-01-05 8:52 ` Nadav Amit 2021-01-05 14:26 ` Peter Zijlstra 0 siblings, 1 reply; 96+ messages in thread From: Nadav Amit @ 2021-01-05 8:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrea Arcangeli, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman > On Jan 5, 2021, at 12:13 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: >> The problematic one not pictured is the one of the wrprotect that has >> to be running in another CPU which is also isn't picture above. More >> accurate traces are posted later in the thread. > > What thread? I don't seem to have discovered it yet, and the cover > letter from Nadav doesn't seem to have a msgid linking it either. Sorry for that: https://lore.kernel.org/lkml/X+K7JMrTEC9SpVIB@google.com/T/ ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
  2021-01-05 8:52 ` Nadav Amit
@ 2021-01-05 14:26 ` Peter Zijlstra
  0 siblings, 0 replies; 96+ messages in thread
From: Peter Zijlstra @ 2021-01-05 14:26 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrea Arcangeli, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman

On Tue, Jan 05, 2021 at 12:52:48AM -0800, Nadav Amit wrote:
> > On Jan 5, 2021, at 12:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
> >> The problematic one not pictured is the one of the wrprotect that has
> >> to be running in another CPU which is also isn't picture above. More
> >> accurate traces are posted later in the thread.
> >
> > What thread? I don't seem to have discovered it yet, and the cover
> > letter from Nadav doesn't seem to have a msgid linking it either.
>
> Sorry for that:
>
> https://lore.kernel.org/lkml/X+K7JMrTEC9SpVIB@google.com/T/

Much reading later..

OK, go with the write-lock for now.

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-04 19:24 ` Andrea Arcangeli 2021-01-04 19:35 ` Nadav Amit 2021-01-05 8:13 ` Peter Zijlstra @ 2021-01-05 8:58 ` Peter Zijlstra 2021-01-05 9:22 ` Nadav Amit 2021-01-05 17:58 ` Andrea Arcangeli 2 siblings, 2 replies; 96+ messages in thread From: Peter Zijlstra @ 2021-01-05 8:58 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nadav Amit, linux-mm, linux-kernel, Nadav Amit, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > > > The scenario that happens in selftests/vm/userfaultfd is as follows: > > > > > > cpu0 cpu1 cpu2 > > > ---- ---- ---- > > > [ Writable PTE > > > cached in TLB ] > > > userfaultfd_writeprotect() > > > [ write-*unprotect* ] > > > mwriteprotect_range() > > > mmap_read_lock() > > > change_protection() > > > > > > change_protection_range() > > > ... > > > change_pte_range() > > > [ *clear* “write”-bit ] > > > [ defer TLB flushes ] > > > [ page-fault ] > > > ... > > > wp_page_copy() > > > cow_user_page() > > > [ copy page ] > > > [ write to old > > > page ] > > > ... > > > set_pte_at_notify() > > > > Yuck! > > > > Note, the above was posted before we figured out the details so it > wasn't showing the real deferred tlb flush that caused problems (the > one showed on the left causes zero issues). > > The problematic one not pictured is the one of the wrprotect that has > to be running in another CPU which is also isn't picture above. More > accurate traces are posted later in the thread. Lets assume CPU0 does a read-lock, W -> RO with deferred flush. > > Isn't this all rather similar to the problem that resulted in the > > tlb_flush_pending mess? 
> >
> > I still think that's all fundamentally buggered, the much saner solution
> > (IMO) would've been to make things wait for the pending flush, instead
>
> How do intend you wait in PT lock while the writer also has to take PT
> lock repeatedly before it can do wake_up_var?
>
> If you release the PT lock before calling wait_tlb_flush_pending it
> all falls apart again.

I suppose you can check for pending, if found, release lock, wait for 0,
and re-take the fault?

> This I guess explains why a local pte/hugepmd smp local invlpg is the
> only working solution for this issue, similarly to how it's done in rmap.

In that case a local invalidate on CPU1 simply doesn't help anything.

CPU1 needs to do a global invalidate or wait for the in-progress one to
complete, such that CPU2 is sure to not have a W entry left before CPU1
goes and copies the page.

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 8:58 ` Peter Zijlstra @ 2021-01-05 9:22 ` Nadav Amit 2021-01-05 17:58 ` Andrea Arcangeli 1 sibling, 0 replies; 96+ messages in thread From: Nadav Amit @ 2021-01-05 9:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrea Arcangeli, linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman > On Jan 5, 2021, at 12:58 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: >> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >>> >>>> The scenario that happens in selftests/vm/userfaultfd is as follows: >>>> >>>> cpu0 cpu1 cpu2 >>>> ---- ---- ---- >>>> [ Writable PTE >>>> cached in TLB ] >>>> userfaultfd_writeprotect() >>>> [ write-*unprotect* ] >>>> mwriteprotect_range() >>>> mmap_read_lock() >>>> change_protection() >>>> >>>> change_protection_range() >>>> ... >>>> change_pte_range() >>>> [ *clear* “write”-bit ] >>>> [ defer TLB flushes ] >>>> [ page-fault ] >>>> ... >>>> wp_page_copy() >>>> cow_user_page() >>>> [ copy page ] >>>> [ write to old >>>> page ] >>>> ... >>>> set_pte_at_notify() >>> >>> Yuck! >> >> Note, the above was posted before we figured out the details so it >> wasn't showing the real deferred tlb flush that caused problems (the >> one showed on the left causes zero issues). >> >> The problematic one not pictured is the one of the wrprotect that has >> to be running in another CPU which is also isn't picture above. More >> accurate traces are posted later in the thread. > > Lets assume CPU0 does a read-lock, W -> RO with deferred flush. This is the second scenario that is mentioned in the patch. (The first one is relatively easy to address by not clearing the write-bit). 
>>> Isn't this all rather similar to the problem that resulted in the
>>> tlb_flush_pending mess?
>>>
>>> I still think that's all fundamentally buggered, the much saner solution
>>> (IMO) would've been to make things wait for the pending flush, instead
>>
>> How do intend you wait in PT lock while the writer also has to take PT
>> lock repeatedly before it can do wake_up_var?
>>
>> If you release the PT lock before calling wait_tlb_flush_pending it
>> all falls apart again.
>
> I suppose you can check for pending, if found, release lock, wait for 0,
> and re-take the fault?

My personal take on this issue (which for full disclosure I think Andrea
disagrees with) is that the most important enhancement is to reduce the
number of cases in which we mistakenly think that we must wait for a pending
TLB flush. It will not be free though.

As to the enhancement that you propose: although it seems like a valid
enhancement to me, I think that it is more robust to make forward progress
when possible (as done today). This is especially important if the proposed
enhancement cannot be checked by lockdep.

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 8:58 ` Peter Zijlstra 2021-01-05 9:22 ` Nadav Amit @ 2021-01-05 17:58 ` Andrea Arcangeli 1 sibling, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-05 17:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Nadav Amit, linux-mm, linux-kernel, Nadav Amit, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Mel Gorman On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote: > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: > > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > > > > > The scenario that happens in selftests/vm/userfaultfd is as follows: > > > > > > > > cpu0 cpu1 cpu2 > > > > ---- ---- ---- > > > > [ Writable PTE > > > > cached in TLB ] > > > > userfaultfd_writeprotect() > > > > [ write-*unprotect* ] > > > > mwriteprotect_range() > > > > mmap_read_lock() > > > > change_protection() > > > > > > > > change_protection_range() > > > > ... > > > > change_pte_range() > > > > [ *clear* “write”-bit ] > > > > [ defer TLB flushes ] > > > > [ page-fault ] > > > > ... > > > > wp_page_copy() > > > > cow_user_page() > > > > [ copy page ] > > > > [ write to old > > > > page ] > > > > ... > > > > set_pte_at_notify() > > > > > > Yuck! > > > > > > > Note, the above was posted before we figured out the details so it > > wasn't showing the real deferred tlb flush that caused problems (the > > one showed on the left causes zero issues). > > > > The problematic one not pictured is the one of the wrprotect that has > > to be running in another CPU which is also isn't picture above. More > > accurate traces are posted later in the thread. > > Lets assume CPU0 does a read-lock, W -> RO with deferred flush. 
I was mistaken in saying the deferred tlb flush was not shown in the v2
trace; this just appears to be a new, different case that we didn't
happen to consider before.

In the previous case we discussed earlier, when the un-wrprotect above is
called it never should have been a W->RO, since a wrprotect ran first.

Doesn't it ring a bell that if an un-wrprotect does a W->RO
transition, something is going a bit backwards?

I don't recall from the previous discussion that un-wrprotect was
considered as being called on read-write memory. I think we need the
below change to fix this new case.

                if (uffd_wp) {
+                       if (unlikely(pte_uffd_wp(oldpte)))
+                               continue;
                        ptent = pte_wrprotect(ptent);
                        ptent = pte_mkuffd_wp(ptent);
                } else if (uffd_wp_resolve) {
+                       if (unlikely(!pte_uffd_wp(oldpte)))
+                               continue;
                        /*
                         * Leave the write bit to be handled
                         * by PF interrupt handler, then
                         * things like COW could be properly
                         * handled.
                         */
                        ptent = pte_clear_uffd_wp(ptent);
                }

I now get why the v2 patch touches preserve_write, but this is not
about preserve_write, it's not about leaving the write bit alone. This
is about leaving the whole pte alone if the uffd-wp bit doesn't
actually change.

We shouldn't just defer the tlb flush if un-wrprotect is called on
read-write memory: we should not have flushed the tlb at all in such a
case.

Same for hugepmd in huge_memory.c, which will be somewhere else.

Once the above is optimized, un-wrprotect as in
MM_CP_UFFD_WP_RESOLVE is usually preceded by wrprotect as in
MM_CP_UFFD_WP, and so it'll never be a W->RO but a RO->RO transition
that just clears the uffd_wp flag and nothing else, and whose tlb flush
is in turn irrelevant.

The fix discussed still works for this new case too: I'm not
suggesting we should rely on the above optimization for the tlb
safety. The above is just a missing optimization.

> > > Isn't this all rather similar to the problem that resulted in the
> > > tlb_flush_pending mess?
> > >
> > > I still think that's all fundamentally buggered, the much saner solution
> > > (IMO) would've been to make things wait for the pending flush, instead
> >
> > How do intend you wait in PT lock while the writer also has to take PT
> > lock repeatedly before it can do wake_up_var?
> >
> > If you release the PT lock before calling wait_tlb_flush_pending it
> > all falls apart again.
>
> I suppose you can check for pending, if found, release lock, wait for 0,
> and re-take the fault?

Aborting the page fault unconditionally while MADV_DONTNEED is running
on some other unrelated vma does not sound desirable.

Doing it only for !VM_SOFTDIRTY or soft dirty not compiled in sounds
less bad, but it would still mean that while clear_refs is running, no
thread can write to any anon memory of the process.

> > This I guess explains why a local pte/hugepmd smp local invlpg is the
> > only working solution for this issue, similarly to how it's done in rmap.
>
> In that case a local invalidate on CPU1 simply doesn't help anything.
>
> CPU1 needs to do a global invalidate or wait for the in-progress one to
> complete, such that CPU2 is sure to not have a W entry left before CPU1
> goes and copies the page.

Yes, it was a global invlpg, definitely not a local one, sorry for the
confusion, as in the PoC posted here, which needs cleaning up:

https://lkml.kernel.org/r/X+QLr1WmGXMs33Ld@redhat.com

+	flush_tlb_page(vma, vmf->address);

I think instead of the flush_tlb_page above, we just need an ad-hoc
abstraction there.

The added complexity to the page fault common code consists in having
to call such an abstract call in the right place of the page fault.

The vm_flags to check will be the same for both the flush_tlb_page and
the wait_tlb_pending approaches.

Once the filter on vm_flags passes, the only difference is between
"flush_tlb_page; return void" or "PT unlock; wait_; return
VM_FAULT_RETRY", so it looks more like an implementation detail with a
different tradeoff at runtime.
Thanks,
Andrea
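The two fault-path options compared above ("flush_tlb_page; return void" versus "PT unlock; wait_; return VM_FAULT_RETRY") can be pictured with a small user-space toy model. This is only a sketch under stated assumptions: every `toy_` name is hypothetical, the `flush_pending` flag merely stands in for whatever vm_flags or pending-flush filter the real code would apply, and none of this is kernel code.

```c
#include <assert.h>

/* Hypothetical outcomes of the toy write-protect fault path. */
enum toy_fault_result { TOY_FAULT_HANDLED, TOY_FAULT_RETRY };

/* Hypothetical policies: flush from the fault path, or wait and retry. */
enum toy_policy { TOY_POLICY_FLUSH, TOY_POLICY_WAIT_RETRY };

struct toy_mm {
	int flush_pending;	/* non-zero while a deferred flush is outstanding */
	int flushes_issued;	/* flushes issued by the fault path itself */
};

/*
 * Toy model of the decision taken before copying a page on a
 * write-protect fault. Both policies share the same filter and differ
 * only in what they do once the filter matches.
 */
static enum toy_fault_result toy_wp_fault(struct toy_mm *mm,
					  enum toy_policy policy)
{
	/* Filter did not match: no stale writable entry can exist. */
	if (!mm->flush_pending)
		return TOY_FAULT_HANDLED;

	if (policy == TOY_POLICY_FLUSH) {
		/* Stands in for flush_tlb_page(): pay for a flush, keep going. */
		mm->flushes_issued++;
		return TOY_FAULT_HANDLED;
	}

	/* Stands in for "PT unlock; wait; return VM_FAULT_RETRY". */
	return TOY_FAULT_RETRY;
}
```

In this model the flush policy always makes forward progress at the cost of an extra flush, while the retry policy issues no flush but sends the faulting thread around again until the pending flush has completed.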
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2020-12-25 9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit 2021-01-04 12:22 ` Peter Zijlstra @ 2021-01-05 15:08 ` Peter Xu 2021-01-05 18:08 ` Andrea Arcangeli 2021-01-05 19:07 ` Nadav Amit 1 sibling, 2 replies; 96+ messages in thread From: Peter Xu @ 2021-01-05 15:08 UTC (permalink / raw) To: Nadav Amit Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > diff --git a/mm/mprotect.c b/mm/mprotect.c > index ab709023e9aa..c08c4055b051 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > oldpte = *pte; > if (pte_present(oldpte)) { > pte_t ptent; > - bool preserve_write = prot_numa && pte_write(oldpte); > + bool preserve_write = (prot_numa || uffd_wp_resolve) && > + pte_write(oldpte); Irrelevant of the other tlb issue, this is a standalone one and I commented in v1 about simply ignore the change if necessary; unluckily that seems to be ignored.. so I'll try again - would below be slightly better? if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) continue; Firstly, current patch is confusing at least to me, because "uffd_wp_resolve" means "unprotect the pte", whose write bit should mostly be cleared already when uffd_wp_resolve is applicable. Then "preserve_write" for that pte looks odd already. Meanwhile, if that really happens (when pte write bit set, but during a uffd_wp_resolve request) imho there is really nothing we can do, so we should simply avoid touching that at all, and also avoid ptep_modify_prot_start, pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 15:08 ` Peter Xu @ 2021-01-05 18:08 ` Andrea Arcangeli 2021-01-05 18:41 ` Peter Xu 2021-01-05 19:07 ` Nadav Amit 1 sibling, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-05 18:08 UTC (permalink / raw) To: Peter Xu Cc: Nadav Amit, linux-mm, linux-kernel, Nadav Amit, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote: > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index ab709023e9aa..c08c4055b051 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > > oldpte = *pte; > > if (pte_present(oldpte)) { > > pte_t ptent; > > - bool preserve_write = prot_numa && pte_write(oldpte); > > + bool preserve_write = (prot_numa || uffd_wp_resolve) && > > + pte_write(oldpte); > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > v1 about simply ignore the change if necessary; unluckily that seems to be > ignored.. so I'll try again - would below be slightly better? > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > continue; I posted the exact same code before seeing the above so I take it as a good sign :). I'd suggest to add the reverse check to the uffd_wp too. > Firstly, current patch is confusing at least to me, because "uffd_wp_resolve" > means "unprotect the pte", whose write bit should mostly be cleared already > when uffd_wp_resolve is applicable. Then "preserve_write" for that pte looks > odd already. 
> > Meanwhile, if that really happens (when pte write bit set, but during a > uffd_wp_resolve request) imho there is really nothing we can do, so we should > simply avoid touching that at all, and also avoid ptep_modify_prot_start, > pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost. Agreed. It should not just defer the flush, by doing continue we will not flush anything. So ultimately the above will be an orthogonal optimization, but now I get the why the deferred tlb flush on the cpu0 of the v2 patch was the problematic one. I didn't see we lacked the above optimization and I thought we were discussing still the regular case where un-wrprotect is called on a pte with uffd-wp set. thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 18:08 ` Andrea Arcangeli @ 2021-01-05 18:41 ` Peter Xu 2021-01-05 18:55 ` Andrea Arcangeli 0 siblings, 1 reply; 96+ messages in thread From: Peter Xu @ 2021-01-05 18:41 UTC (permalink / raw) To: Andrea Arcangeli Cc: Nadav Amit, linux-mm, linux-kernel, Nadav Amit, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra On Tue, Jan 05, 2021 at 01:08:48PM -0500, Andrea Arcangeli wrote: > On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote: > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > > index ab709023e9aa..c08c4055b051 100644 > > > --- a/mm/mprotect.c > > > +++ b/mm/mprotect.c > > > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > > > oldpte = *pte; > > > if (pte_present(oldpte)) { > > > pte_t ptent; > > > - bool preserve_write = prot_numa && pte_write(oldpte); > > > + bool preserve_write = (prot_numa || uffd_wp_resolve) && > > > + pte_write(oldpte); > > > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > > v1 about simply ignore the change if necessary; unluckily that seems to be > > ignored.. so I'll try again - would below be slightly better? > > > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > > continue; > > I posted the exact same code before seeing the above so I take it as a good > sign :). I'd suggest to add the reverse check to the uffd_wp too. Agreed. I didn't mention uffd_wp check (which I actually mentioned in the reply to v1 patchset) here only because the uffd_wp check is pure optimization; while the uffd_wp_resolve check is more critical because it is potentially a fix of similar tlb flushing issue where we could have demoted the pte without being noticed, so I think it's indeed more important as Nadav wanted to fix in the same patch. 
It would be even nicer if we had both covered (all of them can be in
unlikely() as Andrea suggested in the other email), then maybe nicer still
as a standalone patch that mentions the difference between the two in the
commit log (mainly, the resolving change will be more than an optimization).

Thanks,

--
Peter Xu
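The two skip checks discussed in this sub-thread (skip a wrprotect when the uffd-wp bit is already set, skip an un-wrprotect when it is not set) can be sketched in user space as follows. This is a minimal toy model, assuming made-up PTE bits and `toy_` names that are not the kernel's; it only illustrates that a skipped PTE is left completely untouched, so a request that changes nothing would have nothing to flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy PTE bits; these are NOT the kernel's real bit layout. */
#define TOY_PTE_WRITE   0x1u
#define TOY_PTE_UFFD_WP 0x2u

typedef unsigned int toy_pte_t;

/*
 * Toy model of the uffd-wp part of change_pte_range().
 * Returns the number of PTEs actually modified; a caller would only
 * need a TLB flush when this is non-zero.
 */
static size_t toy_change_uffd_wp(toy_pte_t *ptes, size_t n,
				 bool uffd_wp, bool uffd_wp_resolve)
{
	size_t changed = 0;

	for (size_t i = 0; i < n; i++) {
		toy_pte_t ptent = ptes[i];

		if (uffd_wp) {
			/* Already uffd write-protected: nothing to do. */
			if (ptent & TOY_PTE_UFFD_WP)
				continue;
			ptent &= ~TOY_PTE_WRITE;
			ptent |= TOY_PTE_UFFD_WP;
		} else if (uffd_wp_resolve) {
			/* Not uffd write-protected: leave the whole PTE alone. */
			if (!(ptent & TOY_PTE_UFFD_WP))
				continue;
			/* The write bit is left for the fault path to restore. */
			ptent &= ~TOY_PTE_UFFD_WP;
		} else {
			continue;
		}

		ptes[i] = ptent;
		changed++;
	}
	return changed;
}
```

With these checks, an un-wrprotect over read-write memory returns 0 and never performs the W->RO transition that the thread identified as the problematic case.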
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 18:41 ` Peter Xu @ 2021-01-05 18:55 ` Andrea Arcangeli 0 siblings, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-05 18:55 UTC (permalink / raw) To: Peter Xu Cc: Nadav Amit, linux-mm, linux-kernel, Nadav Amit, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra On Tue, Jan 05, 2021 at 01:41:34PM -0500, Peter Xu wrote: > Agreed. I didn't mention uffd_wp check (which I actually mentioned in the reply > to v1 patchset) here only because the uffd_wp check is pure optimization; while Agreed it's a pure optimization. Only if we used the group lock to fix this (which we didn't since it wouldn't help clear_refs to avoid the performance regression), the optimization would have become not an optimization anymore. > the uffd_wp_resolve check is more critical because it is potentially a fix of > similar tlb flushing issue where we could have demoted the pte without being > noticed, so I think it's indeed more important as Nadav wanted to fix in the > same patch. I didn't get why that was touched in the same patch, I already suggested to remove that optimization... > It would be even nicer if we have both covered (all of them can be in > unlikely() as Andrea suggested in the other email), then maybe nicer as a > standalone patch, then mention about the difference of the two in the commit > log (majorly, the resolving change will be more than optimization). Yes, if you want to go ahead optimizing both cases of the UFFDIO_WRITEPROTECT, I don't think there's any dependency on this. The huge_memory.c also needs covering but I didn't look at it, hopefully the code will result as clean as in the pte case. I'll try to cleanup the tlb flush in the meantime to see if it look maintainable after the cleanups. 
Then we can change it to the "wait_pending_flush(); return
VM_FAULT_RETRY" model if we want to, or if the IPI is slower; at least
clear_refs will still not block on random pagein or swapin from disk,
and only anon memory write accesses will block while clear_refs runs.

Thanks,
Andrea
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 15:08 ` Peter Xu 2021-01-05 18:08 ` Andrea Arcangeli @ 2021-01-05 19:07 ` Nadav Amit 2021-01-05 19:43 ` Peter Xu 1 sibling, 1 reply; 96+ messages in thread From: Nadav Amit @ 2021-01-05 19:07 UTC (permalink / raw) To: Peter Xu Cc: linux-mm, lkml, Andrea Arcangeli, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra > On Jan 5, 2021, at 7:08 AM, Peter Xu <peterx@redhat.com> wrote: > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >> diff --git a/mm/mprotect.c b/mm/mprotect.c >> index ab709023e9aa..c08c4055b051 100644 >> --- a/mm/mprotect.c >> +++ b/mm/mprotect.c >> @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, >> oldpte = *pte; >> if (pte_present(oldpte)) { >> pte_t ptent; >> - bool preserve_write = prot_numa && pte_write(oldpte); >> + bool preserve_write = (prot_numa || uffd_wp_resolve) && >> + pte_write(oldpte); > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > v1 about simply ignore the change if necessary; unluckily that seems to be > ignored.. so I'll try again - would below be slightly better? > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > continue; > > Firstly, current patch is confusing at least to me, because "uffd_wp_resolve" > means "unprotect the pte", whose write bit should mostly be cleared already > when uffd_wp_resolve is applicable. Then "preserve_write" for that pte looks > odd already. > > Meanwhile, if that really happens (when pte write bit set, but during a > uffd_wp_resolve request) imho there is really nothing we can do, so we should > simply avoid touching that at all, and also avoid ptep_modify_prot_start, > pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost. Sorry for missing your feedback before. What you suggest makes perfect sense. 
* Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect 2021-01-05 19:07 ` Nadav Amit @ 2021-01-05 19:43 ` Peter Xu 0 siblings, 0 replies; 96+ messages in thread From: Peter Xu @ 2021-01-05 19:43 UTC (permalink / raw) To: Nadav Amit Cc: linux-mm, lkml, Andrea Arcangeli, Yu Zhao, Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra On Tue, Jan 05, 2021 at 07:07:51PM +0000, Nadav Amit wrote: > > On Jan 5, 2021, at 7:08 AM, Peter Xu <peterx@redhat.com> wrote: > > > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > >> diff --git a/mm/mprotect.c b/mm/mprotect.c > >> index ab709023e9aa..c08c4055b051 100644 > >> --- a/mm/mprotect.c > >> +++ b/mm/mprotect.c > >> @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, > >> oldpte = *pte; > >> if (pte_present(oldpte)) { > >> pte_t ptent; > >> - bool preserve_write = prot_numa && pte_write(oldpte); > >> + bool preserve_write = (prot_numa || uffd_wp_resolve) && > >> + pte_write(oldpte); > > > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > > v1 about simply ignore the change if necessary; unluckily that seems to be > > ignored.. so I'll try again - would below be slightly better? > > > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > > continue; > > > > Firstly, current patch is confusing at least to me, because "uffd_wp_resolve" > > means "unprotect the pte", whose write bit should mostly be cleared already > > when uffd_wp_resolve is applicable. Then "preserve_write" for that pte looks > > odd already. > > > > Meanwhile, if that really happens (when pte write bit set, but during a > > uffd_wp_resolve request) imho there is really nothing we can do, so we should > > simply avoid touching that at all, and also avoid ptep_modify_prot_start, > > pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost. > > Sorry for missing your feedback before. 
What you suggest makes perfect
> sense.

No problem. I actually appreciate all your great work on these a lot.

The strange thing is that the userfaultfd kselftest always seems to work
fine locally for me (probably another reason that I mostly test uffd-wp
with umapsort), so I won't be able to reproduce some of the issues you (and
Andrea) have encountered. It's great that you unveiled all these hard tlb
problems and nailed them down, so life should be easier for all of us.

Thanks,

--
Peter Xu
* [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup 2020-12-25 9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit 2020-12-25 9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit @ 2020-12-25 9:25 ` Nadav Amit 2021-01-05 15:08 ` Will Deacon 2021-01-05 18:20 ` Andrea Arcangeli 2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu 2 siblings, 2 replies; 96+ messages in thread From: Nadav Amit @ 2020-12-25 9:25 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Nadav Amit, Andrea Arcangeli, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra From: Nadav Amit <namit@vmware.com> Clearing soft-dirty through /proc/[pid]/clear_refs can cause memory corruption as it clears the dirty-bit without acquiring the mmap_lock for write and defers TLB flushes. As a result of this behavior, it is possible that one of the CPUs would have the stale PTE cached in its TLB and keep updating the page while another thread triggers a page-fault, and the page-fault handler would copy the old page into a new one. Since the copying is performed without holding the page-table lock, it is possible that after the copying, and before the PTE is actually flushed, the CPU that cached the stale PTE in the TLB would keep changing the page. These changes would be lost and memory corruption would occur. As Yu Zhao pointed, this race became more apparent since commit 09854ba94c6a ("mm: do_wp_page() simplification") which made wp_page_copy() more likely to take place, specifically if page_count(page) > 1. The following test produces the failure quite well on 5.10 and my machine. Note that the test is tailored for recent kernels behavior in which wp_page_copy() is called when page_count(page) != 1, but the fact the test does not fail on older kernels does not mean they are not affected. 
#define _GNU_SOURCE #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <fcntl.h> #include <string.h> #include <threads.h> #include <stdatomic.h> #define PAGE_SIZE (4096) #define TLB_SIZE (2000) #define N_PAGES (300000) #define ITERATIONS (2000) #define N_THREADS (2) static int stop; static char *m; static int writer(void *argp) { unsigned long t_idx = (unsigned long)argp; int i, cnt = 0; while (!atomic_load(&stop)) { cnt++; atomic_fetch_add((atomic_int *)m, 1); /* * First thread only accesses the page to have it cached in the * TLB. */ if (t_idx == 0) continue; /* * Other threads access enough entries to cause eviction from * the TLB and trigger #PF upon the next access (before the TLB * flush of clear_ref actually takes place). */ for (i = 1; i < TLB_SIZE; i++) { if (atomic_load((atomic_int *)(m + PAGE_SIZE * i))) { fprintf(stderr, "unexpected error\n"); exit(1); } } } return cnt; } /* * Runs mlock/munlock in the background to raise the page-count of the * page and force copying instead of reusing the page. Raising the * page-count is possible in better ways, e.g., registering io_uring * buffers. */ static int do_mlock(void *argp) { while (!atomic_load(&stop)) { if (mlock(m, PAGE_SIZE) || munlock(m, PAGE_SIZE)) { perror("mlock/munlock"); exit(1); } } return 0; } int main(void) { int r, cnt, fd, total = 0; long i; thrd_t thr[N_THREADS]; thrd_t mlock_thr; fd = open("/proc/self/clear_refs", O_WRONLY, 0666); if (fd < 0) { perror("open"); exit(1); } /* * Have large memory for clear_ref, so there would be some time between * the unmap and the actual deferred flush. 
*/ m = mmap(NULL, PAGE_SIZE * N_PAGES, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0); if (m == MAP_FAILED) { perror("mmap"); exit(1); } for (i = 0; i < N_THREADS; i++) { r = thrd_create(&thr[i], writer, (void *)i); assert(r == thrd_success); } r = thrd_create(&mlock_thr, do_mlock, (void *)i); assert(r == thrd_success); for (i = 0; i < ITERATIONS; i++) { r = pwrite(fd, "4", 1, 0); if (r < 0) { perror("pwrite"); exit(1); } } atomic_store(&stop, 1); r = thrd_join(mlock_thr, NULL); assert(r == thrd_success); for (i = 0; i < N_THREADS; i++) { r = thrd_join(thr[i], &cnt); assert(r == thrd_success); total += cnt; } r = atomic_load((atomic_int *)(m)); if (r != total) { fprintf(stderr, "failed: expected=%d actual=%d\n", total, r); exit(-1); } fprintf(stderr, "ok\n"); return 0; } Fix it by taking mmap_lock for write when clearing soft-dirty. Note that the test keeps failing without the pending fix of the missing TLB flushes in clear_refs_write() [1]. [1] https://lore.kernel.org/patchwork/patch/1351776/ Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking") Signed-off-by: Nadav Amit <namit@vmware.com> --- fs/proc/task_mmu.c | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 217aa2705d5d..39b2bd27af79 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1189,6 +1189,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, struct mm_struct *mm; struct vm_area_struct *vma; enum clear_refs_types type; + bool write_lock = false; struct 
mmu_gather tlb; int itype; int rv; @@ -1236,21 +1237,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, } tlb_gather_mmu(&tlb, mm, 0, -1); if (type == CLEAR_REFS_SOFT_DIRTY) { + mmap_read_unlock(mm); + if (mmap_write_lock_killable(mm)) { + count = -EINTR; + goto out_mm; + } for (vma = mm->mmap; vma; vma = vma->vm_next) { - if (!(vma->vm_flags & VM_SOFTDIRTY)) - continue; - mmap_read_unlock(mm); - if (mmap_write_lock_killable(mm)) { - count = -EINTR; - goto out_mm; - } - for (vma = mm->mmap; vma; vma = vma->vm_next) { - vma->vm_flags &= ~VM_SOFTDIRTY; - vma_set_page_prot(vma); - } - mmap_write_downgrade(mm); - break; + vma->vm_flags &= ~VM_SOFTDIRTY; + vma_set_page_prot(vma); } + write_lock = true; mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY, 0, NULL, mm, 0, -1UL); @@ -1261,7 +1257,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, if (type == CLEAR_REFS_SOFT_DIRTY) mmu_notifier_invalidate_range_end(&range); tlb_finish_mmu(&tlb, 0, -1); - mmap_read_unlock(mm); + if (write_lock) + mmap_write_unlock(mm); + else + mmap_read_unlock(mm); out_mm: mmput(mm); } -- 2.25.1 ^ permalink raw reply related [flat|nested] 96+ messages in thread
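The lost-update interleaving described in the commit message above can be replayed deterministically in a single thread. The sketch below is a toy model only: the `toy_` types, the single cached translation per CPU and the step ordering are simplifying assumptions made for illustration, and the "fixed" branch merely models the effect of the stale entry being gone before the copy, not the locking itself.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_INTS 8

struct toy_page { int data[TOY_PAGE_INTS]; };

/* Toy page-table entry: which page is mapped, and whether writable. */
struct toy_pte {
	struct toy_page *page;
	int writable;
};

/* A CPU's cached translation; it stays usable until it is flushed. */
struct toy_tlb_entry {
	struct toy_page *page;
	int valid;
};

/*
 * Replays the race. If flush_before_copy is non-zero, CPU A's cached
 * entry is invalidated before the page is copied; otherwise the flush
 * is deferred until after the copy. CPU A performs exactly one
 * increment, so the correct final value seen through the PTE is 1.
 */
static int toy_replay(int flush_before_copy)
{
	struct toy_page old_page, new_page;
	struct toy_pte pte;
	struct toy_tlb_entry cpu_a_tlb;
	int cpu_a_wrote = 0;

	memset(&old_page, 0, sizeof(old_page));
	memset(&new_page, 0, sizeof(new_page));

	/* 1. The PTE maps the old page writable and CPU A caches it. */
	pte.page = &old_page;
	pte.writable = 1;
	cpu_a_tlb.page = pte.page;
	cpu_a_tlb.valid = 1;

	/* 2. clear_refs write-protects the PTE; the flush may be deferred. */
	pte.writable = 0;
	if (flush_before_copy)
		cpu_a_tlb.valid = 0;

	/* 3. CPU B takes a write fault and copies the old page. */
	new_page = *pte.page;

	/* 4. CPU A increments the counter through its stale entry, if any. */
	if (cpu_a_tlb.valid) {
		cpu_a_tlb.page->data[0]++;
		cpu_a_wrote = 1;
	}

	/* 5. CPU B installs the copy and the deferred flush finally lands. */
	pte.page = &new_page;
	pte.writable = 1;
	cpu_a_tlb.valid = 0;

	/* 6. Without a stale entry, CPU A's write goes through the new PTE. */
	if (!cpu_a_wrote)
		pte.page->data[0]++;

	/* 7. The value the process now sees through the page table. */
	return pte.page->data[0];
}
```

With the deferred flush the increment lands on the old page after it was copied and is lost, which mirrors the "failed: expected=... actual=..." outcome of the test program in the commit message.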
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup 2020-12-25 9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit @ 2021-01-05 15:08 ` Will Deacon 2021-01-05 18:20 ` Andrea Arcangeli 1 sibling, 0 replies; 96+ messages in thread From: Will Deacon @ 2021-01-05 15:08 UTC (permalink / raw) To: Nadav Amit Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: > From: Nadav Amit <namit@vmware.com> > > Clearing soft-dirty through /proc/[pid]/clear_refs can cause memory > corruption as it clears the dirty-bit without acquiring the mmap_lock > for write and defers TLB flushes. > > As a result of this behavior, it is possible that one of the CPUs would > have the stale PTE cached in its TLB and keep updating the page while > another thread triggers a page-fault, and the page-fault handler would > copy the old page into a new one. [...] 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 217aa2705d5d..39b2bd27af79 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1189,6 +1189,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > struct mm_struct *mm; > struct vm_area_struct *vma; > enum clear_refs_types type; > + bool write_lock = false; > struct mmu_gather tlb; > int itype; > int rv; > @@ -1236,21 +1237,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > } > tlb_gather_mmu(&tlb, mm, 0, -1); > if (type == CLEAR_REFS_SOFT_DIRTY) { > + mmap_read_unlock(mm); > + if (mmap_write_lock_killable(mm)) { > + count = -EINTR; > + goto out_mm; > + } > for (vma = mm->mmap; vma; vma = vma->vm_next) { > - if (!(vma->vm_flags & VM_SOFTDIRTY)) > - continue; > - mmap_read_unlock(mm); > - if (mmap_write_lock_killable(mm)) { > - count = -EINTR; > - goto out_mm; > - } > - for (vma = mm->mmap; vma; vma = vma->vm_next) { > - vma->vm_flags &= ~VM_SOFTDIRTY; > - vma_set_page_prot(vma); > - } > - mmap_write_downgrade(mm); > - break; > + vma->vm_flags &= ~VM_SOFTDIRTY; > + vma_set_page_prot(vma); > } > + write_lock = true; > > mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY, > 0, NULL, mm, 0, -1UL); > @@ -1261,7 +1257,10 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, > if (type == CLEAR_REFS_SOFT_DIRTY) > mmu_notifier_invalidate_range_end(&range); > tlb_finish_mmu(&tlb, 0, -1); > - mmap_read_unlock(mm); > + if (write_lock) > + mmap_write_unlock(mm); > + else > + mmap_read_unlock(mm); > out_mm: > mmput(mm); I probably wouldn't bother with the 'write_lock' variable, and just check 'type == CLEAR_REFS_SOFT_DIRTY' instead. But that's trivial and I don't have strong opinions, so: Acked-by: Will Deacon <will@kernel.org> Are you intending to land this for 5.11? If so, I can just rebase my other series on top of this. Will ^ permalink raw reply [flat|nested] 96+ messages in thread
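Will's suggestion to drop the 'write_lock' variable and key the unlock on the request type can be sketched with a toy lock. This is a user-space illustration only: `struct toy_mmap_lock`, the `toy_` helpers and the enum are hypothetical stand-ins, the asserts play the role that lock debugging would play in the kernel, and the page-table walk is elided.

```c
#include <assert.h>

/* Hypothetical request types for the toy model. */
enum toy_clear_refs_type { TOY_CLEAR_REFS_ALL, TOY_CLEAR_REFS_SOFT_DIRTY };

/* Toy reader/writer lock that only tracks who holds it. */
struct toy_mmap_lock {
	int readers;
	int writer;
};

static void toy_read_lock(struct toy_mmap_lock *l)
{
	assert(!l->writer);
	l->readers++;
}

static void toy_read_unlock(struct toy_mmap_lock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

static void toy_write_lock(struct toy_mmap_lock *l)
{
	assert(!l->writer && !l->readers);
	l->writer = 1;
}

static void toy_write_unlock(struct toy_mmap_lock *l)
{
	assert(l->writer);
	l->writer = 0;
}

/*
 * Toy model of the locking in clear_refs_write() after the patch,
 * using the request type instead of a separate 'write_lock' flag.
 */
static void toy_clear_refs(struct toy_mmap_lock *l,
			   enum toy_clear_refs_type type)
{
	toy_read_lock(l);

	if (type == TOY_CLEAR_REFS_SOFT_DIRTY) {
		/* Trade the read lock for the write lock, as the patch does. */
		toy_read_unlock(l);
		toy_write_lock(l);
	}

	/* ... page-table walk and TLB flush would happen here ... */

	if (type == TOY_CLEAR_REFS_SOFT_DIRTY)
		toy_write_unlock(l);
	else
		toy_read_unlock(l);
}
```

Since the type does not change between the lock and the unlock, testing it twice releases the same lock that was taken; the separate flag in the patch only makes that pairing explicit.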
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
  2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
  2021-01-05 15:08   ` Will Deacon
@ 2021-01-05 18:20   ` Andrea Arcangeli
  2021-01-05 19:26     ` Nadav Amit
  1 sibling, 1 reply; 96+ messages in thread
From: Andrea Arcangeli @ 2021-01-05 18:20 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Nadav Amit, Yu Zhao, Andy Lutomirski,
	Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport,
	Minchan Kim, Will Deacon, Peter Zijlstra

On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")

Targeting a backport down to 2013, when nothing could go wrong in practice
with page_mapcount, sounds backwards and unnecessarily risky.

In theory it was already broken and in theory
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the
previous code of 2013 is completely wrong, but in practice the code
from 2013 worked perfectly until Aug 21 2020.

Since nothing at all could go wrong in soft dirty and uffd-wp until
09854ba94c6aad7886996bfbee2530b3d8a7f4f4, the Fixes tag needs to target
that, definitely not a patch from 2013.

This means the backports will apply cleanly; they don't need a simple
solution, but one that doesn't regress the performance of open source
virtual machines and open source products using clear_refs and uffd-wp
in general.

Thanks,
Andrea
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
  2021-01-05 18:20   ` Andrea Arcangeli
@ 2021-01-05 19:26     ` Nadav Amit
  2021-01-05 20:39       ` Andrea Arcangeli
  0 siblings, 1 reply; 96+ messages in thread
From: Nadav Amit @ 2021-01-05 19:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu,
	Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim,
	Will Deacon, Peter Zijlstra

> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")
>
> Targeting a backport down to 2013 when nothing could wrong in practice
> with page_mapcount sounds backwards and unnecessarily risky.
>
> In theory it was already broken and in theory
> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the
> previous code of 2013 is completely wrong, but in practice the code
> from 2013 worked perfectly until Aug 21 2020.

Well… If you consider the bug that Will recently fixed [1], then soft-dirty
was broken (for a different, yet related reason) since 0758cd830494
("asm-generic/tlb: avoid potential double flush").

This is not to say that I argue that the patch should be backported to 2013,
just to say that memory corruption bugs can go unnoticed.

[1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/

>
> Since nothing at all could go wrong in soft dirty and uffd-wp until
> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, the Fixes need to target
> that, definitely not a patch from 2013.
>
> This means the backports will apply clean, they don't need a simple
> solution but one that doesn't regress the performance of open source
> virtual machines and open source products using clear_refs and uffd-wp
> in general.

To summarize my action items based on your (and others') feedback on both
patches:

1.
I will break the first patch into two different patches, one with the
“optimization” for write-unprotect, based on your feedback. It will not
be backported.

2. I will try to add a patch to avoid TLB flushes on
userfaultfd-writeunprotect. It will also not be backported.

3. Let me know if you want me to use your version of testing
mm_tlb_flush_pending() and conditionally flushing, to wait for a new
version from you or Peter, or to go with taking mmap_lock for write.

Thanks again,
Nadav
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup 2021-01-05 19:26 ` Nadav Amit @ 2021-01-05 20:39 ` Andrea Arcangeli 2021-01-05 21:20 ` Yu Zhao ` (2 more replies) 0 siblings, 3 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-05 20:39 UTC (permalink / raw) To: Nadav Amit Cc: linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote: > > On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: > >> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking") > > > > Targeting a backport down to 2013 when nothing could wrong in practice > > with page_mapcount sounds backwards and unnecessarily risky. > > > > In theory it was already broken and in theory > > 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the > > previous code of 2013 is completely wrong, but in practice the code > > from 2013 worked perfectly until Aug 21 2020. > > Well… If you consider the bug that Will recently fixed [1], then soft-dirty > was broken (for a different, yet related reason) since 0758cd830494 > ("asm-generic/tlb: avoid potential double flush”). > > This is not to say that I argue that the patch should be backported to 2013, > just to say that memory corruption bugs can be unnoticed. > > [1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/ Is this a fix or a cleanup? The above is precisely what I said earlier that tlb_gather had no reason to stay in clear_refs and it had to use inc_tlb_flush_pending as mprotect, but it's not a fix? Is it? I suggested it as a pure cleanup. So again no backport required. The commit says fix this but it means "clean this up". 
Now there are plenty of bugs that can go unnoticed for decades, including dirtycow and the very bug that allowed the fork child to attack the parent with vmsplice, which ultimately motivated the page_mapcount->page_count change in do_wp_page in Aug 2020.

Now let's take another example: 7066f0f933a1fd707bb38781866657769cff7efc, which was also found by source review only and never happened in practice, and unlike dirtycow and the vmsplice attack on the parent was not reproducible even at will after it was found (even then it wouldn't be reproducibly exploitable). So you can take 7066f0f933a1fd707bb38781866657769cff7efc as the example of a theoretical issue that might still crash the kernel if unlucky. So before 7066f0f933a1fd707bb38781866657769cff7efc, things worked by luck but not reliably so.

How are all those above relevant here?

In my view none of the above is relevant.

As I already stated, this specific issue, for both uffd-wp and clear_refs, wasn't even a theoretical bug before Aug 2020. It is not like 7066f0f933a1fd707bb38781866657769cff7efc, and it's not like https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/ which appears to be a pure cleanup and doesn't need backporting to any tree.

The uffd-wp clear_refs corruption mathematically could not happen before Aug 2020. It worked by luck too, but unlike 7066f0f933a1fd707bb38781866657769cff7efc reliably so.

Philosophically I obviously agree the bug originated in 2013, but the stable trees don't feel like research material that would care about such an intellectual detail.

So setting a Fixes tag back to 2013, which would mess with all stable trees by actively backporting a performance regression to clear_refs that can break runtime performance, in order to fix a philosophical issue that isn't even a theoretical issue, doesn't sound ideal to me.

> To summarize my action items based your (and others) feedback on both
> patches:
>
> 1.
I will break the first patch into two different patches, one with the
> “optimization” for write-unprotect, based on your feedback. It will not
> be backported.
>
> 2. I will try to add a patch to avoid TLB flushes on
> userfaultfd-writeunprotect. It will also not be backported.

I think 1 and 2 above could be in the same patch. Mixing a uffd-wp optimization with the actual fix for the memory corruption wasn't ideal, but doing the same optimization for both wrprotect and un-wrprotect in the same patch sounds ideal. The commit explanation would be identical and it can be de-duplicated this way.

I'd suggest coordinating with Peter on that, since I wasn't planning to work on this if somebody else offered to do it.

> 3. Let me know if you want me to use your version of testing
> mm_tlb_flush_pending() and conditionally flushing, wait for new version fro
> you or Peter or to go with taking mmap_lock for write.

Yes, as you suggested, I'm trying to clean it up and send a new version.

Ultimately my view is that there is a huge number of cases where it is impossible not to take mmap_write_lock or some other heavy lock that will occasionally require blocking on I/O. Even speculative page faults only attack the low-hanging anon memory, and there's still MADV_DONTNEED/FREE and other stuff that may have to run in parallel with UFFDIO_WRITEPROTECT and clear_refs, not just page faults.

As a reminder: the only case when modifying the vmas is allowed under mmap_read_lock (I already tried once to make it safer by adding READ_ONCE/WRITE_ONCE, but it wasn't merged, see https://www.spinics.net/lists/linux-mm/msg173420.html) is when updating vm_end/vm_start in growsdown/up, where the vma is extended down or up in the page fault under only mmap_read_lock.

I'm doing all I can to document and make more explicit the complexity we deal with in the code (as well as reducing the dependency on gcc emitting atomic writes to update vm_end/vm_start, as we should do in ptes as well in theory).
As you may notice in the feedback on the above submission, not everyone even realized that we're already modifying vmas under mmap_read_lock. So it'd be great to get help merging that READ_ONCE/WRITE_ONCE cleanup, which is still valid and pending merge but needs forward porting.

This one, for both soft dirty and uffd_wrprotect, is a walk in the park to optimize in comparison to the vma modifications.

From my point of view, in fact, doing the tlb flush or waiting for the atomic to be released does not increase kernel complexity compared to what we had until now.

I think we already had this complexity before Aug 2020, but we didn't realize it, and that's why things then broke in clear_refs in Aug 2020 because of an unrelated change that finally exposed the complexity.

By handling the race so that we stop depending on an undocumented page_mapcount dependency, we won't be increasing complexity. We'll merely be documenting the complexity we already had to begin with, so that it'll be less likely to bite us again in the future if it's handled explicitly.

Thanks,
Andrea

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup 2021-01-05 20:39 ` Andrea Arcangeli @ 2021-01-05 21:20 ` Yu Zhao 2021-01-05 21:22 ` Nadav Amit 2021-01-05 21:55 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu 2 siblings, 0 replies; 96+ messages in thread From: Yu Zhao @ 2021-01-05 21:20 UTC (permalink / raw) To: Andrea Arcangeli, Nadav Amit Cc: linux-mm, lkml, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra On Tue, Jan 05, 2021 at 03:39:35PM -0500, Andrea Arcangeli wrote: > On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote: > > > On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > > > On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: > > >> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking") > > > > > > Targeting a backport down to 2013 when nothing could wrong in practice > > > with page_mapcount sounds backwards and unnecessarily risky. > > > > > > In theory it was already broken and in theory > > > 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the > > > previous code of 2013 is completely wrong, but in practice the code > > > from 2013 worked perfectly until Aug 21 2020. > > > > Well… If you consider the bug that Will recently fixed [1], then soft-dirty > > was broken (for a different, yet related reason) since 0758cd830494 > > ("asm-generic/tlb: avoid potential double flush”). > > > > This is not to say that I argue that the patch should be backported to 2013, > > just to say that memory corruption bugs can be unnoticed. > > > > [1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/ > > Is this a fix or a cleanup? 
> > The above is precisely what I said earlier that tlb_gather had no > reason to stay in clear_refs and it had to use inc_tlb_flush_pending > as mprotect, but it's not a fix? Is it? I suggested it as a pure > cleanup. So again no backport required. The commit says fix this but > it means "clean this up". > > Now there are plenty of bugs can go unnoticed for decades, including > dirtycow and the very bug that allowed the fork child to attack the > parent with vmsplice that ultimately motivated the > page_mapcount->page_count in do_wp_page in Aug 2020. > > Now let's take another example: > 7066f0f933a1fd707bb38781866657769cff7efc which also was found by > source review only and never happened in practice, and unlike dirtycow > and the vmsplice attack on parent was not reproducible even at will > after it was found (even then it wouldn't be reproducible > exploitable). So you can take 7066f0f933a1fd707bb38781866657769cff7efc > as the example of theoretical issue that might still crash the kernel > if unlucky. So before 7066f0f933a1fd707bb38781866657769cff7efc, things > worked by luck but not reliably so. > > How are all those above relevant here? > > In my view none of the above is relevant. > > As I already stated this specific issue, for both uffd-wp and > clear_refs wasn't even a theoretical bug before 2020 Aug, it is not > like 7066f0f933a1fd707bb38781866657769cff7efc, and it's not like > https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/ > which appears a pure cleanup and doesn't need backporting to any > tree. > > The uffd-wp clear_refs corruption mathematically could not happen > before Aug 2020, it worked by luck too, but unlike > 7066f0f933a1fd707bb38781866657769cff7efc reliably so. > > Philosophically I obviously agree the bug originated in 2013, but the > stable trees don't feel research material that would care about such a > intellectual detail. 
> > So setting a Fixes back to 2013 that would go mess with all stable > tree by actively backporting a performance regressions to clear_refs > that can break runtime performance to fix a philosophical issue that > isn't even a theoretical issue, doesn't sound ideal to me. > > > To summarize my action items based your (and others) feedback on both > > patches: > > > > 1. I will break the first patch into two different patches, one with the > > “optimization” for write-unprotect, based on your feedback. It will not > > be backported. > > > > 2. I will try to add a patch to avoid TLB flushes on > > userfaultfd-writeunprotect. It will also not be backported. > > I think 1 and 2 above could be in the same patch. Mixing an uffd-wp optimization with the > actual fix the memory corruption wasn't ideal, but doing the same > optimization to both wrprotect and un-wrprotect in the same patch > sounds ideal. The commit explanation would be identical and it can be > de-duplicated this way. > > I'd suggest to coordinate with Peter on that, since I wasn't planning > to work on this if somebody else offered to do it. I probably could post something based on the local flush idea we discussed, but it won't be in this month. It seems to me there is much has to be done, e.g., auditing all clearing of the writable & the dirty bits, document the exactly steps when clearing them to prevent similar problems in the future. I'd be happy to review your patches too if you could have them sooner. Meanwhile, Nadav, my reviewed-by on your patch stands, since it's straightforward and safe for backport. > > 3. Let me know if you want me to use your version of testing > > mm_tlb_flush_pending() and conditionally flushing, wait for new version fro > > you or Peter or to go with taking mmap_lock for write. > > Yes, as you suggested, I'm trying to clean it up and send a new > version. 
> > Ultimately my view is there are an huge number of cases where > mmap_write_lock or some other heavy lock that will require > occasionally to block on I/O is beyond impossible not to take. Even > speculative page faults only attack the low hanging anon memory and > there's still MADV_DONTNEED/FREE and other stuff that may have to run > in parallel with UFFDIO_WRITEPROTECT and clear_refs, not just page > faults. > > As a reminder: the only case when modifying the vmas is allowed under > mmap_read_lock (I already tried once to make it safer by adding > READ_ONCE/WRITE_ONCE but wasn't merged see > https://www.spinics.net/lists/linux-mm/msg173420.html), is when > updating vm_end/vm_start in growsdown/up, where the vma is extended > down or up in the page fault under only mmap_read_lock. > > I'm doing all I can to document and make it more explicit the > complexity we deal with in the code (as well as reducing the gcc > dependency in emitting atomic writes to update vm_end/vm_start, as we > should do in ptes as well in theory). As you may notice in the > feedback from the above submission not all even realized that we're > modifying vmas already under mmap_read_lock. So it'd be great to get > help to merge that READ_ONCE/WRITE_ONCE cleanup that is still valid > and pending for merge but it needs forward porting. > > This one, for both soft dirty and uffd_wrprotect, is a walk in the > park to optimize in comparison to the vma modifications. > > From my point of view in fact, doing the tlb flush or the wait on the > atomic to be released, does not increase kernel complexity compared to > what we had until now. > > I think we already had this complexity before Aug 2020, but we didn't > realize it, and that's why thing then broke in clear_refs in Aug 2020 > because of an unrelated change that finally exposed the complexity. 
> > By handling the race so that we stop depending on an undocumented > page_mapcount dependency, we won't be increasing complexity, we'll be > merely documenting the complexity we already had to begin with, so > that it'll be less likely to bite us again in the future if it's > handled explicitly. > > Thanks, > Andrea > ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup 2021-01-05 20:39 ` Andrea Arcangeli 2021-01-05 21:20 ` Yu Zhao @ 2021-01-05 21:22 ` Nadav Amit 2021-01-05 22:16 ` Will Deacon ` (2 more replies) 2021-01-05 21:55 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Peter Xu 2 siblings, 3 replies; 96+ messages in thread From: Nadav Amit @ 2021-01-05 21:22 UTC (permalink / raw) To: Andrea Arcangeli Cc: linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra > On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote: >>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: >>> >>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: >>>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking") >>> >>> Targeting a backport down to 2013 when nothing could wrong in practice >>> with page_mapcount sounds backwards and unnecessarily risky. >>> >>> In theory it was already broken and in theory >>> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the >>> previous code of 2013 is completely wrong, but in practice the code >>> from 2013 worked perfectly until Aug 21 2020. >> >> Well… If you consider the bug that Will recently fixed [1], then soft-dirty >> was broken (for a different, yet related reason) since 0758cd830494 >> ("asm-generic/tlb: avoid potential double flush”). >> >> This is not to say that I argue that the patch should be backported to 2013, >> just to say that memory corruption bugs can be unnoticed. >> >> [1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/ > > Is this a fix or a cleanup? 
> > The above is precisely what I said earlier that tlb_gather had no > reason to stay in clear_refs and it had to use inc_tlb_flush_pending > as mprotect, but it's not a fix? Is it? I suggested it as a pure > cleanup. So again no backport required. The commit says fix this but > it means "clean this up". It is actually a fix. I think the commit log is not entirely correct and should include: Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush”). Since 0758cd830494, calling tlb_finish_mmu() without any previous call to pte_free_tlb() and friends does not flush the TLB. The soft-dirty bug producer that I sent fails without this patch of Will. > So setting a Fixes back to 2013 that would go mess with all stable > tree by actively backporting a performance regressions to clear_refs > that can break runtime performance to fix a philosophical issue that > isn't even a theoretical issue, doesn't sound ideal to me. Point taken. > >> To summarize my action items based your (and others) feedback on both >> patches: >> >> 1. I will break the first patch into two different patches, one with the >> “optimization” for write-unprotect, based on your feedback. It will not >> be backported. >> >> 2. I will try to add a patch to avoid TLB flushes on >> userfaultfd-writeunprotect. It will also not be backported. > > I think 1 and 2 above could be in the same patch. Mixing an uffd-wp optimization with the > actual fix the memory corruption wasn't ideal, but doing the same > optimization to both wrprotect and un-wrprotect in the same patch > sounds ideal. The commit explanation would be identical and it can be > de-duplicated this way. > > I'd suggest to coordinate with Peter on that, since I wasn't planning > to work on this if somebody else offered to do it. > >> 3. Let me know if you want me to use your version of testing >> mm_tlb_flush_pending() and conditionally flushing, wait for new version fro >> you or Peter or to go with taking mmap_lock for write. 
> > Yes, as you suggested, I'm trying to clean it up and send a new > version. > > Ultimately my view is there are an huge number of cases where > mmap_write_lock or some other heavy lock that will require > occasionally to block on I/O is beyond impossible not to take. Even > speculative page faults only attack the low hanging anon memory and > there's still MADV_DONTNEED/FREE and other stuff that may have to run > in parallel with UFFDIO_WRITEPROTECT and clear_refs, not just page > faults. > > As a reminder: the only case when modifying the vmas is allowed under > mmap_read_lock (I already tried once to make it safer by adding > READ_ONCE/WRITE_ONCE but wasn't merged see > https://www.spinics.net/lists/linux-mm/msg173420.html), is when > updating vm_end/vm_start in growsdown/up, where the vma is extended > down or up in the page fault under only mmap_read_lock. > > I'm doing all I can to document and make it more explicit the > complexity we deal with in the code (as well as reducing the gcc > dependency in emitting atomic writes to update vm_end/vm_start, as we > should do in ptes as well in theory). As you may notice in the > feedback from the above submission not all even realized that we're > modifying vmas already under mmap_read_lock. So it'd be great to get > help to merge that READ_ONCE/WRITE_ONCE cleanup that is still valid > and pending for merge but it needs forward porting. > > This one, for both soft dirty and uffd_wrprotect, is a walk in the > park to optimize in comparison to the vma modifications. I am sure you are right. > > From my point of view in fact, doing the tlb flush or the wait on the > atomic to be released, does not increase kernel complexity compared to > what we had until now. It is also about performance due to unwarranted TLB flushes. I think avoiding them requires some finer granularity detection of pending page-faults. 
But anyhow, I still owe some TLB optimization patches (and v2 for userfaultfd+iouring) before I can even look at that. In addition, as I stated before, having some clean interfaces that tell whether a TLB flush is needed or not would be helpful and simpler to follow. For instance, we can have is_pte_prot_demotion(oldprot, newprot) to figure out whether a TLB flush is needed in change_pte_range() and avoid unnecessary flushes when unprotecting pages with either mprotect() or userfaultfd. ^ permalink raw reply [flat|nested] 96+ messages in thread
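The is_pte_prot_demotion(oldprot, newprot) interface proposed in the message above is not spelled out in the thread. A minimal userspace sketch of the idea follows, assuming protections can be modeled as a plain bitmask; the PROT_BIT_* names are hypothetical, and real PTE layouts are architecture-specific and also carry dirty/accessed state that this ignores.

```c
#include <stdbool.h>

/* Hypothetical permission bits, for illustration only. */
#define PROT_BIT_READ  0x1u
#define PROT_BIT_WRITE 0x2u
#define PROT_BIT_EXEC  0x4u

/*
 * Returns true when newprot removes at least one permission that oldprot
 * granted. Only in that case can a stale TLB entry permit an access that
 * the updated PTE forbids, so only then is a flush needed for correctness.
 */
static bool is_pte_prot_demotion(unsigned int oldprot, unsigned int newprot)
{
    return (oldprot & ~newprot) != 0;
}
```

Under this model, write-protecting a read-write mapping is a demotion, while unprotecting it is not. Skipping the flush in the second case is safe but may leave stale read-only TLB entries behind, which would show up as spurious write faults.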
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup 2021-01-05 21:22 ` Nadav Amit @ 2021-01-05 22:16 ` Will Deacon 2021-01-06 0:29 ` Andrea Arcangeli 2021-01-06 0:02 ` Andrea Arcangeli 2021-01-07 20:04 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli 2 siblings, 1 reply; 96+ messages in thread From: Will Deacon @ 2021-01-05 22:16 UTC (permalink / raw) To: Nadav Amit Cc: Andrea Arcangeli, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra On Tue, Jan 05, 2021 at 09:22:51PM +0000, Nadav Amit wrote: > > On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote: > >>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > >>> > >>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: > >>>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking") > >>> > >>> Targeting a backport down to 2013 when nothing could wrong in practice > >>> with page_mapcount sounds backwards and unnecessarily risky. > >>> > >>> In theory it was already broken and in theory > >>> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the > >>> previous code of 2013 is completely wrong, but in practice the code > >>> from 2013 worked perfectly until Aug 21 2020. > >> > >> Well… If you consider the bug that Will recently fixed [1], then soft-dirty > >> was broken (for a different, yet related reason) since 0758cd830494 > >> ("asm-generic/tlb: avoid potential double flush”). > >> > >> This is not to say that I argue that the patch should be backported to 2013, > >> just to say that memory corruption bugs can be unnoticed. > >> > >> [1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/ > > > > Is this a fix or a cleanup? 
> > > > The above is precisely what I said earlier that tlb_gather had no > > reason to stay in clear_refs and it had to use inc_tlb_flush_pending > > as mprotect, but it's not a fix? Is it? I suggested it as a pure > > cleanup. So again no backport required. The commit says fix this but > > it means "clean this up". > > It is actually a fix. I think the commit log is not entirely correct and > should include: > > Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush”). > > Since 0758cd830494, calling tlb_finish_mmu() without any previous call to > pte_free_tlb() and friends does not flush the TLB. The soft-dirty bug > producer that I sent fails without this patch of Will. Yes, it's a fix, but I didn't rush it for 5.10 because I don't think rushing this sort of thing does anybody any favours. I agree that the commit log should be updated; I mentioned this report in the cover letter: https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=RDJNU6vGF3hLO+j-g@mail.gmail.com/ demonstrating that somebody has independently stumbled over the missing TLB invalidation in userspace, but it's not as bad as the other issues we've been discussing in this thread. Will ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup 2021-01-05 22:16 ` Will Deacon @ 2021-01-06 0:29 ` Andrea Arcangeli 0 siblings, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-06 0:29 UTC (permalink / raw) To: Will Deacon Cc: Nadav Amit, linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra On Tue, Jan 05, 2021 at 10:16:29PM +0000, Will Deacon wrote: > On Tue, Jan 05, 2021 at 09:22:51PM +0000, Nadav Amit wrote: > > > On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > > > On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote: > > >>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > >>> > > >>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: > > >>>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking") > > >>> > > >>> Targeting a backport down to 2013 when nothing could wrong in practice > > >>> with page_mapcount sounds backwards and unnecessarily risky. > > >>> > > >>> In theory it was already broken and in theory > > >>> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the > > >>> previous code of 2013 is completely wrong, but in practice the code > > >>> from 2013 worked perfectly until Aug 21 2020. > > >> > > >> Well… If you consider the bug that Will recently fixed [1], then soft-dirty > > >> was broken (for a different, yet related reason) since 0758cd830494 > > >> ("asm-generic/tlb: avoid potential double flush”). > > >> > > >> This is not to say that I argue that the patch should be backported to 2013, > > >> just to say that memory corruption bugs can be unnoticed. > > >> > > >> [1] https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-will@kernel.org/ > > > > > > Is this a fix or a cleanup? 
> > > > > > The above is precisely what I said earlier that tlb_gather had no > > > reason to stay in clear_refs and it had to use inc_tlb_flush_pending > > > as mprotect, but it's not a fix? Is it? I suggested it as a pure > > > cleanup. So again no backport required. The commit says fix this but > > > it means "clean this up". > > > > It is actually a fix. I think the commit log is not entirely correct and > > should include: > > > > Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush”). Agreed. > > > > Since 0758cd830494, calling tlb_finish_mmu() without any previous call to > > pte_free_tlb() and friends does not flush the TLB. The soft-dirty bug > > producer that I sent fails without this patch of Will. > > Yes, it's a fix, but I didn't rush it for 5.10 because I don't think rushing > this sort of thing does anybody any favours. I agree that the commit log > should be updated; I mentioned this report in the cover letter: > > https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=RDJNU6vGF3hLO+j-g@mail.gmail.com/ > > demonstrating that somebody has independently stumbled over the missing TLB > invalidation in userspace, but it's not as bad as the other issues we've been > discussing in this thread. Thanks for the explanation Nadav and Will. The fact the code was a 100% match to the cleanup I independently suggested a few weeks ago to reduce the confusion in clear_refs, made me overlook the difference 0758cd830494 made. I didn't realize the flush got optimized away if no gathering happened. Backporting this sort of thing with Fixes I guess tends to give the same kind of favors as rushing it for 5.10, but then in general the Fixes is accurate here. Overall it looks obviously safe cleanup and it is also a fix starting in v5.6, so I don't think this can cause more issues than what it sure fixes at least. 
The cleanup was needed anyway, even before it became a fix, since if it were mandatory to use tlb_gather when you purely need inc_tlb_flush_pending, then mprotect couldn't get away with it. I guess the optimization in 0758cd830494 just made it more explicit that no code should use tlb_gather if it doesn't need to gather any page. Maybe adding a note about the new behavior to the comment on top of tlb_gather_mmu wouldn't hurt.

^ permalink raw reply [flat|nested] 96+ messages in thread
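The difference discussed in the messages above can be illustrated with a toy userspace model. This is only a sketch: the toy_* types and helpers are invented stand-ins, not the kernel's mmu_gather API, and the "skip the flush when nothing was gathered" rule is modeled purely from the description given in this thread.

```c
#include <stdbool.h>

/* Toy stand-ins for illustration; not kernel structures. */
struct toy_mm {
    int flush_pending;   /* models the mm's pending-flush count */
    int tlb_flushes;     /* counts flushes actually issued */
};

struct toy_gather {
    struct toy_mm *mm;
    bool gathered;       /* set only when pages or page tables were queued */
};

static void toy_gather_start(struct toy_gather *tlb, struct toy_mm *mm)
{
    tlb->mm = mm;
    tlb->gathered = false;
    mm->flush_pending++;
}

/* Models the behaviour described in the thread for kernels after
 * 0758cd830494: finishing a gather that queued nothing issues no flush. */
static void toy_gather_finish(struct toy_gather *tlb)
{
    if (tlb->gathered)
        tlb->mm->tlb_flushes++;
    tlb->mm->flush_pending--;
}

/* clear_refs-style pass that write-protects PTEs inside a gather
 * without ever queuing anything. Returns the number of flushes issued. */
static int toy_clear_refs_with_gather(struct toy_mm *mm)
{
    struct toy_gather tlb;
    int before = mm->tlb_flushes;

    toy_gather_start(&tlb, mm);
    /* PTEs would be write-protected here; nothing is gathered. */
    toy_gather_finish(&tlb);
    return mm->tlb_flushes - before;
}

/* Same pass with an explicit pending count and an unconditional flush,
 * the style the thread attributes to mprotect. */
static int toy_clear_refs_explicit(struct toy_mm *mm)
{
    int before = mm->tlb_flushes;

    mm->flush_pending++;
    /* PTEs would be write-protected here. */
    mm->tlb_flushes++;
    mm->flush_pending--;
    return mm->tlb_flushes - before;
}
```

In this model the gather-based pass issues no flush at all, while the explicit variant always issues one, which is the gap the thread describes.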
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup 2021-01-05 21:22 ` Nadav Amit 2021-01-05 22:16 ` Will Deacon @ 2021-01-06 0:02 ` Andrea Arcangeli 2021-01-07 20:04 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli 2 siblings, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-06 0:02 UTC (permalink / raw) To: Nadav Amit Cc: linux-mm, lkml, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra

On Tue, Jan 05, 2021 at 09:22:51PM +0000, Nadav Amit wrote:
> It is also about performance due to unwarranted TLB flushes.

If there is a problem, switching to the wait_flush_pending() model suggested by Peter may not even require changes to the common code in memory.c, since I'm thinking it may not even need to take a failure path if we plug it in at the same place as the tlb flush.

So instead of the flush we could always block there until we read zero in the atomic, then smp_rmb() and we're ready to start the copy.

So either we flush with an IPI if we didn't read zero, or we block until we read zero; the difference is going to be hidden from do_wp_page. All do_wp_page cares about is that by the time the abstract call returns, there's no stale TLB entry left for such pte. Whether that is achieved by blocking and waiting or by flushing the TLB doesn't matter too much.

So thinking of how badly the IPI will do: with the improved arm64 tlb flushing code in production, we keep track of how many simultaneous mm contexts there are, specifically to skip the SMP-unscalable TLBI broadcast on arm64, like we already avoid IPIs on lazy tlbs on x86 (see x86 tlb_is_not_lazy in native_flush_tlb_others).

In other words the IPI will materialize only if there's more than one thread running while clear_refs runs. All lazy tlbs won't get IPIs on both x86 upstream and arm64 enterprise.
This won't help multithreaded processes that compute on all CPUs at all times, but even multiple vcpu threads aren't guaranteed to be running at all times.

My main concern would be an IPI flood that slows down clear_refs and UFFDIO_WRITEPROTECT, but an incremental optimization (not required for correctness) is to have UFFDIO_WRITEPROTECT and clear_refs switch to lazy tlb mode before they call inc_tlb_flush_pending() and unlazy only after dec_tlb_flush_pending(). So it's possible to at least guarantee the IPI won't slow them down.

> In addition, as I stated before, having some clean interfaces that tell
> whether a TLB flush is needed or not would be helpful and simpler to follow.
> For instance, we can have is_pte_prot_demotion(oldprot, newprot) to figure
> out whether a TLB flush is needed in change_pte_range() and avoid
> unnecessary flushes when unprotecting pages with either mprotect() or
> userfaultfd.

When you mentioned this earlier I was wondering what happens then with flush_tlb_fix_spurious_fault(). The fact that it's safe doesn't guarantee it's a performance win if there's a stream of spurious faults as a result. So it'd need to be checked, especially in the case of mprotect, where the flush can be deferred and coalesced into a single IPI at the end, so there's not much to gain from it anyway.

If you can guarantee there won't be a flood of spurious wrprotect faults, then it'll be a nice optimization.

Thanks,
Andrea

^ permalink raw reply [flat|nested] 96+ messages in thread
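The two alternatives described in the message above (flush if the pending counter is non-zero, or wait until it reads zero) can be sketched in userspace with C11 atomics. This is a hedged illustration only: the toy_* names are invented, the acquire load stands in for the atomic read followed by smp_rmb() mentioned in the message, and a real implementation would sleep rather than spin.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the mm; not a kernel structure. */
struct toy_mm {
    atomic_int tlb_flush_pending;  /* raised around a write-protect pass */
    int flushes;                   /* flushes issued by the fault path */
};

static void toy_mm_init(struct toy_mm *mm)
{
    atomic_init(&mm->tlb_flush_pending, 0);
    mm->flushes = 0;
}

/* Acquire load: later reads (the page copy) cannot be reordered before it. */
static bool toy_flush_pending(struct toy_mm *mm)
{
    return atomic_load_explicit(&mm->tlb_flush_pending,
                                memory_order_acquire) != 0;
}

/* Variant 1: flush only if a write-protect pass may still be in flight. */
static void toy_make_pte_stable_by_flush(struct toy_mm *mm)
{
    if (toy_flush_pending(mm))
        mm->flushes++;
}

/* Variant 2: wait until every in-flight pass has finished its own flush. */
static void toy_make_pte_stable_by_wait(struct toy_mm *mm)
{
    while (toy_flush_pending(mm))
        ;  /* busy-wait here; a real implementation would block */
}
```

Either helper gives the caller the same guarantee on return, namely that no deferred flush is outstanding, which is why the choice could stay hidden from the fault path.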
* [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-05 21:22 ` Nadav Amit 2021-01-05 22:16 ` Will Deacon 2021-01-06 0:02 ` Andrea Arcangeli @ 2021-01-07 20:04 ` Andrea Arcangeli 2021-01-07 20:04 ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli ` (2 more replies) 2 siblings, 3 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 20:04 UTC (permalink / raw) To: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai

Hello,

I prepared in 2/2 a fix to make UFFDIO_WRITEPROTECT and clear_refs_write cope with page_count in do_wp_page. It'd stack perfectly on top of 1/2 from Will, which fixes an orthogonal regression and would need to be applied first since its Fixes tag comes first.

I hope this patchset and my initial answer in https://lkml.kernel.org/r/X+PoXCizo392PBX7@redhat.com show that I tried to keep an open mind and to fix what 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 broke. Even in the commit of 2/2 I wrote "is completely correct", although I had to change my mind about that.

It turns out the memory corruption caused by the breakage in the TLB flushing is a walk in the park to fix for clear_refs and UFFDIO_WRITEPROTECT, but that is only the tip of the iceberg.

To simplify, let's forget the mmap_read_lock and assume we hypothetically throw away the mmap_read_lock from the kernel, so that UFFDIO_WRITEPROTECT, clear_refs and everything else take the mmap_write_lock only.

Even then, clear_refs and UFFDIO_WRITEPROTECT will remain broken if the memory they're wrprotecting is GUP pinned by a secondary MMU or RDMA or something that is reading the memory through a read GUP pin.
You can only make fork safe from the page_count by COWing any page that has a GUP pin, because fork is actually allowed to COW (and that will also fix the longstanding fork vs threads vs GUP race as result, which I tried already once in https://lkml.kernel.org/r/20090311165833.GI27823@random.random ). However it's fundamentally flawed and forbidden to COW after clear_refs and UFFDIO_WRITEPROTECT, if fork or clone have never been called and there's any GUP pin on the pages that were wrprotected. In other words, the page_count check in do_wp_page, extends the old fork vs gup vs threads race, so that it also happens within a single thread without fork() involvement. The above scenario is even more explicit and unfixable when it's not just a single page but something bigger mapped by multiple processes that was GUP pinned. Either COWs would have to be forbidden and features clear_refs dropped from the kernel and mprotect also would be strictly forbidden to ever leave any pte_write bit clear for any reason, or do_wp_page requires full accuracy on exclusive pages in MAP_PRIVATE private COWs or MAP_ANON do_wp_page. In simple terms: the page_count check in do_wp_page makes it impossible to wrprotect memory, if such memory is under a !FOLL_WRITE GUP pin. It's a feature (not a bug) that the GUP pin must not trigger a COW, and this is also already explicitly documented in comments in the current source in places that are still using mapcount, or we'd be probably dealing with more breakage than what's reproducible right now. Can we take a step back and start looking at what started all this VM breakage? I mean specifically Jann's discovery that parent can attack the child after the child does drop privs by using vmsplice long term unprivileged GUP pins? vmsplice syscall API is insecure allowing long term GUP PINs without privilege. 
Before touching any of the COW code, something had to be done on vmsplice because even after 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 there's still no way to tame the other VM breakage and working DoS caused by vmsplice. (I already sent the vmsplice DoS PoC exploit to whom it may concern on Aug 27 2020.) Zygote, before worrying about COWs, needs to block vmsplice in the child (and io_uring until it's fixed) with seccomp by default until the syscall becomes privileged. I also recommended such a change to the podman default allowlist, but apparently it wasn't committed: checking below, vmsplice is still there in the allowlist unfortunately. I'll try to suggest it again in a follow up. https://github.com/containers/common/blob/master/pkg/seccomp/seccomp.json io_uring unlike vmsplice can remain unprivileged, but it needs to use the mmu notifier to make those GUP pins become VM neutral. After io_uring is fixed with mmu notifier, and vmsplice becomes a privileged syscall, the concern that remains for the zygote model is on par with the fact that there is RAM carelessly shared with L1 and L2 cache also shared between parent and child. It'll take DMA and burning the flash in order to keep poking for the parent to write at the wrong time. So the phone would get so hot or the battery would run out of juice before the attack can expose data from the parent. So for an attacker it may be easier to look for a side channel with flush and reload on the shared L2 that doesn't rely on more costly GUP transient pins. So the concern that started all this, once vmsplice and io_uring are both fixed, in my view becomes theoretical. Especially on enterprise (non-embedded), this issue is not even theoretical but it's fully irrelevant, since execve has to be used after drop privs (or the app needs to use MADV_DONTFORK or unshare all memory if using a jailer that doesn't execve) to prevent the aforementioned side channel from remaining.
Only RAM constrained embedded devices are justified to take shortcuts with the ensuing side channel security concern that emerges from it. For all the above reasons, because so far the cure has been worse than the disease itself, I'd recommend enterprise distro kernels to ignore the zygote embedded model attack for the time being and not to backport anything in this regard. What should not be backported specifically starts in 17839856fd588f4ab6b789f482ed3ffd7c403e1f. 17839856fd588f4ab6b789f482ed3ffd7c403e1f was supposed to be fine and not break anything and unfortunately I was too busy while 17839856fd588f4ab6b789f482ed3ffd7c403e1f morphed into 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, so I still miss a whole huge discussion in that transition. I don't know what was fundamentally flawed in 17839856fd588f4ab6b789f482ed3ffd7c403e1f yet. All I comment about here is purely the current end result: 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, not any of the intermediate steps that brought us there. I already mentioned the page_count wasn't workable in do_wp_page in Message-ID: <20200527212005.GC31990@redhat.com> on May 27 2020, quote: "I don't see how we can check the page_count to decide if to COW or not in the wrprotect fault, given [..] introduced in >=" Then Hugh said he wasn't happy about it on 1 Sep 2020 in Message-ID: <alpine.LSU.2.11.2008312207450.1212@eggly.anvils>. In https://lkml.kernel.org/r/X+O49HrcK1fBDk0Q@redhat.com I suggested "I hope we can find a way put the page_mapcount back where there" and now I have to double down and suggest that the page_count is fundamentally unsafe in do_wp_page. I see how the page_count would also fix the specific zygote child->parent attack and I kept an open mind hoping it would just solve all problems magically. So I tried to fix even clear_refs to cope with it, but this is only the tip of the iceberg of what really breaks.
So in short I contextually self-NAK 2/2 of this patchset and we need to somehow reverse 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 instead. Thanks, Andrea Andrea Arcangeli (1): mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Will Deacon (1): mm: proc: Invalidate TLB after clearing soft-dirty page state fs/proc/task_mmu.c | 26 ++++++++++++--- include/linux/mm.h | 46 +++++++++++++++++++++++++++ include/linux/mm_types.h | 5 +++ kernel/fork.c | 1 + mm/memory.c | 69 ++++++++++++++++++++++++++++++++++++++++ mm/mprotect.c | 4 +++ 6 files changed, 146 insertions(+), 5 deletions(-) ^ permalink raw reply [flat|nested] 96+ messages in thread
* [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state 2021-01-07 20:04 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli @ 2021-01-07 20:04 ` Andrea Arcangeli 2021-01-07 20:04 ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli 2021-01-07 20:25 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe 2 siblings, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 20:04 UTC (permalink / raw) To: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai From: Will Deacon <will@kernel.org> Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"), TLB invalidation is elided in tlb_finish_mmu() if no entries were batched via the tlb_remove_*() functions. Consequently, the page-table modifications performed by clear_refs_write() in response to a write to /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is fine when simply aging the ptes, in the case of clearing the "soft-dirty" state we can end up with entries where pte_write() is false, yet a writable mapping remains in the TLB. Fix this by avoiding the mmu_gather API altogether: managing both the 'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB invalidation for the soft-dirty path, much like mprotect() does already.
Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush") Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- fs/proc/task_mmu.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index ee5a235b3056..a127262ba517 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1189,7 +1189,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, struct mm_struct *mm; struct vm_area_struct *vma; enum clear_refs_types type; - struct mmu_gather tlb; int itype; int rv; @@ -1234,7 +1233,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, count = -EINTR; goto out_mm; } - tlb_gather_mmu(&tlb, mm, 0, -1); if (type == CLEAR_REFS_SOFT_DIRTY) { for (vma = mm->mmap; vma; vma = vma->vm_next) { if (!(vma->vm_flags & VM_SOFTDIRTY)) @@ -1252,15 +1250,18 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, break; } + inc_tlb_flush_pending(mm); mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY, 0, NULL, mm, 0, -1UL); mmu_notifier_invalidate_range_start(&range); } walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops, &cp); - if (type == CLEAR_REFS_SOFT_DIRTY) + if (type == CLEAR_REFS_SOFT_DIRTY) { mmu_notifier_invalidate_range_end(&range); - tlb_finish_mmu(&tlb, 0, -1); + flush_tlb_mm(mm); + dec_tlb_flush_pending(mm); + } mmap_read_unlock(mm); out_mm: mmput(mm); ^ permalink raw reply related [flat|nested] 96+ messages in thread
* [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 20:04 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli 2021-01-07 20:04 ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli @ 2021-01-07 20:04 ` Andrea Arcangeli 2021-01-07 20:17 ` Linus Torvalds 2021-01-07 21:36 ` kernel test robot 2021-01-07 20:25 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe 2 siblings, 2 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 20:04 UTC (permalink / raw) To: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai NOTE: the "Fixes" tag used here is to optimize the backporting, but 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is completely correct. Despite being correct it happened to uncover some implicit assumption some other code made on a very specific behavior in do_wp_page that had to be altered by such commit. The page_mapcount is always guaranteed == 1 for an exclusive anon page, so when it was used to decide if an exclusive page under wrprotection could be reused (as in wp_page_reuse), the outcome would always come true. The page_mapcount had to be replaced with the page_count because it couldn't account for GUP pins, so after that change, for the first time, wp_page_copy can now be called also for exclusive anon pages that are underway wrprotection. Even then everything is still fine for all cases that wrprotect with the mmap_write_lock since the COW faults cannot run concurrently in such case. However there are two cases that could wrprotecting exclusive anon pages with only the mmap_read_lock: soft dirty clear_refs_write and UFFDIO_WRITEPROTECT. 
Both of them would benefit from keeping their wrprotection runtime scalable and to keep using the mmap_read_lock without having to switch to the mmap_write_lock. To stick to the mmap_read_lock, for both UFFDIO_WRITEPROTECT and clear_refs_write we need to handle the new reality that there can be COWs (as in wp_page_copy) happening on exclusive anon pages that are under wrprotection, but with the respective TLB flush still deferred. An example of the problematic UFFDIO_USERFAULTFD runtime is shown below. CPU0 CPU 1 CPU 2 ------ -------- ------- userfaultfd_wrprotect(mode_wp = true) PT lock atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE PT unlock do_page_fault FAULT_FLAG_WRITE userfaultfd_wrprotect(mode_wp = false) PT lock ATOMIC clear _PAGE_UFFD_WP <- problem /* _PAGE_WRITE not set */ PT unlock XXXXXXXXXXXXXX BUG RACE window open here PT lock FAULT_FLAG_WRITE is set by CPU _PAGE_WRITE is still clear in pte PT unlock wp_page_copy cow_user_page runs with stale TLB deferred tlb flush <- too late XXXXXXXXXXXXXX BUG RACE window close here userfaultfd_wrprotect(mode_wp = true) is never a problem because as long as the uffd-wp flag is set in the pte/hugepmd the page fault is guaranteed to reach a dead end in handle_userfault(). The window for uffd-wp not to be set while the pte has been wrprotected but the TLB flush is still pending, is opened when userfaultfd_wrprotect(mode_wp = false) releases the PT-lock as shown above and it closes when the first deferred TLB flush runs. If do_wp_page->wp_page_copy runs within such window, some userland writes can get lost in the copy and they can end up in the original page that gets discarded. The soft dirty runtime is similar and it would be like below: CPU0 CPU 1 CPU 2 ------ -------- ------- instantiate writable TLB clear_refs_write PT lock pte_wrprotect PT unlock do_page_fault FAULT_FLAG_WRITE PT lock FAULT_FLAG_WRITE is set by CPU _PAGE_WRITE is still clear in pte PT unlock wp_page_copy cow_user_page...
writes through the TLB ...cow_user_page So to handle this race a wrprotect_tlb_flush_pending atomic counter is added to the vma. This counter needs to be elevated while holding the mmap_read_lock before taking the PT lock to wrprotect the pagetable and it can only be decreased after the deferred TLB flush is complete. This way the page fault can trivially serialize against pending TLB flushes using a new helper: sync_wrprotect_tlb_flush_pending(). Testing with the userfaultfd selftest is showing 100% reproducible mm corruption with writes getting lost, before this commit. $ ./userfaultfd anon 100 100 nr_pages: 25600, nr_pages_per_cpu: 3200 bounces: 99, mode: rnd racing read, userfaults: 773 missing (215+205+58+114+72+85+18+6), 9011 wp (1714+1714+886+1227+1009+1278+646+537) [..] bounces: 72, mode: poll, userfaults: 720 missing (187+148+102+49+92+103+24+15), 9885 wp (1452+1175+1104+1667+1101+1365+913+1108) bounces: 71, mode: rnd racing ver read, page_nr 25241 memory corruption 6 7 After the commit the userland memory corruption is gone as expected. 
Cc: stable@kernel.org Reported-by: Nadav Amit <namit@vmware.com> Suggested-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- fs/proc/task_mmu.c | 17 +++++++++- include/linux/mm.h | 46 +++++++++++++++++++++++++++ include/linux/mm_types.h | 5 +++ kernel/fork.c | 1 + mm/memory.c | 69 ++++++++++++++++++++++++++++++++++++++++ mm/mprotect.c | 4 +++ 6 files changed, 141 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index a127262ba517..e75cb135db02 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1235,8 +1235,20 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, } if (type == CLEAR_REFS_SOFT_DIRTY) { for (vma = mm->mmap; vma; vma = vma->vm_next) { - if (!(vma->vm_flags & VM_SOFTDIRTY)) + struct vm_area_struct *tmp; + if (!(vma->vm_flags & VM_SOFTDIRTY)) { + inc_wrprotect_tlb_flush_pending(vma); continue; + } + + /* + * Rollback wrprotect_tlb_flush_pending before + * releasing the mmap_lock. 
+ */ + for (vma = mm->mmap; vma != tmp; + vma = vma->vm_next) + dec_wrprotect_tlb_flush_pending(vma); + mmap_read_unlock(mm); if (mmap_write_lock_killable(mm)) { count = -EINTR; @@ -1245,6 +1257,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, for (vma = mm->mmap; vma; vma = vma->vm_next) { vma->vm_flags &= ~VM_SOFTDIRTY; vma_set_page_prot(vma); + inc_wrprotect_tlb_flush_pending(vma); } mmap_write_downgrade(mm); break; @@ -1260,6 +1273,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, if (type == CLEAR_REFS_SOFT_DIRTY) { mmu_notifier_invalidate_range_end(&range); flush_tlb_mm(mm); + for (vma = mm->mmap; vma; vma = vma->vm_next) + dec_wrprotect_tlb_flush_pending(vma); dec_tlb_flush_pending(mm); } mmap_read_unlock(mm); diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..caa1d9a71cb2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3177,5 +3177,51 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping, extern int sysctl_nr_trim_pages; +/* + * NOTE: the mmap_lock must be hold and it cannot be released at any + * time in between inc_wrprotect_tlb_flush_pending() and + * dec_wrprotect_tlb_flush_pending(). + * + * This counter has to be elevated before taking the PT-lock to + * wrprotect pagetables, if the TLB isn't flushed before the + * PT-unlock. + * + * The only reader is the page fault so this has to be elevated (in + * addition of the mm->tlb_flush_pending) only when the mmap_read_lock + * is taken instead of the mmap_write_lock (otherwise the page fault + * couldn't run concurrently in the first place). + * + * This doesn't need to be elevated when clearing pagetables even if + * only holding the mmap_read_lock (as in MADV_DONTNEED). The page + * fault doesn't risk to access the data of the page that is still + * under tlb-gather deferred flushing, if the pagetable is none, + * because the pagetable doesn't point to it anymore. 
+ * + * This counter is read more specifically by the page fault when it + * has to issue a COW that doesn't result in page re-use because of + * the lack of stability of the page_count (vs speculative pagecache + * lookups) or because of a GUP pin exist on an exclusive and writable + * anon page. + * + * If this counter is elevated and the pageteable is wrprotected (as + * in !pte/pmd_write) and present, it means the page may be still + * modified by userland through a stale TLB entry that was + * instantiated before the wrprotection. In such case the COW fault, + * if it decides not to re-use the page, will have to either wait this + * counter to return zero, or it needs to flush the TLB before + * proceeding copying the page. + */ +static inline void inc_wrprotect_tlb_flush_pending(struct vm_area_struct *vma) +{ + atomic_inc(&vma->wrprotect_tlb_flush_pending); + VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) <= 0); +} + +static inline void dec_wrprotect_tlb_flush_pending(struct vm_area_struct *vma) +{ + atomic_dec(&vma->wrprotect_tlb_flush_pending); + VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) < 0); +} + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 07d9acb5b19c..e3f412c43c30 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -369,6 +369,11 @@ struct vm_area_struct { struct mempolicy *vm_policy; /* NUMA policy for the VMA */ #endif struct vm_userfaultfd_ctx vm_userfaultfd_ctx; + /* + * When elevated, it indicates that a deferred TLB flush may + * be pending after a pagetable write protection on this vma. 
+ */ + atomic_t wrprotect_tlb_flush_pending; } __randomize_layout; struct core_thread { diff --git a/kernel/fork.c b/kernel/fork.c index 37720a6d04ea..7a658c608f3a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -365,6 +365,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) *new = data_race(*orig); INIT_LIST_HEAD(&new->anon_vma_chain); new->vm_next = new->vm_prev = NULL; + VM_WARN_ON_ONCE(atomic_read(&new->wrprotect_tlb_flush_pending)); } return new; } diff --git a/mm/memory.c b/mm/memory.c index feff48e1465a..e8e407443119 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2809,6 +2809,73 @@ static inline void wp_page_reuse(struct vm_fault *vmf) count_vm_event(PGREUSE); } +/* + * This synchronization helper, by the time it returns, has to enforce + * there cannot be stale writable TLB entries left, on any page mapped + * as wrprotected in the pagetables in this vma range. + * + * This is normally used only by the COW fault immediately before + * copying the page to make it proof against stale TLB entries (as the + * one pictured below in CPU 2). + * + * CPU 0 CPU 1 CPU 2 + * ----- ----- ----- + * writable TLB instantiated + * mmap_lock_read + * inc_wrprotect_tlb_flush_pending() + * PT_lock + * wrprotect the pte + * PT unlock + * mmap_lock_read + * PT_lock + * vmf->orig_pte = pte + * do_wp_page() + * PT_unlock + * wp_page_copy() + * sync_wrprotect_tlb_flush_pending() + * found wrprotect_tlb_flush_pending elevated + * flush_tlb_page() + * writable TLB invalidated [1] + * kret of sync_wrprotect_tlb_flush_pending() + * cow_user_page() [2] + * + * The whole objective of the wrprotect_tlb_flush_pending atomic + * counter is to enforce [1] happens before [2] in the above sequence. + * + * Without having to alter the caller of this helper, it'd also be + * possible to replace the flush_tlb_page with a wait for + * wrprotect_tlb_flush_pending counter to return zero using the same + * logic as above. 
In such case the point [1] would be replaced by + * dec_wrprotect_tlb_flush_pending() happening in CPU 1. + * + * In terms of memory ordering guarantees: all page payload reads of + * page mapped by a wrprotected pagetable, executed after this + * function returns, must not be allowed to be reordered before the + * read of the wrprotect_tlb_flush_pending atomic counter at the start + * of the function. So this function has to provide acquire semantics + * to those page payload reads. + */ +static __always_inline +void sync_wrprotect_tlb_flush_pending(struct vm_area_struct *vma, + unsigned long address) +{ + int val = atomic_read(&vma->wrprotect_tlb_flush_pending); + if (val) { + /* + * flush_tlb_page() needs to deliver acquire semantics + * implicitly. Archs using IPIs to flush remote TLBs + * provide those with csd_lock_wait(). + */ + flush_tlb_page(vma, address); + } else { + /* + * Prevent the read of the wrprotect page payload to be + * reordered before the above atomic_read(). + */ + smp_rmb(); + } +} + /* * Handle the case of a page which we actually need to copy to a new page. 
* @@ -2849,6 +2916,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (!new_page) goto oom; + sync_wrprotect_tlb_flush_pending(vma, vmf->address); + if (!cow_user_page(new_page, old_page, vmf)) { /* * COW failed, if the fault was solved by other, diff --git a/mm/mprotect.c b/mm/mprotect.c index ab709023e9aa..6b7a52662de8 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -335,6 +335,8 @@ static unsigned long change_protection_range(struct vm_area_struct *vma, pgd = pgd_offset(mm, addr); flush_cache_range(vma, addr, end); inc_tlb_flush_pending(mm); + if (unlikely(cp_flags & MM_CP_UFFD_WP_ALL)) + inc_wrprotect_tlb_flush_pending(vma); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) @@ -346,6 +348,8 @@ static unsigned long change_protection_range(struct vm_area_struct *vma, /* Only flush the TLB if we actually modified any entries: */ if (pages) flush_tlb_range(vma, start, end); + if (unlikely(cp_flags & MM_CP_UFFD_WP_ALL)) + dec_wrprotect_tlb_flush_pending(vma); dec_tlb_flush_pending(mm); return pages; ^ permalink raw reply related [flat|nested] 96+ messages in thread
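The counter protocol in the patch above can be modeled in userspace. This is a minimal sketch, assuming C11 atomics stand in for the kernel's atomic_t and a plain counter stands in for flush_tlb_page(); struct vma_model and the model_* functions are hypothetical names for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the vma field added by the patch. */
struct vma_model {
	atomic_int wrprotect_tlb_flush_pending;
	int tlb_flushes;	/* counts modeled flush_tlb_page() calls */
};

static void model_init(struct vma_model *vma)
{
	atomic_init(&vma->wrprotect_tlb_flush_pending, 0);
	vma->tlb_flushes = 0;
}

/* Wrprotect side: elevate before taking the PT lock. */
static void model_inc_pending(struct vma_model *vma)
{
	int prev = atomic_fetch_add(&vma->wrprotect_tlb_flush_pending, 1);

	assert(prev >= 0);
}

/* Wrprotect side: drop only after the deferred TLB flush completed. */
static void model_dec_pending(struct vma_model *vma)
{
	int prev = atomic_fetch_sub(&vma->wrprotect_tlb_flush_pending, 1);

	assert(prev > 0);
}

/*
 * COW fault side: called right before copying the page. Returns true
 * if a flush had to be issued. The acquire load keeps later reads of
 * the page payload from being reordered before the counter read; the
 * patch gets that ordering from smp_rmb() or from the flush itself.
 */
static bool model_sync_pending(struct vma_model *vma)
{
	if (atomic_load_explicit(&vma->wrprotect_tlb_flush_pending,
				 memory_order_acquire) != 0) {
		vma->tlb_flushes++;	/* stand-in for flush_tlb_page() */
		return true;
	}
	return false;
}
```

A fault that finds the counter elevated pays for one flush before the copy; a fault that finds it at zero pays only for the ordered load, which is the cheap common case the patch is aiming for.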
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 20:04 ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli @ 2021-01-07 20:17 ` Linus Torvalds 2021-01-07 20:25 ` Linus Torvalds 2021-01-07 20:58 ` Andrea Arcangeli 2021-01-07 21:36 ` kernel test robot 1 sibling, 2 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 20:17 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > However there are two cases that could wrprotecting exclusive anon > pages with only the mmap_read_lock: I still think the real fix is "Don't do that then", and just take the write lock. The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a normal operation. Same goes for softdirty. Why have those become _so_ magical that they can break the VM for everybody else? Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 20:17 ` Linus Torvalds @ 2021-01-07 20:25 ` Linus Torvalds 2021-01-07 20:58 ` Andrea Arcangeli 1 sibling, 0 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 20:25 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 12:17 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I still think the real fix is "Don't do that then", and just take the > write lock. The alternative, of course, is to just make sure the page table flush is done inside the page table lock (and then we make COW do the copy inside of it). But this whole "we know UFFD breaks all rules, we'll add even more crap to it" approach is horrendous. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 20:17 ` Linus Torvalds 2021-01-07 20:25 ` Linus Torvalds @ 2021-01-07 20:58 ` Andrea Arcangeli 2021-01-07 21:29 ` Linus Torvalds 1 sibling, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 20:58 UTC (permalink / raw) To: Linus Torvalds Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai Hi Linus, On Thu, Jan 07, 2021 at 12:17:40PM -0800, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > However there are two cases that could wrprotecting exclusive anon > > pages with only the mmap_read_lock: > > I still think the real fix is "Don't do that then", and just take the > write lock. > > The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a > normal operation. Same goes for softdirty. > > Why have those become _so_ magical that they can break the VM for > everybody else? I see what you mean above and I agree. Like said below: == In simple terms: the page_count check in do_wp_page makes it impossible to wrprotect memory, if such memory is under a !FOLL_WRITE GUP pin. == So to simplify let's ignore UFFDIO_WRITEPROTECT here, which is new and adds no dependency on top of clear_refs in this respect. So yes, if we drop any code that has to wrprotect memory in place in the kernel (since all userland memory can be under GUP pin in read mode) and we make such an operation illegal, then it's fine, but that means clear_refs has to go or it has to fail if there's a GUP pin during the wrprotection. The problem is it's not even possible to detect reliably if there's really a long term GUP pin because of speculative pagecache lookups. 
We would need to declare that any secondary MMU which is supposed to be VM-neutral using mmu notifier, like a GPU or a RDMA device, cannot be used in combination with clear_refs and it would need to be enforced fully in userland. Most mmu notifier users drop the GUP pin during the invalidate for extra safety in case an invalidate goes missing: they would all need to drop FOLL_GET to be compliant and stop causing memory corruption if clear_refs shall be still allowed to happen on mmu notifier capable secondary MMUs. Even then how does userland know which devices attach to the memory with mmu notifier and never use FOLL_GET, and which don't? It doesn't sound reliable to enforce this in userland. So I don't see how clear_refs can be saved. Now let's make another example that still shows at least some fundamental inefficiency that has nothing to do with clear_refs. Let's suppose a GUP pin is taken on a subpageA by a RDMA device by process A (parent). Let's now assume subpageB is mapped in process B (child of process A). Both subpageA and subpageB are exclusive (mapcount == 1) and read-write but they share the same page_count atomic counter (only the page_mapcounts are subpage granular). To still tame the zygote concern with the page_count in do_wp_page, then process B when it forks a child (process C) would forever have to do an extra superfluous COW even after process C exits. Is that what we want on top of the fundamental unsafety added to clear_refs? Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 20:58 ` Andrea Arcangeli @ 2021-01-07 21:29 ` Linus Torvalds 2021-01-07 21:53 ` John Hubbard 2021-01-07 22:31 ` Andrea Arcangeli 0 siblings, 2 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 21:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > The problem is it's not even possible to detect reliably if there's > really a long term GUP pin because of speculative pagecache lookups. So none of the normal code _needs_ that any more these days, which is what I think is so nice. Any pinning will do the COW, and then we have the logic to make sure it stays writable, and that keeps everything nicely coherent and is all fairly simple. And yes, it does mean that if somebody then explicitly write-protects a page, it may end up being COW'ed after all, but if you first pinned it, and then started playing with the protections of that page, why should you be surprised? So to me, this sounds like a "don't do that then" situation. Anybody who does page pinning and wants coherency should NOT TOUCH THE MAPPING IT PINNED. (And if you do touch it, it's your own fault, and you get to keep both of the broken pieces) Now, I do agree that from a QoI standpoint, it would be really lovely if we actually enforced it. I'm not entirely sure we can, but maybe it would be reasonable to use that mm->has_pinned && page_maybe_dma_pinned(page) at least as the beginning of a heuristic. In fact, I do think that "page_maybe_dma_pinned()" could possibly be made stronger than it is. 
Because at *THAT* point, we might say "we know a pinned page always must have a page_mapcount() of 1" - since as part of pinning it and doing the GUP_PIN, we forced the COW, and then subsequent fork() operations enforce it too. So I do think that it might be possible to make that clear_refs code notice "this page is pinned, I can't mark it WP without the pinning coherency breaking". It might not even be hard. But admittedly I'm somewhat handwaving here, and I might not have thought of some situation. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
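The heuristic suggested above, mm->has_pinned && page_maybe_dma_pinned(page), can be sketched in userspace. This is a rough model under stated assumptions: a FOLL_PIN style pin is assumed to add a large bias (1024 here) to the page reference count, so a count at or above the bias means the page may be pinned. struct page_model, struct mm_model and every sk_* name are hypothetical and exist only for this sketch; the kernel's real helpers deal with compound pages and other details left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed pin bias for this sketch: each pin adds this to the refcount. */
#define SK_GUP_PIN_COUNTING_BIAS (1u << 10)

/* Hypothetical, heavily reduced stand-ins for struct page and mm_struct. */
struct page_model {
	unsigned int refcount;
};

struct mm_model {
	bool has_pinned;	/* set once this mm has pinned any page */
};

/* "Maybe": many ordinary references are indistinguishable from a pin. */
static bool sk_page_maybe_dma_pinned(const struct page_model *page)
{
	return page->refcount >= SK_GUP_PIN_COUNTING_BIAS;
}

/* True when write-protecting the page risks breaking pin coherency. */
static bool sk_wrprotect_may_break_pin(const struct mm_model *mm,
				       const struct page_model *page)
{
	return mm->has_pinned && sk_page_maybe_dma_pinned(page);
}

/* Usage example: one mapping reference plus one pin trips the check. */
static void sk_demo(void)
{
	struct mm_model mm = { .has_pinned = true };
	struct page_model pinned = { .refcount = 1 + SK_GUP_PIN_COUNTING_BIAS };
	struct page_model plain = { .refcount = 1 };

	assert(sk_wrprotect_may_break_pin(&mm, &pinned));
	assert(!sk_wrprotect_may_break_pin(&mm, &plain));
}
```

The check can return false positives, which is why it is only a starting point: a caller such as clear_refs would have to treat a positive answer as "skip or fail the write protection", not as proof that a pin exists.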
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 21:29 ` Linus Torvalds @ 2021-01-07 21:53 ` John Hubbard 2021-01-07 22:00 ` Linus Torvalds 2021-01-15 11:27 ` Jan Kara 2021-01-07 22:31 ` Andrea Arcangeli 1 sibling, 2 replies; 96+ messages in thread From: John Hubbard @ 2021-01-07 21:53 UTC (permalink / raw) To: Linus Torvalds, Andrea Arcangeli Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On 1/7/21 1:29 PM, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote: >> >> The problem is it's not even possible to detect reliably if there's >> really a long term GUP pin because of speculative pagecache lookups. > > So none of the normal code _needs_ that any more these days, which is > what I think is so nice. Any pinning will do the COW, and then we have > the logic to make sure it stays writable, and that keeps everything > nicely coherent and is all fairly simple. > > And yes, it does mean that if somebody then explicitly write-protects > a page, it may end up being COW'ed after all, but if you first pinned > it, and then started playing with the protections of that page, why > should you be surprised? > > So to me, this sounds like a "don't do that then" situation. > > Anybody who does page pinning and wants coherency should NOT TOUCH THE > MAPPING IT PINNED. > > (And if you do touch it, it's your own fault, and you get to keep both > of the broken pieces) > > Now, I do agree that from a QoI standpoint, it would be really lovely > if we actually enforced it. 
I'm not entirely sure we can, but maybe it > would be reasonable to use that > > mm->has_pinned && page_maybe_dma_pinned(page) > > at least as the beginning of a heuristic. > > In fact, I do think that "page_maybe_dma_pinned()" could possibly be > made stronger than it is. Because at *THAT* point, we might say "we What exactly did you have in mind, to make it stronger? I think the answer is in this email but I don't quite see it yet... Also, now seems to be a good time to mention that I've been thinking about a number of pup/gup pinning cases (Direct IO, GPU/NIC, NVMe/storage peer to peer with GUP/NIC, and HMM support for atomic operations from a device). And it seems like the following approach would help: * Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here (thanks to Jason for this point) means "user space owns the lifetime". We might even end up deleting either FOLL_PIN or FOLL_LONGTERM, because this would make them mean the same thing. The idea is that there are no "short term" pins of this kind of memory. * Continue to use FOLL_GET (only) for Direct IO. That's a big change of plans, because several of us had thought that Direct IO needs FOLL_PIN. However, this recent conversation, plus my list of cases above, seems to indicate otherwise. That's because we only have one refcount approach for marking pages in this way, and we should spend it on the long-term pinned pages. Those are both hard to identify otherwise, and actionable once we identify them. Direct IO pins, on the other hand, are more transient. We can probably live without tagging Direct IO pages as FOLL_PIN. I think. This is all assuming that we make progress in the area of "if it's not a page_maybe_dma_pinned() page, then we can wait for it or otherwise do reasonable things about the refcount". So we end up with a clear (-ish) difference between pages that can be waited for, and pages that should not be waited for in the kernel. 
I hope this helps, but if it's too much of a side-track, please disregard. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 21:53 ` John Hubbard @ 2021-01-07 22:00 ` Linus Torvalds 2021-01-07 22:14 ` John Hubbard 2021-01-15 11:27 ` Jan Kara 1 sibling, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 22:00 UTC (permalink / raw) To: John Hubbard Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 1:53 PM John Hubbard <jhubbard@nvidia.com> wrote: > > > > > Now, I do agree that from a QoI standpoint, it would be really lovely > > if we actually enforced it. I'm not entirely sure we can, but maybe it > > would be reasonable to use that > > > > mm->has_pinned && page_maybe_dma_pinned(page) > > > > at least as the beginning of a heuristic. > > > > In fact, I do think that "page_maybe_dma_pinned()" could possibly be > > made stronger than it is. Because at *THAT* point, we might say "we > > What exactly did you have in mind, to make it stronger? I think the > answer is in this email but I don't quite see it yet... Literally just adding a " && page_mapcount(page) == 1" in there (probably best done inside page_maybe_dma_pinned() itself) > Direct IO pins, on the other hand, are more transient. We can probably live > without tagging Direct IO pages as FOLL_PIN. I think. Yes. I think direct-IO writes should be able to just do a transient GUP, and if it causes a COW fault that isn't coherent, that's the correct semantics, I think (ie the direct-IO will see the original data, the COW faulter will get its own private copy to make changes to). 
I think pinning should be primarily limited to things that _require_ coherency (ie you pin because you're going to do some active two-way communication using that page) Does that match your thinking? Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
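The transient-GUP semantics described in the message above can be illustrated with a minimal userspace sketch. It is not kernel code: `struct model_page`, `model_gup_get()` and `model_write_fault()` are hypothetical names, and the "reuse only when the faulter holds the sole reference" rule is a simplified stand-in for the page_count check in do_wp_page() that this thread discusses.

```c
#include <assert.h>
#include <string.h>

#define MODEL_PAGE_BYTES 8	/* tiny stand-in for a real page */

/* Hypothetical userspace model of a page; not a kernel structure. */
struct model_page {
	int refcount;			/* mapping reference plus any GUP references */
	char data[MODEL_PAGE_BYTES];
};

/* Transient GUP (e.g. a direct-IO write): take an extra reference. */
static void model_gup_get(struct model_page *page)
{
	page->refcount++;
}

/*
 * Write fault on a write-protected mapping. The page is reused only if
 * the mapping holds the sole reference; otherwise the faulter gets a
 * private copy in 'spare' and the original stays with the other holder.
 */
static struct model_page *model_write_fault(struct model_page *mapped,
					    struct model_page *spare)
{
	assert(mapped->refcount >= 1);

	if (mapped->refcount == 1)
		return mapped;

	memcpy(spare->data, mapped->data, MODEL_PAGE_BYTES);
	spare->refcount = 1;
	mapped->refcount--;	/* the mapping's reference moves to the copy */
	return spare;
}
```

In this model, a direct-IO reader that called model_gup_get() before the fault keeps seeing the original bytes, while whatever the faulter writes afterwards lands only in its private copy.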
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 22:00 ` Linus Torvalds @ 2021-01-07 22:14 ` John Hubbard 2021-01-07 22:20 ` Linus Torvalds 0 siblings, 1 reply; 96+ messages in thread From: John Hubbard @ 2021-01-07 22:14 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On 1/7/21 2:00 PM, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 1:53 PM John Hubbard <jhubbard@nvidia.com> wrote: >> >>> >>> Now, I do agree that from a QoI standpoint, it would be really lovely >>> if we actually enforced it. I'm not entirely sure we can, but maybe it >>> would be reasonable to use that >>> >>> mm->has_pinned && page_maybe_dma_pinned(page) >>> >>> at least as the beginning of a heuristic. >>> >>> In fact, I do think that "page_maybe_dma_pinned()" could possibly be >>> made stronger than it is. Because at *THAT* point, we might say "we >> >> What exactly did you have in mind, to make it stronger? I think the >> answer is in this email but I don't quite see it yet... > > Literally just adding a " && page_mapcount(page) == 1" in there > (probably best done inside page_maybe_dma_pinned() itself) Well, that means that pages that are used for pinned DMA like this, can not be shared with other processes. Is that an acceptable limitation for the RDMA users? It seems a bit constraining, at first glance anyway. > >> Direct IO pins, on the other hand, are more transient. We can probably live >> without tagging Direct IO pages as FOLL_PIN. I think. > > Yes. 
I think direct-IO writes should be able to just do a transient > GUP, and if it causes a COW fault that isn't coherent, that's the > correct semantics, I think (ie the direct-IO will see the original > data, the COW faulter will get it's own private copy to make changes > to). > > I think pinning should be primarily limited to things that _require_ > coherency (ie you pin because you're going to do some active two-way > communication using that page) > > Does that match your thinking? > Yes, perfectly. I'm going to update Documentation/core-api/pin_user_pages.rst accordingly, once the dust settles on these discussions. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 22:14 ` John Hubbard @ 2021-01-07 22:20 ` Linus Torvalds 2021-01-07 22:24 ` Linus Torvalds 0 siblings, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 22:20 UTC (permalink / raw) To: John Hubbard Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 2:14 PM John Hubbard <jhubbard@nvidia.com> wrote: > > > > Literally just adding a " && page_mapcount(page) == 1" in there > > (probably best done inside page_maybe_dma_pinned() itself) > > Well, that means that pages that are used for pinned DMA like this, can > not be shared with other processes. Is that an acceptable limitation > for the RDMA users? It seems a bit constraining, at first glance anyway. Hmm, add a check for the page being PageAnon(), perhaps? If it's a shared vma, then the page can be pinned shared with multiple mappings, I agree. So yeah, I didn't think it through entirely.. And maybe I'm still missing something else.. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 22:20 ` Linus Torvalds @ 2021-01-07 22:24 ` Linus Torvalds 2021-01-07 22:37 ` John Hubbard 0 siblings, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 22:24 UTC (permalink / raw) To: John Hubbard Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 2:20 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Hmm, add a check for the page being PageAnon(), perhaps? > > If it's a shared vma, then the page can be pinned shared with multiple > mappings, I agree. Or check the vma directly for whether it's a COW vma. That's probably a more obvious test, but would have to be done outside of page_maybe_dma_pinned(). For example, in copy_present_page(), we've already done that COW-vma test, so if we want to strengthen just _that_ test, then it would be sufficient to just add a /* This cannot be a pinned page if it has more than one mapping */ if (page_mappings(page) != 1) return 1; to copy_present_page() along with the existing page_maybe_dma_pinned() test. No? Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 22:24 ` Linus Torvalds @ 2021-01-07 22:37 ` John Hubbard 0 siblings, 0 replies; 96+ messages in thread From: John Hubbard @ 2021-01-07 22:37 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On 1/7/21 2:24 PM, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 2:20 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Hmm, add a check for the page being PageAnon(), perhaps? >> >> If it's a shared vma, then the page can be pinned shared with multiple >> mappings, I agree. > > Or check the vma directly for whether it's a COW vma. That's probably > a more obvious test, but would have to be done outside of > page_maybe_dma_pinned(). > > For example, in copy_present_page(), we've already done that COW-vma > test, so if we want to strengthen just _that_ test, then it would be > sufficient to just add a > > /* This cannot be a pinned page if it has more than one mapping */ > if (page_mappings(page) != 1) > return 1; > > to copy_present_page() along with the existing page_maybe_dma_pinned() test. > > No? > > Linus That approach makes me a lot happier, yes. Because it doesn't add constraints to the RDMA cases, but still does what we need here. thanks, -- John Hubbard NVIDIA ^ permalink raw reply [flat|nested] 96+ messages in thread
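As a rough illustration of the strengthened test agreed on above, here is a userspace sketch of the fork-time decision for a page in a COW vma. Everything in it is hypothetical: `struct model_page` and `model_copy_at_fork()` are made-up names, the `maybe_dma_pinned` flag stands in for page_maybe_dma_pinned(), and the mapping count stands in for what the quoted snippet calls page_mappings(). The real copy_present_page() has more cases than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not kernel types or APIs. */
struct model_page {
	int mapcount;		/* how many page tables map this page */
	bool maybe_dma_pinned;	/* stand-in for page_maybe_dma_pinned() */
};

/*
 * Models the fork-time decision for a page in a COW vma: return true if
 * the child should get its own copy right away, false if the page can be
 * shared write-protected as usual.
 */
static bool model_copy_at_fork(const struct model_page *page, bool mm_has_pinned)
{
	assert(page->mapcount >= 1);	/* the parent maps it, at minimum */

	if (!mm_has_pinned)
		return false;
	/* More than one mapping: under the stated rule it cannot be pinned. */
	if (page->mapcount != 1)
		return false;
	return page->maybe_dma_pinned;
}
```

Keeping the mapcount test next to the COW-vma check, rather than inside the pin heuristic itself, is what leaves shared mappings used for RDMA unconstrained in this model.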
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 21:53 ` John Hubbard 2021-01-07 22:00 ` Linus Torvalds @ 2021-01-15 11:27 ` Jan Kara 1 sibling, 0 replies; 96+ messages in thread From: Jan Kara @ 2021-01-15 11:27 UTC (permalink / raw) To: John Hubbard Cc: Linus Torvalds, Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu 07-01-21 13:53:18, John Hubbard wrote: > On 1/7/21 1:29 PM, Linus Torvalds wrote: > > On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > > > The problem is it's not even possible to detect reliably if there's > > > really a long term GUP pin because of speculative pagecache lookups. > > > > So none of the normal code _needs_ that any more these days, which is > > what I think is so nice. Any pinning will do the COW, and then we have > > the logic to make sure it stays writable, and that keeps everything > > nicely coherent and is all fairly simple. > > > > And yes, it does mean that if somebody then explicitly write-protects > > a page, it may end up being COW'ed after all, but if you first pinned > > it, and then started playing with the protections of that page, why > > should you be surprised? > > > > So to me, this sounds like a "don't do that then" situation. > > > > Anybody who does page pinning and wants coherency should NOT TOUCH THE > > MAPPING IT PINNED. > > > > (And if you do touch it, it's your own fault, and you get to keep both > > of the broken pieces) > > > > Now, I do agree that from a QoI standpoint, it would be really lovely > > if we actually enforced it. 
I'm not entirely sure we can, but maybe it > > would be reasonable to use that > > > > mm->has_pinned && page_maybe_dma_pinned(page) > > > > at least as the beginning of a heuristic. > > > > In fact, I do think that "page_maybe_dma_pinned()" could possibly be > > made stronger than it is. Because at *THAT* point, we might say "we > > What exactly did you have in mind, to make it stronger? I think the > answer is in this email but I don't quite see it yet... > > Also, now seems to be a good time to mention that I've been thinking about > a number of pup/gup pinning cases (Direct IO, GPU/NIC, NVMe/storage peer > to peer with GUP/NIC, and HMM support for atomic operations from a device). > And it seems like the following approach would help: > > * Use pin_user_pages/FOLL_PIN for long-term pins. Long-term here (thanks > to Jason for this point) means "user space owns the lifetime". We might > even end up deleting either FOLL_PIN or FOLL_LONGTERM, because this would > make them mean the same thing. The idea is that there are no "short term" > pins of this kind of memory. > > * Continue to use FOLL_GET (only) for Direct IO. That's a big change of plans, > because several of us had thought that Direct IO needs FOLL_PIN. However, this > recent conversation, plus my list of cases above, seems to indicate otherwise. > That's because we only have one refcount approach for marking pages in this way, > and we should spend it on the long-term pinned pages. Those are both hard to > identify otherwise, and actionable once we identify them. Somewhat late to the game but I disagree here. I think direct IO still needs FOLL_PIN so that page_maybe_dma_pinned() returns true for it. At least for shared pages. Because filesystems/mm in the writeback path need to detect whether the page is pinned and thus its contents can change anytime without noticing, the page can be dirtied at random times etc. 
In that case we need to bounce the page during writeback (to avoid checksum failures), keep page as dirty in internal filesystem bookkeeping (and in MM as well) etc... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 96+ messages in thread
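The bouncing argument above can be sketched in userspace as follows. This is only a model under stated assumptions: `struct model_writeback`, the 64-byte "page" and the additive checksum are all invented for illustration and do not correspond to any filesystem's actual writeback code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_BYTES 64	/* tiny stand-in for a real page size */

/* Hypothetical writeback request; not a kernel structure. */
struct model_writeback {
	unsigned char bounce[MODEL_PAGE_BYTES];	/* stable private copy */
	uint32_t checksum;			/* computed over the copy */
};

/* Toy additive checksum, only for illustration. */
static uint32_t model_checksum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Snapshot the live page and checksum the snapshot, not the live page. */
static void model_writeback_prepare(struct model_writeback *wb,
				    const unsigned char *live_page)
{
	assert(live_page != NULL);
	memcpy(wb->bounce, live_page, MODEL_PAGE_BYTES);
	wb->checksum = model_checksum(wb->bounce, MODEL_PAGE_BYTES);
}

/* What the storage side would check: written data against the checksum. */
static bool model_writeback_verify(const struct model_writeback *wb)
{
	return model_checksum(wb->bounce, MODEL_PAGE_BYTES) == wb->checksum;
}
```

If a pinning device modifies the live page after model_writeback_prepare(), the bounce copy and its checksum still agree, whereas a checksum taken over the live page at submission time would no longer match the data.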
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 21:29 ` Linus Torvalds 2021-01-07 21:53 ` John Hubbard @ 2021-01-07 22:31 ` Andrea Arcangeli 2021-01-07 22:42 ` Linus Torvalds 1 sibling, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 22:31 UTC (permalink / raw) To: Linus Torvalds Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 01:29:43PM -0800, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > The problem is it's not even possible to detect reliably if there's > > really a long term GUP pin because of speculative pagecache lookups. > > So none of the normal code _needs_ that any more these days, which is > what I think is so nice. Any pinning will do the COW, and then we have > the logic to make sure it stays writable, and that keeps everything > nicely coherent and is all fairly simple. > > And yes, it does mean that if somebody then explicitly write-protects > a page, it may end up being COW'ed after all, but if you first pinned > it, and then started playing with the protections of that page, why > should you be surprised? > > So to me, this sounds like a "don't do that then" situation. > > Anybody who does page pinning and wants coherency should NOT TOUCH THE > MAPPING IT PINNED. > > (And if you do touch it, it's your own fault, and you get to keep both > of the broken pieces) > > Now, I do agree that from a QoI standpoint, it would be really lovely > if we actually enforced it. 
I'm not entirely sure we can, but maybe it > would be reasonable to use that > > mm->has_pinned && page_maybe_dma_pinned(page) > > at least as the beginning of a heuristic. > > In fact, I do think that "page_maybe_dma_pinned()" could possibly be > made stronger than it is. Because at *THAT* point, we might say "we > know a pinned page always must have a page_mapcount() of 1" - since as > part of pinning it and doing the GUP_PIN, we forced the COW, and then > subsequent fork() operations enforce it too. > > So I do think that it might be possible to make that clear_refs code > notice "this page is pinned, I can't mark it WP without the pinning > coherency breaking". > > It might not even be hard. But admittedly I'm somewhat handwaving > here, and I might not have thought of some situation. I suppose the objective would be to detect when it's a transient pin (as an O_DIRECT write) and fail clear_refs with an error for all other cases of real long term pins that need to keep reading at full PCI bandwidth, without extra GUP invocations after the wp_page_copy run. Because the speculative lookups are making the count unstable, it's not even enough to use mmu notifier and never use FOLL_GET in GUP to make it safe again (unlike what I mentioned in a previous email). Random memory corruption will still silently materialize as result of the speculative lookups in the above scenario. My whole point in starting this new thread, to suggest that page_count in do_wp_page is an untenable solution, is that this commit silently broke every single long term PIN user that may be used in combination with clear_refs since 2013. Undetected silent memory corruption and a detectable error out of clear_refs are both side effects of the same technical ABI break that rendered long term pins fundamentally incompatible with clear_refs, so whether it is detected or not, it is still an ABI break. 
I felt obliged to report there's something much deeper and fundamentally incompatible between the page_count in do_wp_page and any wrprotection of exclusive memory in place, for example when used in combination with any RDMA. The TLB flushing and the mmap_read/write_lock were just the tip of the iceberg and they're not the main concern anymore. In addition, the inefficiency caused by the fact that the page_count effect is multiplied by 512 or 262144 times while the mapcount remains 4k granular makes me think that page_count is unsuitable to be used there, and this specific cure with page_count in do_wp_page looks worse than the initial zygote disease. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
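The factors 512 and 262144 quoted above match the number of 4 KiB base pages in a 2 MiB and a 1 GiB huge page. The sketch below just works out that arithmetic; the page sizes are an assumption (x86-64 values, which the message does not name explicitly) and the macro and function names are hypothetical.

```c
#include <assert.h>

/* Assumed x86-64 page sizes; other architectures differ. */
#define MODEL_BASE_PAGE_SIZE	(4UL << 10)	/* 4 KiB */
#define MODEL_PMD_PAGE_SIZE	(2UL << 20)	/* 2 MiB */
#define MODEL_PUD_PAGE_SIZE	(1UL << 30)	/* 1 GiB */

/* Number of 4 KiB base pages covered by one huge page of the given size. */
static unsigned long model_base_pages_per(unsigned long huge_page_size)
{
	assert(huge_page_size % MODEL_BASE_PAGE_SIZE == 0);
	return huge_page_size / MODEL_BASE_PAGE_SIZE;
}
```

With these sizes, model_base_pages_per(MODEL_PMD_PAGE_SIZE) is 2^21 / 2^12 = 512 and model_base_pages_per(MODEL_PUD_PAGE_SIZE) is 2^30 / 2^12 = 262144, which is the granularity mismatch the message points at.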
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 22:31 ` Andrea Arcangeli @ 2021-01-07 22:42 ` Linus Torvalds 2021-01-07 22:51 ` Linus Torvalds 2021-01-07 23:28 ` Andrea Arcangeli 0 siblings, 2 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 22:42 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > Random memory corruption will still silently materialize as result of > the speculative lookups in the above scenario. Explain. Yes, you'll get random memory corruption if you keep doing wrprotect() without mmap_sem held for writing. But I thought we agreed earlier that that is a bug. And I thought the softdirty code already got it for writing. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 22:42 ` Linus Torvalds @ 2021-01-07 22:51 ` Linus Torvalds 2021-01-07 23:48 ` Andrea Arcangeli 2021-01-07 23:28 ` Andrea Arcangeli 1 sibling, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 22:51 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 2:42 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But I thought we agreed earlier that that is a bug. And I thought the > softdirty code already got it for writing. Ho humm. I had obviously not looked very much at that code. I had done a quick git grep, but now that I look closer, it *does* get the mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and then it does a mmap_write_downgrade(). So that's why I had looked more at the UFFD code, because that one was the one I was aware of doing this all with just the read lock. I _thought_ the softdirty code already took the write lock and wouldn't race with page faults. But I had missed that write_downgrade. So yeah, this code has the same issue. Anyway, the fix is - I think - the same I outlined earlier when I was talking about UFFD: take the thing for writing, so that you can't race. The alternate fix remains "make sure we always flush the TLB before releasing the page table lock, and make COW do the copy under the page table lock". But I really would prefer to just have this code follow all the usual rules, and if it does a write protect, then it should take the mmap_sem for writing. Why is that very simple rule so bad? 
(And see my unrelated but incidental note on it being a good idea to try to minimize latency by making sure we don't do any IO under the mmap lock - whether held for reading _or_ writing. Because I do think we can improve in that area, if you have some good test-case). Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 22:51 ` Linus Torvalds @ 2021-01-07 23:48 ` Andrea Arcangeli 2021-01-08 0:25 ` Linus Torvalds 0 siblings, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 23:48 UTC (permalink / raw) To: Linus Torvalds Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 02:51:24PM -0800, Linus Torvalds wrote: > Ho humm. I had obviously not looked very much at that code. I had done > a quick git grep, but now that I look closer, it *does* get the > mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and > then it does a mmap_write_downgrade(). > > So that's why I had looked more at the UFFD code, because that one was > the one I was aware of doing this all with just the read lock. I > _thought_ the softdirty code already took the write lock and wouldn't > race with page faults. > > But I had missed that write_downgrade. So yeah, this code has the same issue. I overlooked the same thing initially. It's only when I noticed it also used mmap_read_lock, that I figured that the group lock thingy uffd-wp ad-hoc solution, despite it was fully self contained thanks to the handle_userfault() catcher for the uffd-wp bit in the pagetable, wasn't worth it since uffd-wp could always use whatever clear_refs used to solve it. > Anyway, the fix is - I think - the same I outlined earlier when I was > talking about UFFD: take the thing for writing, so that you can't > race. Sure. > The alternate fix remains "make sure we always flush the TLB before > releasing the page table lock, and make COW do the copy under the page > table lock". 
But I really would prefer to just have this code follow The copy under PT lock isn't enough. Flush TLB before releasing is enough of course. Note also the patch in 2/2 patch that I sent: https://lkml.kernel.org/r/20210107200402.31095-3-aarcange@redhat.com 2/2 would have been my preferred solution for both and it works fine. All corruption that was trivially reproducible with heavy selftest program in the kernel, is all gone. If only the TLB pending issue was the only regression page_count in do_wp_page introduced, I would have never suggested we should re-evaluate it. It'd be a good tradeoff in such case, even if it'd penalize the soft-dirty runtime, especially if we were allowed to deploy 2/2 as a non-blocking solution. Until yesterday I fully intended to just propose 1/2 and 2/2, with a whole different cover letter, CC stable and close this case. > all the usual rules, and if it does a write protect, then it should > take the mmap_sem for writing. The problem isn't about performance anymore, the problem is a silent ABI break to long term PIN user attached to an mm under clear_refs. > Why is that very simple rule so bad? > > (And see my unrelated but incidental note on it being a good idea to > try to minimize latency by making surfe we don't do any IO under the > mmap lock - whether held for reading _or_ writing. Because I do think > we can improve in that area, if you have some good test-case). That would be great indeed. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 23:48 ` Andrea Arcangeli @ 2021-01-08 0:25 ` Linus Torvalds 2021-01-08 12:48 ` Will Deacon 0 siblings, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-08 0:25 UTC (permalink / raw) To: Andrea Arcangeli Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai [-- Attachment #1: Type: text/plain, Size: 3091 bytes --] On Thu, Jan 7, 2021 at 3:48 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > The alternate fix remains "make sure we always flush the TLB before > > releasing the page table lock, and make COW do the copy under the page > > table lock". But I really would prefer to just have this code follow > The copy under PT lock isn't enough. > > Flush TLB before releasing is enough of course. Right. That's why I said "and". You need both, afaik. But if we just do the mmap write lock, you need neither - then you just need to flush before you release the write lock. > Note also the patch in 2/2 patch that I sent: Yes, yes, and that's what I'm objecting to. All these stupid games with "flush_pending(" counts are complete garbage. They all come from the fact that this code doesn't hold the right lock. I don't understand you: in one breath you seem to say "yes, taking the write lock is the right thing to do", and then in the next one you point to this patch that adds all this garbage *because* it's not holding the write lock. All of those "tlb_flush_pending" things are wrong. They should not exist. The code in clear_refs_write() should hold the mmap_sem for writing, and do the TLB flush before it releases the mmap sem, and then it *cannot* race with the page faults. See what I'm saying? 
I refuse to apply your patch 2/2, because it all seems entirely wrong. What's doubly ludicrous about that is that the code already _took_ the mmap_sem for writing, and spent extra cycles to downgrade it - INCORRECTLY - to a read-lock. And as far as I can tell, it doesn't even do anything expensive inside that (now downgraded) region, so the downgrading was (a) buggy (b) slower than just keeping the lock the way it had and (b) is because it had already done the expensive part (which was to get the lock in the first place). Just as an example, the whole "Rollback wrprotect_tlb_flush_pending" is all because it got the lock - again wrongly - as a read-lock initially, then it says "oh, I need to get a write lock", releases it, re-takes it as a write lock, does a tiny amount of work, and then - again incorrectly - downgrades it to a read-lock. To make it even worse (if that is possible) it actually had YET ANOTHER case - that CLEAR_REFS_MM_HIWATER_RSS - where it took the mmap sem for writing, did its thing, and then released it. So there's like *four* different locking mistakes in that single function. And it's not even an important function to begin with. It should just have done a single mmap_write_lock_killable(mm); ... mmap_write_unlock(mm); around the whole thing, instead of _any_ of that crazy stuff. That code is WRONG. And your PATCH 2/2 makes that insane code EVEN WORSE. Why the heck is that magic fs/proc/ interface allowed to get VM internals so wrong, and make things so much worse? Can you not see why I'm arguing with you? Please. Why is the correct patch not the attached one (apart from the obvious fact that I haven't tested it and maybe just screwed up completely - but you get the idea)? 
Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 1811 bytes --]

 fs/proc/task_mmu.c | 32 +++++++++-----------------------
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee5a235b3056..ab7d700b2caa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1215,41 +1215,26 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			.type = type,
 		};
 
+		if (mmap_write_lock_killable(mm)) {
+			count = -EINTR;
+			goto out_mm;
+		}
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
-			if (mmap_write_lock_killable(mm)) {
-				count = -EINTR;
-				goto out_mm;
-			}
-
 			/*
 			 * Writing 5 to /proc/pid/clear_refs resets the peak
 			 * resident set size to this mm's current rss value.
 			 */
 			reset_mm_hiwater_rss(mm);
-			mmap_write_unlock(mm);
-			goto out_mm;
+			goto out_unlock;
 		}
 
-		if (mmap_read_lock_killable(mm)) {
-			count = -EINTR;
-			goto out_mm;
-		}
 		tlb_gather_mmu(&tlb, mm, 0, -1);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
 					continue;
-				mmap_read_unlock(mm);
-				if (mmap_write_lock_killable(mm)) {
-					count = -EINTR;
-					goto out_mm;
-				}
-				for (vma = mm->mmap; vma; vma = vma->vm_next) {
-					vma->vm_flags &= ~VM_SOFTDIRTY;
-					vma_set_page_prot(vma);
-				}
-				mmap_write_downgrade(mm);
-				break;
+				vma->vm_flags &= ~VM_SOFTDIRTY;
+				vma_set_page_prot(vma);
 			}
 
 			mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
@@ -1261,7 +1246,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_end(&range);
 		tlb_finish_mmu(&tlb, 0, -1);
-		mmap_read_unlock(mm);
+out_unlock:
+		mmap_write_unlock(mm);
 out_mm:
 		mmput(mm);
 	}

^ permalink raw reply related [flat|nested] 96+ messages in thread
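The locking rule argued for in the message above (write-protect and flush the TLB while holding the lock exclusively, so that a fault can never observe the in-between state) can be modelled with a small single-threaded sketch. It is a toy under stated assumptions: `struct model_mm` and all `model_*` functions are hypothetical, the "lock" is a pair of plain fields rather than a real rwsem, and a fault that cannot take the lock simply reports failure instead of sleeping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of an mm; not kernel code. */
struct model_mm {
	bool write_locked;		/* exclusive holder present */
	int readers;			/* shared holders, e.g. page faults */
	bool pte_write_protected;
	bool tlb_flush_pending;		/* stale writable TLB entries may exist */
};

static bool model_write_trylock(struct model_mm *mm)
{
	if (mm->write_locked || mm->readers > 0)
		return false;
	mm->write_locked = true;
	return true;
}

static bool model_read_trylock(struct model_mm *mm)
{
	if (mm->write_locked)
		return false;
	mm->readers++;
	return true;
}

/* Write-protect under the exclusive lock; the flush is still pending. */
static bool model_clear_refs_begin(struct model_mm *mm)
{
	if (!model_write_trylock(mm))
		return false;
	mm->pte_write_protected = true;
	mm->tlb_flush_pending = true;
	return true;
}

/* Flush first, then drop the exclusive lock. */
static void model_clear_refs_end(struct model_mm *mm)
{
	assert(mm->write_locked);
	mm->tlb_flush_pending = false;
	mm->write_locked = false;
}

/*
 * A page fault takes the lock shared. Returns false if it is excluded;
 * otherwise reports whether it saw a write-protected PTE whose flush
 * had not happened yet (the racy state).
 */
static bool model_fault(struct model_mm *mm, bool *saw_unflushed_wp)
{
	if (!model_read_trylock(mm))
		return false;
	*saw_unflushed_wp = mm->pte_write_protected && mm->tlb_flush_pending;
	mm->readers--;
	return true;
}
```

In this model a fault attempted between model_clear_refs_begin() and model_clear_refs_end() is excluded outright, and a fault that runs afterwards sees the PTE write-protected with no flush pending. Downgrading to a shared lock before the flush would let model_fault() run while tlb_flush_pending is still set, which is the race under discussion.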
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-08 0:25 ` Linus Torvalds @ 2021-01-08 12:48 ` Will Deacon 2021-01-08 16:14 ` Andrea Arcangeli 2021-01-08 17:30 ` Linus Torvalds 0 siblings, 2 replies; 96+ messages in thread From: Will Deacon @ 2021-01-08 12:48 UTC (permalink / raw) To: Linus Torvalds Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote: > Please. Why is the correct patch not the attached one (apart from the > obvious fact that I haven't tested it and maybe just screwed up > completely - but you get the idea)? It certainly looks simple and correct to me, although it means we're now taking the mmap sem for write in the case where we only want to clear the access flag, which should be fine with the thing only held for read, no? Will ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-08 12:48 ` Will Deacon @ 2021-01-08 16:14 ` Andrea Arcangeli 2021-01-08 17:39 ` Linus Torvalds 2021-01-08 17:30 ` Linus Torvalds 1 sibling, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-08 16:14 UTC (permalink / raw) To: Will Deacon Cc: Linus Torvalds, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai, Nadav Amit, Andrew Morton Hello everyone, On Fri, Jan 08, 2021 at 12:48:16PM +0000, Will Deacon wrote: > On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote: > > Please. Why is the correct patch not the attached one (apart from the > > obvious fact that I haven't tested it and maybe just screwed up > > completely - but you get the idea)? > > It certainly looks simple and correct to me, although it means we're now > taking the mmap sem for write in the case where we only want to clear the > access flag, which should be fine with the thing only held for read, no? I'm curious, would you also suggest that fixing just the TLB flushing symptom is enough and we can forget about the ABI break coming from page_count used in do_wp_page? One random example: clear_refs will still break all long term GUP pins, are you ok with that too? page_count in do_wp_page is a fix for the original security issue from vmsplice (where the child is fooling the parent in taking the exclusive page in do_wp_page), that appears worse than the bug itself. page_count in do_wp_page, instead of isolating as malicious when the parent is reusing the page queued in the vmsplice pipe, is treating as malicious also all legit cases that had to reliably reuse the page to avoid the secondary MMUs to go out of sync. 
page_count in do_wp_page is like a credit card provider blocking all credit cards of all customers, because one credit card may have been cloned (by vmsplice), but nobody can know which one it was. Of course this technique will work perfectly as a security fix because it will treat all credit card users as malicious and it'll block them all ("block as in preventing re-use of the anon page"). The problem is those other credit card users that weren't malicious and get their COW broken too. Those are the very long term GUP pins, if any anon page can still be wrprotected anywhere in the VM. At the same time the real low-hanging fruit (vmsplice) that, if taken care of, would have rendered the bug purely theoretical in security terms hasn't been fixed yet, even though those unprivileged long term GUP pins cause more reproducible security issues than just the COW, since they can still DoS the OOM killer and they bypass at least the mlock enforcement, even for non compound pages. Of course just fixing vmsplice to require some privilege won't fix the bug in full, so it's not a suitable long term solution, but it has to happen orthogonally for other reasons, and it'd at least remove the short term security concern. In addition you're not experiencing the full fallout of the side effects of page_count being used to decide whether to re-use all anon COW pages, because the bug is still there (with enterprise default config options at least). Not all credit cards are blocked yet with only 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 applied. Only after you block them all will you experience all the side effects of replacing the per-subpage fine-grained mapcount with the compound-wide page count. 
The two statements above combined result in my recommendation at this point to resolve this in userland by rendering the security issue theoretical, either by removing vmsplice from the OCI schema allowlist or by enforcing the fix in userland by always using execve after dropping privs (as crun always does when it starts the container of course). For the long term, I can't see how using page_count in do_wp_page is a tenable proposition, unless we either drop all secondary MMUs from the kernel or VM features like clear_refs are dropped or unless the page_count is magically stabilized and the speculative pagecache lookups are also dropped. If trying to manage the fallout by enforcing that no anon page can ever be wrprotected in place (i.e. dropping the clear_refs feature or rendering it unreliable by skipping elevated counts caused by spurious pagecache lookups), it'd still sound like too fragile a design, too prone to break, to rely on. There's random arch stuff wrprotecting memory too, even vm86 does it under the hood (vm86 is unlikely to have a long term GUP pin on it of course, but still who knows?). I mean the VM core cannot make assumptions like: "this vm86 case can still wrprotect without worry because probably vm86 isn't used anymore with any advanced secondary MMU, so if there's a GUP pin it probably is a malicious vmsplice and not a RDMA or GPU or Virtualization secondary MMU". Then there's the secondary concern of the inefficiency it introduces with extra unnecessary copies when a single GUP pin will prevent reuse of 512 or 262144 subpages, in the 512 case also potentially mapped in different processes. The TLB flushing discussion registers as the last concern in my view. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
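For reference, the "512 or 262144 subpages" figures in the message above presumably correspond to 2 MiB and 1 GiB huge pages built from 4 KiB base pages. Those sizes are an assumption here (they are the usual x86-64 values, the message does not name them), and the helper name is hypothetical. A minimal sketch of the arithmetic:

```c
#include <assert.h>

/* Assumed x86-64 page sizes; other architectures and configs differ. */
#define BASE_PAGE_SIZE (4UL << 10)  /* 4 KiB base page */
#define PMD_HUGE_SIZE  (2UL << 20)  /* 2 MiB huge page */
#define PUD_HUGE_SIZE  (1UL << 30)  /* 1 GiB huge page */

/*
 * Hypothetical helper: number of base-page-sized subpages that share
 * one compound page, and therefore one compound-wide page count.
 */
static unsigned long subpages_per_hugepage(unsigned long huge_size)
{
	/* A huge page is a whole number of base pages. */
	assert(huge_size % BASE_PAGE_SIZE == 0);
	return huge_size / BASE_PAGE_SIZE;
}
```

Under these assumptions, subpages_per_hugepage(PMD_HUGE_SIZE) gives the 512 case and subpages_per_hugepage(PUD_HUGE_SIZE) gives the 262144 case, which is the number of subpages a single elevated compound-wide count would affect.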
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-08 16:14 ` Andrea Arcangeli @ 2021-01-08 17:39 ` Linus Torvalds 2021-01-08 17:53 ` Andrea Arcangeli 0 siblings, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-08 17:39 UTC (permalink / raw) To: Andrea Arcangeli Cc: Will Deacon, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai, Nadav Amit, Andrew Morton On Fri, Jan 8, 2021 at 8:14 AM Andrea Arcangeli <aarcange@redhat.com> wrote: > > page_count in do_wp_page is a fix for the original security issue Not just that. page_count() is simply the right and efficient thing to do. You talk about all these theoretical inefficiencies for cases like zygote and page pinning, which have never ever been seen except as a possible attack vector. Stop talking about irrelevant things. Stop trying to "optimize" things that never happen and don't matter. Instead, what matters is the *NORMAL* VM flow. Things like COW. Things like "oh, now that we check just the page count, we don't even need the page lock for the common case any more". > For the long term, I can't see how using page_count in do_wp_page is a > tenable proposition, I think you should re-calibrate your expectations, and accept that page_count() is the right thing to do, and live with it. And instead of worrying about irrelevant special-case code, start worrying about the code that gets triggered tens of thousands of times a second, on regular loads, without anybody doing anything odd or special at all, just running plain and normal shell scripts or any other normal traditional load. Those irrelevant special cases should be simple and work, not badly optimized to the point where they are buggy. 
And they are MUCH LESS IMPORTANT than the normal VM code, so if somebody does something odd, and it's slow, then that is the problem for the _odd_ case, not for the normal codepaths. This is why I refuse to add crazy new special cases to core code. Make the rules simple and straightforward, and make the code VM work well. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-08 17:39 ` Linus Torvalds @ 2021-01-08 17:53 ` Andrea Arcangeli 2021-01-08 19:25 ` Linus Torvalds 0 siblings, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-08 17:53 UTC (permalink / raw) To: Linus Torvalds Cc: Will Deacon, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai, Nadav Amit, Andrew Morton On Fri, Jan 08, 2021 at 09:39:56AM -0800, Linus Torvalds wrote: > page_count() is simply the right and efficient thing to do. > > You talk about all these theoretical inefficiencies for cases like > zygote and page pinning, which have never ever been seen except as a > possible attack vector. Do you intend to eventually fix the zygote vmsplice case or not? Because in current upstream it's not fixed currently using the enterprise default config. > Stop talking about irrelevant things. Stop trying to "optimize" things > that never happen and don't matter. > > Instead, what matters is the *NORMAL* VM flow. > > Things like COW. > > Things like "oh, now that we check just the page count, we don't even > need the page lock for the common case any more". > > > For the long term, I can't see how using page_count in do_wp_page is a > > tenable proposition, > > I think you should re-calibrate your expectations, and accept that > page_count() is the right thing to do, and live with it. > > And instead of worrying about irrelevant special-case code, start Irrelevant special case as in: long term GUP pin on the memory? Or irrelevant special case as in: causing secondary MMU to hit silent data loss if a pte is ever wrprotected (arch code, vm86, whatever, all under mmap_write_lock of course). 
> worrying about the code that gets triggered tens of thousands of times > a second, on regular loads, without anybody doing anything odd or > special at all, just running plain and normal shell scripts or any > other normal traditional load. > > Those irrelevant special cases should be simple and work, not badly > optimized to the point where they are buggy. And they are MUCH LESS > IMPORTANT than the normal VM code, so if somebody does something odd, > and it's slow, then that is the problem for the _odd_ case, not for > the normal codepaths. > > This is why I refuse to add crazy new special cases to core code. Make > the rules simple and straightforward, and make the code VM work well. New special cases? Which new cases? There's nothing new here besides the zygote that wasn't fully fixed with 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 and is actually the only new case I can imagine where page_count actually isn't a regression. All the old cases that you seem to refer to as irrelevant are in production in v4.18; I don't see anything new here. Even for the pure COW case with zero GUP involvement, a hugepage with COWs happening in different processes would forever hit wp_page_copy, since count is always > 1 even though mapcount can be 1 for all subpages. A simple app doing fork/exec would forever copy all memory in the parent even after the exec is finished. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
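The disagreement in this sub-thread is about which counter decides whether a write fault may reuse an anon page in place or must copy it. The following is only a toy userland model of the two policies as they are described in the thread, not kernel code: struct toy_page and both function names are hypothetical, and the real do_wp_page() path involves page locking, swap cache and compound pages that are ignored here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; NOT the kernel's layout. */
struct toy_page {
	int count;    /* every reference: mappings, GUP pins, speculative lookups */
	int mapcount; /* page-table mappings only */
};

/*
 * Policy described for 09854ba94c6a and later: reuse in place only
 * when the faulting mapping holds the sole reference to the page.
 */
static bool reuse_by_page_count(const struct toy_page *p)
{
	assert(p->count >= p->mapcount); /* each mapping holds a reference */
	return p->count == 1;
}

/*
 * Simplified version of the older policy defended in the thread:
 * reuse in place when only one mapping exists, whatever other
 * references (such as a GUP pin) are held.
 */
static bool reuse_by_mapcount(const struct toy_page *p)
{
	assert(p->count >= p->mapcount);
	return p->mapcount == 1;
}
```

With a page that has one mapping plus one long term pin ({ .count = 2, .mapcount = 1 }), the first policy copies and the second reuses. That is the case where, per the messages above, a secondary MMU holding the pin would keep pointing at the old page after a write-protect fault.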
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-08 17:53 ` Andrea Arcangeli @ 2021-01-08 19:25 ` Linus Torvalds 2021-01-09 0:12 ` Andrea Arcangeli 0 siblings, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-08 19:25 UTC (permalink / raw) To: Andrea Arcangeli Cc: Will Deacon, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai, Nadav Amit, Andrew Morton On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli <aarcange@redhat.com> wrote: > > Do you intend to eventually fix the zygote vmsplice case or not? > Because in current upstream it's not fixed currently using the > enterprise default config. Is this the hugepage case? Neither of your patches actually touched that, so I've forgotten the details. > Irrelevant special case as in: long term GUP pin on the memory? Irrelevant special case in that (a) an extra COW shouldn't be a correctness issue unless somebody does something horribly wrong (and obviously the code that hasn't taken the mmap_lock for writing are then examples of that) and (b) it's not a performance issue either unless you can find a real load that does it. Hmm? Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-08 19:25 ` Linus Torvalds @ 2021-01-09 0:12 ` Andrea Arcangeli 0 siblings, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-09 0:12 UTC (permalink / raw) To: Linus Torvalds Cc: Will Deacon, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai, Nadav Amit, Andrew Morton On Fri, Jan 08, 2021 at 11:25:21AM -0800, Linus Torvalds wrote: > On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Do you intend to eventually fix the zygote vmsplice case or not? > > Because in current upstream it's not fixed currently using the > > enterprise default config. > > Is this the hugepage case? Neither of your patches actually touched > that, so I've forgotten the details. The two patches only fixed the TLB flushing deferral in clear_refs and uffd-wp. So I didn't actually try to fix the hugepage case by adding the page_count checks there too. I could try to do that, at least it'd be consistent, but I would still try to find an alternate solution later. > > Irrelevant special case as in: long term GUP pin on the memory? > > Irrelevant special case in that > > (a) an extra COW shouldn't be a correctness issue unless somebody > does something horribly wrong (and obviously the code that hasn't > taken the mmap_lock for writing are then examples of that) > > and > > (b) it's not a performance issue either unless you can find a real > load that does it. > > Hmm? For b) I don't have a hard time imagining `ps` hanging for seconds, if clear_refs is touched on a 4T mm, but b) is not the main concern. Having to rely on a) is the main concern and it's not about tlb flushes but the long term GUP pins. 
Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-08 12:48 ` Will Deacon 2021-01-08 16:14 ` Andrea Arcangeli @ 2021-01-08 17:30 ` Linus Torvalds 1 sibling, 0 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-08 17:30 UTC (permalink / raw) To: Will Deacon Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Fri, Jan 8, 2021 at 4:48 AM Will Deacon <will@kernel.org> wrote: > > It certainly looks simple and correct to me, although it means we're now > taking the mmap sem for write in the case where we only want to clear the > access flag, which should be fine with the thing only held for read, no? When I was looking at that code, I was thinking that the whole function should be split up to get rid of some of the indentation and the "goto out_mm". And yes, it would probably be good to split up up even more than that "initial mm lookup and error handling", and have an actual case statement for the different clear_ref 'type' cases. And then it would be fairly simple and clean to say "this case only needs the mmap_sem for read, that case needs it for write". So I don't disagree, but I think it should be a separate patch - if it even matters. Is this strange /proc case something that is even commonly done? Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
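The "actual case statement for the different clear_ref 'type' cases" suggested above could be sketched roughly as follows. This is a userland illustration under stated assumptions, not a proposed patch: enum lock_mode, clear_refs_lock_mode() and needs_write_lock() are hypothetical names, the enum values are written as they are believed to appear in fs/proc/task_mmu.c (the thread itself only confirms that writing 5 selects the hiwater-rss reset), and the read/write split simply follows what the thread says: clearing the access flag is fine under the read lock, while the soft-dirty path modifies vma->vm_flags and the existing hiwater-rss path already takes the lock for write.

```c
#include <assert.h>

/* Types accepted by /proc/pid/clear_refs (values assumed, see lead-in). */
enum clear_refs_types {
	CLEAR_REFS_ALL = 1,
	CLEAR_REFS_ANON,
	CLEAR_REFS_MAPPED,
	CLEAR_REFS_SOFT_DIRTY,
	CLEAR_REFS_MM_HIWATER_RSS,
	CLEAR_REFS_LAST,
};

/* Hypothetical: which flavour of mmap_lock a given type would take. */
enum lock_mode { LOCK_INVALID, LOCK_READ, LOCK_WRITE };

static enum lock_mode clear_refs_lock_mode(enum clear_refs_types type)
{
	switch (type) {
	case CLEAR_REFS_ALL:
	case CLEAR_REFS_ANON:
	case CLEAR_REFS_MAPPED:
		/* Only clears the accessed/referenced state. */
		return LOCK_READ;
	case CLEAR_REFS_SOFT_DIRTY:
		/* Clears VM_SOFTDIRTY in vma->vm_flags and write-protects PTEs. */
		return LOCK_WRITE;
	case CLEAR_REFS_MM_HIWATER_RSS:
		/* The existing code already takes the lock for write here. */
		return LOCK_WRITE;
	default:
		return LOCK_INVALID;
	}
}

/* Callers are expected to have range-checked 'type' already. */
static int needs_write_lock(enum clear_refs_types type)
{
	enum lock_mode mode = clear_refs_lock_mode(type);

	assert(mode != LOCK_INVALID);
	return mode == LOCK_WRITE;
}
```

Keeping the decision in one function would make the rule visible in a single place: a caller would pick mmap_read_lock_killable() or mmap_write_lock_killable() once, up front, and never upgrade or downgrade in the middle of the walk.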
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 22:42 ` Linus Torvalds 2021-01-07 22:51 ` Linus Torvalds @ 2021-01-07 23:28 ` Andrea Arcangeli 1 sibling, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 23:28 UTC (permalink / raw) To: Linus Torvalds Cc: Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 02:42:17PM -0800, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Random memory corruption will still silently materialize as result of > > the speculative lookups in the above scenario. > > Explain. > > Yes, you'll get random memory corruption if you keep doing wrprotect() > without mmap_sem held for writing. I didn't mean that. > But I thought we agreed earlier that that is a bug. And I thought the > softdirty code already got it for writing. softdirty used mmap_read_lock too but this again isn't relevant here and for the sake of discussion we can safely assume mmap_read_lock doesn't exist in the kernel, and everything takes the mmap_write_lock whenever a mmap_lock is taken at all. I mean something bad will happen if a write happens, but soft dirty cannot register it because we didn't wrprotect the pte? Some dirty page won't be transferred to destination and it will be assumed there was no soft dirty event for such page? Otherwise it would mean that wrprotecting is simply optional for all pages under clear_refs? 
Not doing the final TLB flush in softdirty caused some issue even when there was no COW and the deferred flush only would delay the wrprotect fault: https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=RDJNU6vGF3hLO+j-g@mail.gmail.com/ https://lore.kernel.org/linux-mm/20210105221628.GA12854@willie-the-truck/ Skipping the wrprotection of the pte because of a speculative pagecache lookup elevating a random page_count, from the userland point of view, I guessed would behave as missing the final TLB flush before clear_refs returns to userland, just worse. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending 2021-01-07 20:04 ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli 2021-01-07 20:17 ` Linus Torvalds @ 2021-01-07 21:36 ` kernel test robot 1 sibling, 0 replies; 96+ messages in thread From: kernel test robot @ 2021-01-07 21:36 UTC (permalink / raw) To: Andrea Arcangeli, linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon Cc: kbuild-all

Hi Andrea,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master hnaz-linux-mm/master v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Andrea-Arcangeli/page_count-can-t-be-used-to-decide-when-wp_page_copy/20210108-040616
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 09162bc32c880a791c6c0668ce0745cf7958f576
compiler: nds32le-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

"cppcheck warnings: (new ones prefixed by >>)"
>> fs/proc/task_mmu.c:1248:33: warning: Uninitialized variable: tmp [uninitvar]
    for (vma = mm->mmap; vma != tmp;
                                ^

vim +1248 fs/proc/task_mmu.c

  1183	
  1184	static ssize_t clear_refs_write(struct file *file, const char __user *buf,
  1185					size_t count, loff_t *ppos)
  1186	{
  1187		struct task_struct *task;
  1188		char buffer[PROC_NUMBUF];
  1189		struct mm_struct *mm;
  1190		struct vm_area_struct *vma;
  1191		enum clear_refs_types type;
  1192		int itype;
  1193		int rv;
  1194	
  1195		memset(buffer, 0, sizeof(buffer));
  1196		if (count > sizeof(buffer) - 1)
  1197			count = sizeof(buffer) - 1;
  1198		if (copy_from_user(buffer, buf, count))
  1199			return -EFAULT;
  1200		rv = kstrtoint(strstrip(buffer), 10, &itype);
  1201		if (rv < 0)
  1202			return rv;
  1203		type = (enum clear_refs_types)itype;
  1204		if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
  1205			return -EINVAL;
  1206	
  1207		task = get_proc_task(file_inode(file));
  1208		if (!task)
  1209			return -ESRCH;
  1210		mm = get_task_mm(task);
  1211		if (mm) {
  1212			struct mmu_notifier_range range;
  1213			struct clear_refs_private cp = {
  1214				.type = type,
  1215			};
  1216	
  1217			if (type == CLEAR_REFS_MM_HIWATER_RSS) {
  1218				if (mmap_write_lock_killable(mm)) {
  1219					count = -EINTR;
  1220					goto out_mm;
  1221				}
  1222	
  1223				/*
  1224				 * Writing 5 to /proc/pid/clear_refs resets the peak
  1225				 * resident set size to this mm's current rss value.
  1226				 */
  1227				reset_mm_hiwater_rss(mm);
  1228				mmap_write_unlock(mm);
  1229				goto out_mm;
  1230			}
  1231	
  1232			if (mmap_read_lock_killable(mm)) {
  1233				count = -EINTR;
  1234				goto out_mm;
  1235			}
  1236			if (type == CLEAR_REFS_SOFT_DIRTY) {
  1237				for (vma = mm->mmap; vma; vma = vma->vm_next) {
  1238					struct vm_area_struct *tmp;
  1239					if (!(vma->vm_flags & VM_SOFTDIRTY)) {
  1240						inc_wrprotect_tlb_flush_pending(vma);
  1241						continue;
  1242					}
  1243	
  1244					/*
  1245					 * Rollback wrprotect_tlb_flush_pending before
  1246					 * releasing the mmap_lock.
  1247					 */
> 1248					for (vma = mm->mmap; vma != tmp;
  1249					     vma = vma->vm_next)
  1250						dec_wrprotect_tlb_flush_pending(vma);
  1251	
  1252					mmap_read_unlock(mm);
  1253					if (mmap_write_lock_killable(mm)) {
  1254						count = -EINTR;
  1255						goto out_mm;
  1256					}
  1257					for (vma = mm->mmap; vma; vma = vma->vm_next) {
  1258						vma->vm_flags &= ~VM_SOFTDIRTY;
  1259						vma_set_page_prot(vma);
  1260						inc_wrprotect_tlb_flush_pending(vma);
  1261					}
  1262					mmap_write_downgrade(mm);
  1263					break;
  1264				}
  1265	
  1266				inc_tlb_flush_pending(mm);
  1267				mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY,
  1268							0, NULL, mm, 0, -1UL);
  1269				mmu_notifier_invalidate_range_start(&range);
  1270			}
  1271			walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops,
  1272					&cp);
  1273			if (type == CLEAR_REFS_SOFT_DIRTY) {
  1274				mmu_notifier_invalidate_range_end(&range);
  1275				flush_tlb_mm(mm);
  1276				for (vma = mm->mmap; vma; vma = vma->vm_next)
  1277					dec_wrprotect_tlb_flush_pending(vma);
  1278				dec_tlb_flush_pending(mm);
  1279			}
  1280			mmap_read_unlock(mm);
  1281	out_mm:
  1282			mmput(mm);
  1283		}
  1284		put_task_struct(task);
  1285	
  1286		return count;
  1287	}
  1288	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 20:04 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Andrea Arcangeli 2021-01-07 20:04 ` [PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state Andrea Arcangeli 2021-01-07 20:04 ` [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending Andrea Arcangeli @ 2021-01-07 20:25 ` Jason Gunthorpe 2021-01-07 20:32 ` Linus Torvalds 2021-01-07 21:45 ` Andrea Arcangeli 2 siblings, 2 replies; 96+ messages in thread From: Jason Gunthorpe @ 2021-01-07 20:25 UTC (permalink / raw) To: Andrea Arcangeli Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > vmsplice syscall API is insecure allowing long term GUP PINs without > privilege. Lots of places are relying on pin_user_pages long term pins of memory, and cannot be converted to notifiers. I don't think it is reasonable to just declare that insecure and requires privileges, it is a huge ABI break. FWIW, vhost tries to use notifiers as a replacement for GUP, and I think it ended up quite strange and complicated. It is hard to maintain performance when every access to the pages needs to hold some protection against parallel invalidation. Jason ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 20:25 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe @ 2021-01-07 20:32 ` Linus Torvalds 2021-01-07 21:05 ` Linus Torvalds 2021-01-07 21:54 ` Andrea Arcangeli 2021-01-07 21:45 ` Andrea Arcangeli 1 sibling, 2 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 20:32 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 12:25 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > Lots of places are relying on pin_user_pages long term pins of memory, > and cannot be converted to notifiers. > > I don't think it is reasonable to just declare that insecure and > requires privileges, it is a huge ABI break. Also, I think GUP (and pin_user_pages() as a special case) is a lot more important and more commonly used than UFFD. Which is really why I think this needs to be fixed by just fixing UFFD to take the write lock. I think Andrea is blinded by his own love for UFFDIO: when I do a debian codesearch for UFFDIO_WRITEPROTECT, all it finds is the kernel and strace (and the qemu copies of the kernel headers). Does the debian code search cover everything? Obviously not. But if you cannot find A SINGLE USE of that thing in the Debian code search, then that is sure a sign of _something_. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 20:32 ` Linus Torvalds @ 2021-01-07 21:05 ` Linus Torvalds 2021-01-07 22:02 ` Andrea Arcangeli 2021-01-09 19:32 ` Matthew Wilcox 2021-01-07 21:54 ` Andrea Arcangeli 1 sibling, 2 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 21:05 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 12:32 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Which is really why I think this needs to be fixed by just fixing UFFD > to take the write lock. Side note, and not really related to UFFD, but the mmap_sem in general: I was at one point actually hoping that we could make the mmap_sem a spinlock, or at least make the rule be that we never do any IO under it. At which point a write lock hopefully really shouldn't be such a huge deal. The main source of IO under the mmap lock was traditionally the page faults obviously needing to read the data in, but we now try to handle that with the whole notion of page fault restart instead. But I'm 100% sure we don't do as good a job of it as we _could_ do, and there are probably a lot of other cases where we end up doing IO under the mmap lock simply because we can and nobody has looked at it very much. So if taking the mmap_sem for writing is a huge deal - because it ends up serializing with IO by people who take it for reading - I think that is something that might be worth really looking into. For example, right now I think we (still) only do the page fault retry once - and the second time if the page still isn't available, we'll actually wait with the mmap_sem held. 
That goes back to the very original page fault retry logic, when I was worried that some infinite retry would cause busy-waiting because somebody didn't do the proper "drop mmap_sem, then wait, then return retry". And if that actually causes problems, maybe we should just make sure to fix it? Remove that FAULT_FLAG_TRIED bit entirely, and make the rule be that we always drop the mmap_sem and retry? Similarly, if there are users that don't set FAULT_FLAG_ALLOW_RETRY at all (because they don't have the logic to check if it's a re-try and re-do the mmap_sem etc), maybe we can just fix them. I think all the architectures do it properly in their page fault paths (I think Peter Xu converted them all - no?), but maybe there are cases of GUP that don't have it. Or maybe there is something else that I just didn't notice, where we end up having bad latencies on the mmap_sem. I think those would very much be worth fixing, so that if UFFDIO_WRITEPROTECT taking the mmap_sem for writing causes problems, we can _fix_ those problems. But I think it's entirely wrong to treat UFFDIO_WRITEPROTECT as specially as Andrea seems to want to treat it. Particularly with absolutely zero use cases to back it up. Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 21:05 ` Linus Torvalds @ 2021-01-07 22:02 ` Andrea Arcangeli 2021-01-07 22:17 ` Linus Torvalds 2021-01-09 19:32 ` Matthew Wilcox 1 sibling, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 22:02 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > I think those would very much be worth fixing, so that if > UFFDIO_WRITEPROTECT taking the mmap_sem for writing causes problems, > we can _fix_ those problems. > > But I think it's entirely wrong to treat UFFDIO_WRITEPROTECT as > specially as Andrea seems to want to treat it. Particularly with > absolutely zero use cases to back it up. Again for the record: there's nothing at all special in UFFDIO_WRITEPROTECT in this respect. If you could stop mentioning UFFDIO_WRITEPROTECT and only focus on softdirty/clear_refs, maybe you wouldn't think my judgment is biased towards clear_refs/softdirty too. You can imagine the side effects of page_count doing a COW erroneously, as a corollary of the fact that KSM won't ever allow to merge two pages if one of them is under GUP pin. Why is that? Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 22:02 ` Andrea Arcangeli @ 2021-01-07 22:17 ` Linus Torvalds 2021-01-07 22:56 ` Andrea Arcangeli 0 siblings, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-07 22:17 UTC (permalink / raw) To: Andrea Arcangeli Cc: Jason Gunthorpe, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 7, 2021 at 2:03 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > If you could stop mentioning UFFDIO_WRITEPROTECT and only focus on > softdirty/clear_refs, maybe you wouldn't think my judgment is biased > towards clear_refs/softdirty too. So I think we can agree that even that softdirty case we can just handle by "don't do that then". If a page is pinned, the dirty bit of it makes no sense, because it might be dirtied completely asynchronously by the pinner. So I think none of the softdirty stuff should touch pinned pages. I think it falls solidly under that "don't do it then". Just skipping over them in clear_soft_dirty[_pmd]() doesn't look that hard, does it? Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 22:17 ` Linus Torvalds @ 2021-01-07 22:56 ` Andrea Arcangeli 0 siblings, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 22:56 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 02:17:50PM -0800, Linus Torvalds wrote: > So I think we can agree that even that softdirty case we can just > handle by "don't do that then". Absolutely. The question is if somebody was happily running clear_refs with a RDMA attached to the process, by the time they update and reboot they'll find it the hard way with silent mm corruption currently. So I was obliged to report this issue and the fact there was very strong reason why page_count was not used there and it's even documented explicitly in the source: * [..] however we only use * page_trans_huge_mapcount() in the copy-on-write faults where we * need full accuracy to avoid breaking page pinning, [..] I didn't entirely forget the comment when I reiterated it in fact also in Message-ID: <20200527212005.GC31990@redhat.com> on May 27 2020 since I recalled there was a very explicit reason why we had to use page_mapcount in do_wp_page and deliver full accuracy. Now I cannot prove there's any such user that will break, but we'll find those with a year or more of delay. Even the tlb flushing deferral that caused clear_refs_write to also corrupt userland memory and literally lose userland writes even without any special secondary MMU hardware being attached to the memory, took 6 months to find.
> if a page is pinned, the dirty bit of it makes no sense, because it > might be dirtied complately asynchronously by the pinner. > > So I think none of the softdirty stuff should touch pinned pages. I > think it falls solidly under that "don't do it then". > > Just skipping over them in clear_soft_dirty[_pmd]() doesn't look that > hard, does it? 1) How do you know again if it's not speculative lookup or an O_DIRECT that made them look pinned? 2) it's a hugepage 1, a 4k pin will make soft dirty then skip 511 dirty bits by mistake including those pages that weren't pinned and that userland is still writing through the transhuge pmd. Until v4.x we had a page pin counter for each subpage so the above wouldn't have happened, but not anymore since it was simplified and improved but now the page_count is pretty inefficient there, even if it'd be possible to make safe. 3) the GUP(write=0) may be just reading from RAM and sending to the other end so clear_refs may have currently very much tracked CPU writes fine, with no interference whatsoever from the secondary MMU working in readonly on the memory. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 21:05 ` Linus Torvalds 2021-01-07 22:02 ` Andrea Arcangeli @ 2021-01-09 19:32 ` Matthew Wilcox 2021-01-09 19:46 ` Linus Torvalds 1 sibling, 1 reply; 96+ messages in thread From: Matthew Wilcox @ 2021-01-09 19:32 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > Side note, and not really related to UFFD, but the mmap_sem in > general: I was at one point actually hoping that we could make the > mmap_sem a spinlock, or at least make the rule be that we never do any > IO under it. At which point a write lock hopefully really shouldn't be > such a huge deal. There's a (small) group of us working towards that. It has some prerequisites, but where we're hoping to go currently: - Replace the vma rbtree with a b-tree protected with a spinlock - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset - If we need to do I/O, take a refcount on the VMA After that, we can gradually move things out from mmap_sem protection to just the vma tree spinlock, or whatever makes sense for them. In a very real way the mmap_sem is the MM layer's BKL. ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-09 19:32 ` Matthew Wilcox @ 2021-01-09 19:46 ` Linus Torvalds 2021-01-15 14:30 ` Jan Kara 0 siblings, 1 reply; 96+ messages in thread From: Linus Torvalds @ 2021-01-09 19:46 UTC (permalink / raw) To: Matthew Wilcox Cc: Jason Gunthorpe, Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Sat, Jan 9, 2021 at 11:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > > Side note, and not really related to UFFD, but the mmap_sem in > > general: I was at one point actually hoping that we could make the > > mmap_sem a spinlock, or at least make the rule be that we never do any > > IO under it. At which point a write lock hopefully really shouldn't be > > such a huge deal. > > There's a (small) group of us working towards that. It has some > prerequisites, but where we're hoping to go currently: > > - Replace the vma rbtree with a b-tree protected with a spinlock > - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset > - If we need to do I/O, take a refcount on the VMA > > After that, we can gradually move things out from mmap_sem protection > to just the vma tree spinlock, or whatever makes sense for them. In a > very real way the mmap_sem is the MM layer's BKL. Well, we could do the "no IO" part first, and keep the semaphore part. Some people actually prefer a semaphore to a spinlock, because it doesn't end up causing preemption issues. As long as you don't do IO (or memory allocations) under a semaphore (ok, in this case it's a rwsem, same difference), it might even be preferable to keep it as a semaphore rather than as a spinlock. 
So it doesn't necessarily have to go all the way - we _could_ just try something like "when taking the mmap_sem, set a thread flag" and then have a "warn if doing allocations or IO under that flag". And since this is about performance, not some hard requirement, it might not even matter if we catch all cases. If we fix it so that any regular load on most normal filesystems never sees the warning, we'd already be golden. Of course, I think we've had issues with rw_sems for _other_ reasons. Waiman actually removed the reader optimistic spinning because it caused bad interactions with mixed reader-writer loads. So rwsemaphores may end up not working as well as spinlocks if the common situation is "just wait a bit, you'll get it". Linus ^ permalink raw reply [flat|nested] 96+ messages in thread
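The "set a thread flag when taking the mmap_sem, warn on allocation or IO under that flag" idea above can be modelled in user space. This is only a sketch under stated assumptions, not kernel code: `toy_task`, `toy_mmap_lock()`, `toy_mmap_unlock()` and `toy_alloc()` are hypothetical names made up for illustration, a plain depth counter stands in for the rwsem, and `malloc()` stands in for a kernel allocation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-thread state; in the kernel this would be a task flag. */
struct toy_task {
	int mmap_lock_depth;	/* > 0 while the modelled mmap_sem is held */
	int warnings;		/* number of times the debug check fired */
};

/* Model of "when taking the mmap_sem, set a thread flag". */
static void toy_mmap_lock(struct toy_task *t)
{
	t->mmap_lock_depth++;
}

static void toy_mmap_unlock(struct toy_task *t)
{
	assert(t->mmap_lock_depth > 0);	/* unlock without lock is a bug */
	t->mmap_lock_depth--;
}

/* Model of "warn if doing allocations under that flag". */
static void *toy_alloc(struct toy_task *t, size_t size)
{
	if (t->mmap_lock_depth > 0) {
		t->warnings++;
		fprintf(stderr, "warning: %zu byte allocation under mmap_sem\n",
			size);
	}
	/* The allocation still succeeds: this is a debug aid, not a rule. */
	return malloc(size);
}
```

Because the check only warns, missing a few call sites costs nothing in correctness, which matches the point that this is about performance and does not need to catch every case.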
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-09 19:46 ` Linus Torvalds @ 2021-01-15 14:30 ` Jan Kara 0 siblings, 0 replies; 96+ messages in thread From: Jan Kara @ 2021-01-15 14:30 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Jason Gunthorpe, Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Sat 09-01-21 11:46:46, Linus Torvalds wrote: > On Sat, Jan 9, 2021 at 11:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > > > Side note, and not really related to UFFD, but the mmap_sem in > > > general: I was at one point actually hoping that we could make the > > > mmap_sem a spinlock, or at least make the rule be that we never do any > > > IO under it. At which point a write lock hopefully really shouldn't be > > > such a huge deal. > > > > There's a (small) group of us working towards that. It has some > > prerequisites, but where we're hoping to go currently: > > > > - Replace the vma rbtree with a b-tree protected with a spinlock > > - Page faults walk the b-tree under RCU, like peterz/laurent's SPF patchset > > - If we need to do I/O, take a refcount on the VMA > > > > After that, we can gradually move things out from mmap_sem protection > > to just the vma tree spinlock, or whatever makes sense for them. In a > > very real way the mmap_sem is the MM layer's BKL. > > Well, we could do the "no IO" part first, and keep the semaphore part. > > Some people actually prefer a semaphore to a spinlock, because it > doesn't end up causing preemption issues. 
> > As long as you don't do IO (or memory allocations) under a semaphore > (ok, in this case it's a rwsem, same difference), it might even be > preferable to keep it as a semaphore rather than as a spinlock. > > So it doesn't necessarily have to go all the way - we _could_ just try > something like "when taking the mmap_sem, set a thread flag" and then > have a "warn if doing allocations or IO under that flag". > > And since this is about performance, not some hard requirement, it > might not even matter if we catch all cases. If we fix it so that any > regular load on most normal filesystems never see the warning, we'd > already be golden. Honestly, I'd *love* if a filesystem can be guaranteed that ->fault and ->mkwrite callbacks do not happen under mmap_sem (or if at least fs would be free to drop mmap_sem if it finds the page is not already cached / prepared for writing). Because for filesystems the locking of page fault is really painful as the lock ordering wrt mmap_sem is exactly oposite compared to read / write path (read & write path must be designed so that mmap_sem can be taken inside it to copy user data, fault path may be all happening under mmap_sem). As a result this has been a long term source of deadlocks, stale data exposure issues, and filesystem corruption issues due to insufficient locking for multiple filesystems. But when I was looking at what it would take to achieve this several years ago, fixing all GUP users to deal with mmap_sem being dropped during a fault was a gigantic task because there were users of GUP relying on mmap_sem being held for large code sections around the GUP call... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 20:32 ` Linus Torvalds 2021-01-07 21:05 ` Linus Torvalds @ 2021-01-07 21:54 ` Andrea Arcangeli 1 sibling, 0 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 21:54 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Gunthorpe, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 12:32:09PM -0800, Linus Torvalds wrote: > I think Andrea is blinded by his own love for UFFDIO: when I do a > debian codesearch for UFFDIO_WRITEPROTECT, all it finds is the kernel > and strace (and the qemu copies of the kernel headers). For the record, I feel obliged to reiterate I'm thinking purely in clear_refs terms here. It'd be great if we can only focus on clear_refs_write and nothing else. Like I said earlier, whatever way clear_refs/softdirty copes with do_wp_page, uffd-wp can do the identical thing so, uffd-wp is effectively irrelevant in this whole discussion. Clear-refs/softdirty predates uffd-wp by several years too. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 20:25 ` [PATCH 0/2] page_count can't be used to decide when wp_page_copy Jason Gunthorpe 2021-01-07 20:32 ` Linus Torvalds @ 2021-01-07 21:45 ` Andrea Arcangeli 2021-01-08 13:36 ` Jason Gunthorpe 1 sibling, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-07 21:45 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote: > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > > > vmsplice syscall API is insecure allowing long term GUP PINs without > > privilege. > > Lots of places are relying on pin_user_pages long term pins of memory, > and cannot be converted to notifiers. > > I don't think it is reasonable to just declare that insecure and > requires privileges, it is a huge ABI break. Where's that ABI? Are there specs or a code example in kernel besides vmsplice itself? I don't see how it's possible to consider long term GUP pins completely unprivileged if not using mmu notifier. vmsplice doesn't even account them in rlimit (it cannot because it cannot identify all put_pages either). Long term GUP pins not using mmu notifier and not accounted in rlimit are an order of magnitude more VM-intrusive than mlock. The reason it's worse than mlock, even if ignore all performance feature that they break including numa bindings and that mlock doesn't risk to break, come because you can unmap the memory after taking those rlimit unaccounted GUP pins. So the OOM killer won't even have a chance to see the GUP pins coming. 
So it can't be that mlock has to be privileged but unconstrained unaccounted long term GUP pins as in vmsplice are ok to stay unprivileged. Now io_uring does account the GUP pins in the mlock rlimit, but after the vma is unmapped it'd still cause the same confusion to OOM killer and in addition the assumption that each GUP pin costs 4k is also flawed. However io_uring model can use the mmu notifier without slowdown to the fast paths, so it's not going to cause any ABI break to fix it. Or to see it another way, it'd be fine to declare all mlock rlimits are obsolete and memcg is the only way to constrain RAM usage, but then mlock should stop being privileged, because mlock is a lesser concern and it won't risk to confuse the OOM killer at least. The good thing is the GUP pins won't escape memcg accounting but that accounting also doesn't come entirely free. > FWIW, vhost tries to use notifiers as a replacement for GUP, and I > think it ended up quite strange and complicated. It is hard to > maintain performance when every access to the pages needs to hold some > protection against parallel invalidation. And that's fine, this is all about if it should require a one liner change to add the username in the realtime group in /etc/group or not. You're focusing on your use case, but we've to put things in perspective of how all these changes started. The whole zygote issue wouldn't even register if the child had the exact same credentials of the parent. Problem is the child dropped privileges and went with a luser id, that clearly cannot ptrace the parent, and so if long term unprivileged GUP pins are gone from the equation, what remains that the child can do is purely theoretical even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. NOTE: I'm all for fixing the COW for good, but vmsplice or any long term GUP pin that is absolutely required to make such attack practical, looks the real low hanging fruit here to fix.
However fixing it so clear_refs becomes fundamentally incompatible with mmu notifier users unless they all convert to pure !FOLL_GET GUPs, let alone long term GUP pins not using mmu notifier, doesn't look great. For vmsplice that new break-COW is the fix because it happens in the other process. For every legit long term GUP, where the break-COW happens in the single and only process, it's silent MM corruption. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-07 21:45 ` Andrea Arcangeli @ 2021-01-08 13:36 ` Jason Gunthorpe 2021-01-08 17:00 ` Andrea Arcangeli 0 siblings, 1 reply; 96+ messages in thread From: Jason Gunthorpe @ 2021-01-08 13:36 UTC (permalink / raw) To: Andrea Arcangeli Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote: > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote: > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > > > > > vmsplice syscall API is insecure allowing long term GUP PINs without > > > privilege. > > > > Lots of places are relying on pin_user_pages long term pins of memory, > > and cannot be converted to notifiers. > > > > I don't think it is reasonable to just declare that insecure and > > requires privileges, it is a huge ABI break. > > Where's that ABI? Are there specs or a code example in kernel besides > vmsplice itself? If I understand you right, you are trying to say that the 193 pin_user_pages() callers cannot exist as unpriv any more? The majority cannot be converted to notifiers because they are DMA based. Every one of those is an ABI for something, and does not expect extra privilege to function. It would be a major breaking change to have pin_user_pages require some cap. > The whole zygote issue wouldn't even register if the child had the > exact same credentials of the parent. 
Problem is the child dropped > privileges and went with a luser id, that clearly cannot ptrace the > parent, and so if long term unprivileged GUP pins are gone from the > equation, what remains that the child can do is purely theoretical > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. Sorry, I'm not sure I've found a good explanation how ptrace and GUP are interacting to become a security problem. 17839 makes sense to me, and read-only GUP has been avoided by places like RDMA and others for a very long time because of these issues, adding the same idea to the core code looks OK. The semantics we discussed during the COW on fork thread for pin user pages were, more or less, that once pinned a page should not be silently removed from the mm it is currently in by COW or otherwise in the kernel. So maybe ptrace should not be COW'ing pinned pages at all, as that is exactly the same kind of silent corruption fork was causing. Jason ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 13:36 ` Jason Gunthorpe @ 2021-01-08 17:00 ` Andrea Arcangeli 2021-01-08 18:19 ` Jason Gunthorpe [not found] ` <20210109034958.6928-1-hdanton@sina.com> 0 siblings, 2 replies; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-08 17:00 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 08, 2021 at 09:36:49AM -0400, Jason Gunthorpe wrote: > On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote: > > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote: > > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > > > > > > > vmsplice syscall API is insecure allowing long term GUP PINs without ^^^^^^^^^ > > > > privilege. > > > > > > Lots of places are relying on pin_user_pages long term pins of memory, > > > and cannot be converted to notifiers. > > > > > > I don't think it is reasonable to just declare that insecure and > > > requires privileges, it is a huge ABI break. > > > > Where's that ABI? Are there specs or a code example in kernel besides > > vmsplice itself? > > If I understand you right, you are trying to say that the 193 > pin_user_pages() callers cannot exist as unpriv any more? 193, 1k 1m or their number in general, won't just make them safe... > The majority cannot be converted to notifiers because they are DMA > based. Every one of those is an ABI for something, and does not expect > extra privilege to function. It would be a major breaking change to > have pin_user_pages require some cap. ... what makes them safe is to be transient GUP pin and not long term. Please note the "long term" in the underlined line. 
O_DIRECT is perfectly ok to be unprivileged obviously. The VM can wait, eventually it goes away. Even a swapout is not an instant event and can be hold off by any number of other things besides a transient GUP pin. It can be hold off by PG_lock just to make an example. mlock however is long term, persistent, vmsplice takes persistent and can pin way too much memory for each mm, that doesn't feel safe. The more places doing stuff like that, the more likely one causes a safety issue, not the other way around it in fact. > > The whole zygote issue wouldn't even register if the child had the > > exact same credentials of the parent. Problem is the child dropped > > privileges and went with a luser id, that clearly cannot ptrace the > > parent, and so if long term unprivileged GUP pins are gone from the > > equation, what remains that the child can do is purely theoretical > > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. > > Sorry, I'm not sure I've found a good explanation how ptrace and GUP > are interacting to become a security problem. ptrace is not involved. What I meant by mentioning ptrace, is that if the child can ptrace the parent, then it doesn't matter if it can also do the below, so the security concern is zero in such case. With O_DIRECT or any transient pin you will never munmap while O_DIRECT is in flight, if you munmap it's undefined what happens in such case anyway. It is a theoretical security issue made practical by vmsplice API that allows to enlarge the window to years of time (not guaranteed milliseconds), to wait for the parent to trigger the wp_page_reuse. Remove vmsplice and the security issue in theory remains, but removed vmsplice it becomes irrelevant statistically speaking in practice. io_uring has similar concern but it can use mmu notifier, so it can totally fix it and be 100% safe from this. The scheduler disclosure date was 2020-08-25 so I can freely explain the case that motivated all these changes. 
case A)

	if !fork() {
		// in child
		mmap one page
		vmsplice takes gup pin long term on such page
		munmap one page
		// mapcount == 1 (parent mm)
		// page_count == 2 (gup in child, and parent mm)
	} else {
		parent writes to the page
		// mapcount == 1, wp_page_reuse
	}

parent did a COW with mapcount == 1 so the parent will take over a page that is still GUP pinned in the child. That's the security issue because in this case the GUP pin is malicious.

Now imagine this case B)

	mmap one page
	RDMA or any secondary MMU takes a long term GUP pin
	munmap one page
	// mapcount == 1 (parent mm)
	// page_count == 2 (gup in RDMA, and parent mm)

How can the VM tell between the two different cases? It can't.

The current page_count in do_wp_page treats both cases the same and because page_count is 2 in both cases, it calls wp_page_copy in both cases breaking-COW in both cases.

However, you know full well in the second case it is a feature and not a bug, that wp_page_reuse is called instead, and in fact it has to be called or it's a bug (and that's the bug page_count in do_wp_page introduces).

So page_count in do_wp_page is breaking all valid users, to take care of the purely theoretical security issue that isn't a practical concern if only vmsplice is secured at least as well as mlock.

page_count in do_wp_page is fundamentally flawed for all long term GUP pins done by secondary MMUs attached to the memory.

The fix in 17839856fd588f4ab6b789f482ed3ffd7c403e1f had to work by triggering a GUP(write=1), that would break-COW while vmsplice runs, in turn fully resolving the security concern, but without breaking your very important case B.

> 17839 makes sense to me, and read-only GUP has been avoided by places
> like RDMA and others for a very long time because of these issues,
> adding the same idea to the core code looks OK.
Yes I acked 17839856fd588f4ab6b789f482ed3ffd7c403e1f since it looked the cleanest solution to take care of the purely theoretical security issue (purely theoretical after vmsplice is taken care of). I planned today to look what didn't work exactly in 17839856fd588f4ab6b789f482ed3ffd7c403e1f that may have required to move to 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, it was an huge email thread and I was too busy with urgent work at the time. > The semantics we discussed during the COW on fork thread for pin user > pages were, more or less, that once pinned a page should not be > silently removed from the mm it is currently in by COW or otherwise in > the kernel. I don't get what you mean here. Could you elaborate? > So maybe ptrace should not be COW'ing pinned pages at all, as that is > exactly the same kind of silent corruption fork was causing. ptrace isn't involved, details above. Could you elaborate also if fork started corrupting with 17839856fd588f4ab6b789f482ed3ffd7c403e1f applied? In which commit exactly the corruption started. In general fork(), unless you copy all GUP pinned pages and you don't wrprotect them in fork(), must be handled by blocking all writes on the RDMA region in the parent, then you fork, only after child did the exec you're allowed to unblock the writes in the parent that holds the GUP long term pins. I don't see a notable difference from page_count or mapcount in do_wp_page in this respect: only copying in fork() if the page is pinned like I was also proposed here https://lkml.kernel.org/r/20090311165833.GI27823@random.random will also prevent having to block the writes until exec is run though. FWIW I obviously agree in copying in fork any pinned page, but that was supposed to be an orthogonal improvement and it wasn't supposed to fix a recent regression (the fork vs thread vs gup race always existed, and the need of stopping writes in between fork and exec also). Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
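The case A / case B argument in the message above can be replayed with a toy model of the two counters. This is a sketch under stated assumptions and not the kernel's do_wp_page(): `toy_page`, `cow_by_page_count()`, `cow_by_mapcount()` and the two case helpers are hypothetical names, the two policies are reduced to "reuse only if the count is 1", and case A assumes the page is one the child inherited from the parent across fork().

```c
#include <assert.h>

/* Toy stand-in for the two counters discussed above; not struct page. */
struct toy_page {
	int mapcount;	/* number of page-table mappings of the page */
	int page_count;	/* all references: mappings plus GUP pins */
};

enum cow_action { COW_REUSE, COW_COPY };

/* Simplified page_count policy: reuse only if we hold the sole reference. */
static enum cow_action cow_by_page_count(const struct toy_page *p)
{
	return p->page_count == 1 ? COW_REUSE : COW_COPY;
}

/* Simplified mapcount policy: reuse if only one mapping remains. */
static enum cow_action cow_by_mapcount(const struct toy_page *p)
{
	return p->mapcount == 1 ? COW_REUSE : COW_COPY;
}

/* Case A: child pins a page shared after fork(), then unmaps it. */
static struct toy_page case_a_vmsplice_child(void)
{
	/* After fork(): mapped in parent and child, one reference each. */
	struct toy_page p = { .mapcount = 2, .page_count = 2 };

	p.page_count++;		/* child: vmsplice takes a long term GUP pin */
	p.mapcount--;		/* child: munmap drops its mapping ... */
	p.page_count--;		/* ... and the reference that came with it */
	assert(p.mapcount >= 0 && p.page_count >= p.mapcount);
	return p;
}

/* Case B: only the end state given in the message is modelled. */
static struct toy_page case_b_rdma(void)
{
	struct toy_page p = { .mapcount = 1, .page_count = 1 };

	p.page_count++;		/* RDMA or other secondary MMU long term pin */
	return p;
}
```

Both helpers end with mapcount == 1 and page_count == 2, so either policy gives the same answer for the malicious pin and for the legitimate one: the page_count policy copies in both cases, the mapcount policy reuses in both.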
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 17:00 ` Andrea Arcangeli @ 2021-01-08 18:19 ` Jason Gunthorpe 2021-01-08 18:31 ` Andy Lutomirski ` (2 more replies) [not found] ` <20210109034958.6928-1-hdanton@sina.com> 1 sibling, 3 replies; 96+ messages in thread From: Jason Gunthorpe @ 2021-01-08 18:19 UTC (permalink / raw) To: Andrea Arcangeli Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote: > > The majority cannot be converted to notifiers because they are DMA > > based. Every one of those is an ABI for something, and does not expect > > extra privilege to function. It would be a major breaking change to > > have pin_user_pages require some cap. > > ... what makes them safe is to be transient GUP pin and not long > term. > > Please note the "long term" in the underlined line. Many of them are long term, though only 50 or so have been marked specifically with FOLL_LONGTERM. I don't see how we can make such a major ABI break. Looking at it, vmsplice() is simply wrong. A long term page pin must use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode) FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode) ie it must COW and it must reject cases that are not longterm safe, like DAX and CMA and so on. These are the well established rules, vmsplice does not get a pass simply because it is using the CPU to memory copy as its "DMA". > speaking in practice. io_uring has similar concern but it can use mmu > notifier, so it can totally fix it and be 100% safe from this. IIRC io_uring does use FOLL_LONGTERM and FOLL_WRITE.. 
> The scheduler disclosure date was 2020-08-25 so I can freely explain > the case that motivated all these changes. > > case A) > > if !fork() { > // in child > mmap one page > vmsplice takes gup pin long term on such page > munmap one page > // mapcount == 1 (parent mm) > // page_count == 2 (gup in child, and parent mm) > } else { > parent writes to the page > // mapcount == 1, wp_page_reuse > } > > parent did a COW with mapcount == 1 so the parent will take over a > page that is still GUP pinned in the child. Sorry, I missed something, how does mmaping a fresh new page in the child impact the parent? I guess the issue is not to mmap but to GUP a shared page in a way that doesn't trigger COW during GUP and then munmap that page so a future parent COW does re-use, leaking access. It seems enforcing FOLL_WRITE to always COW on GUP closes this, right? This is what all correct FOLL_LONGTERM users do today, it is required for many other reasons beyond this interesting security issue. > However, you know full well in the second case it is a feature and not > a bug, that wp_page_reuse is called instead, and in fact it has to be > called or it's a bug (and that's the bug page_count in do_wp_page > introduces). What I was trying to explain below, is I think we agreed that a page under active FOLL_LONGTERM pin *can not* be write protected. Establishing the FOLL_LONGTERM pin (for read or write) must *always* break the write protection and the VM *cannot* later establish a new write protection on that page while the pin is active. Indeed, it is complete nonsense to try and write protect a page that has active DMA write activity! Changing the CPU page protection bits will not stop any DMA! Doing so will inevitably become a security problem with an attack similar to what you described. So this is what was done during fork() - fork will no longer write protect pages under FOLL_LONGTERM to make them COWable, instead it will copy them at fork time. 
Any other place doing write protect must also follow these same rules.

I wasn't aware this could be used to create a security problem, but it does make sense. Write protect really must mean writes to the memory must stop, and that is fundamentally incompatible with active DMA.

Thus write protect of pages under DMA must be forbidden, as a matter of security.

Jason
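The fork() rule described in the message above (a page under an active long term pin is copied at fork time instead of being write protected) can be sketched as a toy userspace model. This is only an illustration under stated assumptions: struct toy_page, toy_fork_one_page() and enum fork_action are hypothetical names invented here, not kernel symbols, and the model ignores page tables, locking and huge pages entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Hypothetical stand-in for a page: contents, pin count, protection bit. */
struct toy_page {
    char data[TOY_PAGE_SIZE];
    int pins;              /* active long term pins (assumed to be trackable) */
    bool write_protected;  /* true once the page has been made COW-able */
};

enum fork_action {
    FORK_SHARE_WRPROTECT,  /* share the page, write protect it for later COW */
    FORK_COPY_NOW          /* give the child its own copy at fork time */
};

/*
 * Decide what fork does with one parent page.
 * A pinned page is never write protected: the child gets a copy right
 * away and the parent page stays writable for whoever holds the pin.
 */
enum fork_action toy_fork_one_page(struct toy_page *parent,
                                   struct toy_page *child_copy)
{
    assert(parent != NULL && child_copy != NULL);

    if (parent->pins > 0) {
        memcpy(child_copy->data, parent->data, TOY_PAGE_SIZE);
        child_copy->pins = 0;
        child_copy->write_protected = false;
        return FORK_COPY_NOW;
    }

    /* Unpinned page: classic COW, both sides share it read-only. */
    parent->write_protected = true;
    return FORK_SHARE_WRPROTECT;
}
```

In this sketch the only input to the decision is the pin count, which mirrors the argument that changing CPU protection bits cannot stop a pinner from writing, so the protection must not be applied in the first place.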
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 18:19 ` Jason Gunthorpe @ 2021-01-08 18:31 ` Andy Lutomirski 2021-01-08 18:38 ` Linus Torvalds 2021-01-08 23:34 ` Andrea Arcangeli 2021-01-08 18:59 ` Linus Torvalds 2021-01-08 22:43 ` Andrea Arcangeli 2 siblings, 2 replies; 96+ messages in thread From: Andy Lutomirski @ 2021-01-08 18:31 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, Linux-MM, LKML, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 8, 2021 at 10:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote: > > > The majority cannot be converted to notifiers because they are DMA > > > based. Every one of those is an ABI for something, and does not expect > > > extra privilege to function. It would be a major breaking change to > > > have pin_user_pages require some cap. > > > > ... what makes them safe is to be transient GUP pin and not long > > term. > > > > Please note the "long term" in the underlined line. > > Many of them are long term, though only 50 or so have been marked > specifically with FOLL_LONGTERM. I don't see how we can make such a > major ABI break. > > Looking at it, vmsplice() is simply wrong. A long term page pin must > use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode) > FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode) Can we just remove vmsplice() support? We could make it do a normal copy, thereby getting rid of a fair amount of nastiness and potential attacks. 
Even ignoring issues relating to the length of time that the vmsplice reference is alive, we also have whatever problems could be caused by a malicious or misguided user vmsplice()ing some memory and then modifying it.

--Andy
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 18:31 ` Andy Lutomirski @ 2021-01-08 18:38 ` Linus Torvalds 2021-01-08 23:34 ` Andrea Arcangeli 1 sibling, 0 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-08 18:38 UTC (permalink / raw) To: Andy Lutomirski Cc: Jason Gunthorpe, Andrea Arcangeli, Linux-MM, LKML, Yu Zhao, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 8, 2021 at 10:31 AM Andy Lutomirski <luto@kernel.org> wrote: > > Can we just remove vmsplice() support? We could make it do a normal > copy, thereby getting rid of a fair amount of nastiness and potential > attacks. Even ignoring issues relating to the length of time that the > vmsplice reference is alive, we also have whatever problems could be > caused by a malicious or misguided user vmsplice()ing some memory and > then modifying it. Well, that "misguided user" is kind of the point, originally. That's what zero-copying is all about. But we could certainly remove it in favor of copying, because zero-copy has seldom really been a huge advantage in practice outside of benchmarks. That said, I continue to not buy into Andrea's argument that page_count() is wrong. Instead, the argument is: (1) COW can never happen "too much": the definition of a private mapping is that you have your own copy of the data. (2) the one counter case I feel is valid is page pinning when used for a special "pseudo-shared memory" thing and that's basically what FOLL_GUP does. So _regardless_ of any vmsplice issues, I actually think that those two basic rules should be our guiding principle. And the corollary to (2) is that COW must absolutely NEVER re-use too little. And that _was_ the bug with vmsplice, in that it allowed re-use that it shouldn't have. 
Linus
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 18:31 ` Andy Lutomirski 2021-01-08 18:38 ` Linus Torvalds @ 2021-01-08 23:34 ` Andrea Arcangeli 2021-01-09 19:03 ` Andy Lutomirski 1 sibling, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-08 23:34 UTC (permalink / raw) To: Andy Lutomirski Cc: Jason Gunthorpe, Linux-MM, LKML, Yu Zhao, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote: > Can we just remove vmsplice() support? We could make it do a normal The single case I've seen vmsplice used so far, that was really cool is localhost live migration of qemu. However despite really cool, it wasn't merged in the end, and I don't recall exactly why. There are even more efficient (but slightly more complex) ways to do that than vmsplice: using MAP_SHARED gigapages or MAP_SHARED tmpfs with THP opted-in in the tmpfs mount, as guest physical memory instead of anon memory and finding a way not having it cleared by kexec, so you can also upgrade the host kernel and not just qemu... is a way more optimal way to PIN and move all pages through the pipe and still having to pay a superfluous copy on destination. My guess why it's not popular, and I may be completely wrong on this since I basically never used vmsplice (other than to proof of concept DoS my phone to verify the long term GUP pin exploit works), is that vmsplice is a more efficient, but not the most efficient option. Exactly like in the live migration in place, it's always more efficient to share a tmpfs THP backed region and have true zero copy, than going through a pipe that still does one copy at the receiving end. It may also be simpler and it's not dependent on F_SETPIPE_SIZE obscure tunings. 
So in the end it's still too slow for apps that require maximum performance, and not worth the extra work for those that don't.

I love vmsplice conceptually, I'd just rather an luser cannot run it.

> copy, thereby getting rid of a fair amount of nastiness and potential
> attacks. Even ignoring issues relating to the length of time that the
> vmsplice reference is alive, we also have whatever problems could be
> caused by a malicious or misguided user vmsplice()ing some memory and
> then modifying it.

Sorry to ask but I'm curious, what also goes wrong if the user modifies memory under GUP pin from vmsplice? That's not obvious to see.

Thanks,
Andrea
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 23:34 ` Andrea Arcangeli @ 2021-01-09 19:03 ` Andy Lutomirski 2021-01-09 19:15 ` Linus Torvalds 0 siblings, 1 reply; 96+ messages in thread From: Andy Lutomirski @ 2021-01-09 19:03 UTC (permalink / raw) To: Andrea Arcangeli Cc: Andy Lutomirski, Jason Gunthorpe, Linux-MM, LKML, Yu Zhao, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai > On Jan 8, 2021, at 3:34 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote: >> Can we just remove vmsplice() support? We could make it do a normal > >> copy, thereby getting rid of a fair amount of nastiness and potential >> attacks. Even ignoring issues relating to the length of time that the >> vmsplice reference is alive, we also have whatever problems could be >> caused by a malicious or misguided user vmsplice()ing some memory and >> then modifying it. > > Sorry to ask but I'm curious, what also goes wrong if the user > modifies memory under GUP pin from vmsplice? That's not obvious to > see. It breaks the otherwise true rule that the data in pipe buffers is immutable. Even just quoting the manpage: SPLICE_F_GIFT The user pages are a gift to the kernel. The application may not modify this memory ever, otherwise the page cache and on- disk data may differ. That's no good. I can also imagine use cases in which modified vmsplice() pages that end up in various parts of the network stack could be problematic. For example, if you can arrange for TCP or, worse, TLS to transmit data and then retransmit modified data, you might get odd results. In the latter case, some security properties of TLS might be broken. --Andy ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-09 19:03 ` Andy Lutomirski @ 2021-01-09 19:15 ` Linus Torvalds 0 siblings, 0 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-09 19:15 UTC (permalink / raw) To: Andy Lutomirski Cc: Andrea Arcangeli, Jason Gunthorpe, Linux-MM, LKML, Yu Zhao, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Sat, Jan 9, 2021 at 11:03 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > > Sorry to ask but I'm curious, what also goes wrong if the user > > modifies memory under GUP pin from vmsplice? That's not obvious to > > see. > > It breaks the otherwise true rule that the data in pipe buffers is > immutable. Note that this continued harping on vmsplice() is entirely misguided. Anything using GUP has the same issues. This really has nothing to do with vmsplice() per se. In many ways, vmsplice() might be the least of your issues, because it's fairly easy to just limit that for untrusted use. And no, that does not mean "we should make vmsplice root-only" kind of limiting. There are no security issues in any normal situation. Again, it's mainly about things that don't trust each other _despite_ running in similar contexts as far as the kernel is concerned. IOW, exactly that "zygote" kind of situation. If you are a JIT (whether Zygote or a web browser), you basically need to limit the things the untrusted JIT'ed code can do. And that limiting may include vmsplice(). But note the "include" part of "include vmsplice()". Any other GUP user really does have the same issues, it may just be less obvious and have very different timings (or depend on access to devices etc). Absolutely nothing cares about "data in pipe buffers changing" in any other case. 
You can already write any data you want to a pipe, it doesn't matter if it changes after the write or not.

(In many ways, "data in the page cache" is a *much* more difficult issue for the kernel, and it's fundamental to any shared mmap. It's much more difficult because that data is obviously very much also accessible for DMA etc for writeout, and if you have something like "checksums are calculated separately and non-atomically from the actual DMA accesses", you will potentially get checksum errors where the actual disk contents don't match your separately calculated checksums until the _next_ write. This can actually be a feature - seeing "further modifications were concurrent to the write" - but most people end up considering it a bug).

Linus
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 18:19 ` Jason Gunthorpe 2021-01-08 18:31 ` Andy Lutomirski @ 2021-01-08 18:59 ` Linus Torvalds 2021-01-08 22:43 ` Andrea Arcangeli 2 siblings, 0 replies; 96+ messages in thread From: Linus Torvalds @ 2021-01-08 18:59 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, Linux-MM, Linux Kernel Mailing List, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 8, 2021 at 10:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > Sorry, I missed something, how does mmaping a fresh new page in the > child impact the parent? > > I guess the issue is not to mmap but to GUP a shared page No. It has nothing to do with a shared page. The problem with the COW in the child is that the parent now BELIEVES that it has a private copy (because page_mapcount() was 1), but it doesn't really. But because the parent *thought* it had a private copy of the page, when the _parent_ did a write, it would cause the page COW logic to go "you have exclusive access to the page, so I'll just make it writable". The parent then writes whatever private data to that page. That page is still in the system as a vmsplice'd page, and the child can now read that private data that was _supposed_ to be exclusive to the parent, but wasn't. And the thing is, blaming vmsplice() is entirely wrong. The exact same thing used to be able to happen with any GUP case, vmsplice() was just the simplest way to cause that non-mapped page access. But any GUP could do it, with the child basically fooling the parent into revealing data. Note that Zygote itself is in no way special from a technical standpoint, and this can happen after any random fork(). 
The only real difference is that in all *traditional* UNIX cases, this "child can see the parent's data with trickery before execve()" situation simply doesn't *matter*. In traditional fork() situations, the parent and the child are really the same program, and if you don't trust the child, then you don't trust the parent either. The Android Zygote case isn't _technically_ any different. But the difference is that because the whole idea with Zygote is to pre-map the JIT stuff for the child, you are in this special situation where the parent doesn't actually trust the child. See? No _technical_ difference. Exact same scenario as for any random fork() with GUP and COW going the wrong way. It just normally doesn't _matter_. And see above: because this is not really specific to vmsplice() (apart from that just being the easiest model), the _original_ fix for this was just "GUP will break COW early" commit: 17839856fd58 ("gup: document and work around "COW can break either way" issue") which is very straightforward: if you do a GUP lookup, you force that GUP to do the COW for you, so that nobody can then fool another process to think that it has a private page that can be re-used, but it really has a second reference to it. Because whoever took the "sneaky" GUP reference had to get their _own_ private copy first. But while that approach was very simple and very targeted (and I don't think it's wrong per se), it then caused other problems. In fact, it caused other problems for pretty much all the same cases that the current model causes problems for: all the odd special cases that do weird things to the VM. And because these problems were so odd, the alternate solution - and the thing I'm really pushing for - is to make the _core_ VM rules very simple and straightforward, and then the odd special cases have to live with those simple and straightforward rules. 
And the most core of those rules is that "page_mapcount()" fundamentally doesn't matter, because there are other references to pages that are all equally valid. Thinking that a page being "mapped" makes it special is wrong, as exemplified by any GUP case (but also as exemplified by the page cache or the swap cache, which were always a source of _other_ special cases for the COW code).

So if you accept the notion of "page_mapcount() is meaningless" being a truism (which Andrea obviously doesn't), then the logical extension of that is the set of rules I outlined in my reply to Andy:

 (a) COW can never happen "too much", and "page_count()" is the fundamental "somebody has a reference to this page"

 (b) page pinning and any other "this needs to be coherent" ends up being a special per-page "shared memory" case

That special "shared memory page" thing in (b) is then that rule that when we pin a page, we make sure it's writable, and stays writable, so that COW never breaks the association.

That's then the thing that causes problems for anybody who wants to write-protect stuff.

Linus
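The difference between the two counters in "case A" from this thread can be replayed with a toy userspace model. This is a sketch under stated assumptions, not kernel code: struct toy_page, reuse_by_mapcount(), reuse_by_refcount() and case_a_state() are hypothetical names made up for the illustration, and the model ignores locking, the swap cache and THP.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page: only the two counters under discussion. */
struct toy_page {
    int mapcount;  /* number of page table mappings of the page */
    int refcount;  /* every reference: mappings plus GUP references etc. */
};

/* "Reuse on write fault if I am the only mapper." */
bool reuse_by_mapcount(const struct toy_page *p)
{
    return p->mapcount == 1;
}

/* "Reuse on write fault only if I hold the only reference." */
bool reuse_by_refcount(const struct toy_page *p)
{
    return p->refcount == 1;
}

/* Replay the sequence of case A and return the resulting counters. */
struct toy_page case_a_state(void)
{
    struct toy_page p = { .mapcount = 1, .refcount = 1 }; /* parent maps the page */

    p.mapcount++; p.refcount++;  /* fork(): child maps the same page, write protected */
    p.refcount++;                /* child: vmsplice() takes a GUP reference */
    p.mapcount--; p.refcount--;  /* child: munmap() drops the mapping, the GUP reference stays */

    assert(p.refcount >= p.mapcount);
    return p;
}
```

With these counters (mapcount == 1, refcount == 2), a mapcount based check would let the parent reuse the page on its write fault even though the child can still read it through the pipe, while a refcount based check would copy instead.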
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 18:19 ` Jason Gunthorpe 2021-01-08 18:31 ` Andy Lutomirski 2021-01-08 18:59 ` Linus Torvalds @ 2021-01-08 22:43 ` Andrea Arcangeli 2021-01-09 0:42 ` Jason Gunthorpe 2 siblings, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-08 22:43 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote: > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote: > > > The majority cannot be converted to notifiers because they are DMA > > > based. Every one of those is an ABI for something, and does not expect > > > extra privilege to function. It would be a major breaking change to > > > have pin_user_pages require some cap. > > > > ... what makes them safe is to be transient GUP pin and not long > > term. > > > > Please note the "long term" in the underlined line. > > Many of them are long term, though only 50 or so have been marked > specifically with FOLL_LONGTERM. I don't see how we can make such a > major ABI break. io_uring is one of those indeed and I already flagged it. This isn't a black and white issue, kernel memory is also pinned but it's not in movable pageblocks... How do you tell the VM in GUP to migrate memory to a non movable pageblock before pinning it? Because that's what it should do to create less breakage. For example iommu obviously need to be privileged, if your argument that it's enough to use the right API to take long term pins unconstrained, that's not the case. Pins are pins and prevent moving or freeing the memory, their effect is the same and again worse than mlock on many levels. 
(then I know on preempt-rt should behave like a pin, and that's fine, you disable all features for such purpose there) io_uring is fine in comparison to vmpslice but still not perfect, because it does the RLIMIT_MEMLOCK accounting but unfortunately, is tangibly unreliable since a pin can cost 2m or 1G (now 1G is basically privileged so it doesn't hurt to get the accounting wrong in such case, but it's still technically mixing counting apples as oranges). Maybe io_uring could keep not doing mmu notifier, I'd need more investigation to be sure, but what's the point of keeping it VM-breaking when it doesn't need to? Is io_uring required to setup the ring at high frequency? > Looking at it, vmsplice() is simply wrong. A long term page pin must > use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode) > FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode) > > ie it must COW and it must reject cases that are not longterm safe, > like DAX and CMA and so on. > > These are the well established rules, vmsplice does not get a pass Where are the established rules written down? pin_user_pages.rst doesn't even make a mention of FOLL_FORCE or FOLL_WRITE at all, mm.h same thing. In any case, the extra flags required in FOLL_LONGTERM should be implied by FOLL_LONGTERM itself, once it enters the gup code, because it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0), let alone having to specify FOLL_FORCE for just a read. But this is going offtopic. > simply because it is using the CPU to memory copy as its "DMA". vmsplice can't find all put_pages that may release the pages when the pipe is read, or it'd be at least be able to do the unreliable RLIMIT_MEMLOCK accounting. I'm glad we agree vmsplice is unsafe. The PR for the seccomp filter is open so if you don't mind, I'll link your review as confirmation. > > speaking in practice. io_uring has similar concern but it can use mmu > > notifier, so it can totally fix it and be 100% safe from this. 
>
> IIRC io_uring does use FOLL_LONGTERM and FOLL_WRITE..

Right it's one of those 50. FOLL_WRITE won't magically allow the memory to be swapped or migrated.

To make another example a single unprivileged pin on the movable zone, can break memhotunplug unless you use the mmu notifier. Every other advanced feature falls apart.

So again, if an unprivileged syscall allows a very limited number of pages, maybe it checks also if it got a THP or a gigapage page from the pin, it sets its own limit, maybe again it's not a big concern.

vmsplice currently with zero privilege allows this:

 2  0 1074432 9589344  13548 1321860    4    0     4   172 2052 9997  5  2 93  0  0
-> vmsplice reproducer started here
 1  0 1074432 8538184  13548 1325820    0    0     0     0 1973 8838  4  3 93  0  0
 1  0 1074432 8538436  13548 1325524    0    0     0     0 1730 8168  4  2 94  0  0
 1  0 1074432 8539096  13556 1321880    0    0     0    72 1811 8640  3  2 95  0  0
 0  0 1074432 8539348  13564 1322028    0    0     0    36 1936 8684  4  2 95  0  0
-> vmsplice killed here
 1  0 1074432 9586720  13564 1322248    0    0     0     0 1893 8514  4  2 94  0  0

That's ~1G that goes away for each task and I didn't even check if it's all THP pages getting in there, the rss is 3MB despite 1G is taken down in GUP pins with zero privilege:

 1512 pts/25   S      0:00      0     0 133147  3044  0.0 ./vmsplice

Again memcg is robust so it's not a concern for the host, the attack remains contained in the per-memcg OOM killer. It'd only reach the host OOM killer logic if the host itself does the accounting wrong and runs out of memory, and it can be enforced that this won't happen.

> > The scheduler disclosure date was 2020-08-25 so I can freely explain
> > the case that motivated all these changes.
> >
> > case A)
> >
> > if !fork() {
> >    // in child
> >    mmap one page
> >    vmsplice takes gup pin long term on such page
> >    munmap one page
> >    // mapcount == 1 (parent mm)
> >    // page_count == 2 (gup in child, and parent mm)
> > } else {
> >    parent writes to the page
> >    // mapcount == 1, wp_page_reuse
> > }
> >
> > parent did a COW with mapcount == 1 so the parent will take over a
> > page that is still GUP pinned in the child.
>
> Sorry, I missed something, how does mmaping a fresh new page in the
> child impact the parent?

Apologies... of course the "mmap" line had to be moved before fork.

> I guess the issue is not to mmap but to GUP a shared page in a way
> that doesn't trigger COW during GUP and then munmap that page so a
> future parent COW does re-use, leaking access.

Right. Jann reported the writes of the parent are readable then by reading the pipe 1 year later.

> It seems enforcing FOLL_WRITE to always COW on GUP closes this, right?

Exactly, it was supposed to do that.

And I don't mean in the caller with FOLL_WRITE/write=1 explicitly set in vmsplice, I mean with 17839856fd588f4ab6b789f482ed3ffd7c403e1f which looked great to me as a solution for it.

> This is what all correct FOLL_LONGTERM users do today, it is required
> for many other reasons beyond this interesting security issue.

Exactly. Except this also applies to O_DIRECT not just FOLL_LONGTERM, in theory. And only in theory.

Any transient GUP pin no matter which fancy API you use to take it, is enough to open the window for the above attack, not just FOLL_LONGTERM.

However only unprivileged long term GUP pins can make this race reproducible.

So this has to be fixed in the GUP core too, as it was supposed to be fixed for a while reliably (and it's not fixed anymore on current upstream if taking the GUP pin on a THP).
For those with the reproducer for the bug fixed in 17839856fd588f4ab6b789f482ed3ffd7c403e1f here's the patch to apply to reproduce it on v5.11 once again:

--- vmsplice.c	2020-05-28 03:03:26.760303487 -0400
+++ vmsplice-v5.11.c	2021-01-08 17:28:37.028747370 -0500
@@ -24 +24 @@
-  struct iovec iov = {.iov_base = data, .iov_len = 0x1000 };
+  struct iovec iov = {.iov_base = data, .iov_len = 2*1024*1024 };
@@ -26 +26 @@
-  SYSCHK(munmap(data, 0x1000));
+  SYSCHK(munmap(data, 2*1024*1024));
@@ -28,2 +28,2 @@
-  char buf[0x1000];
-  SYSCHK(read(pipe_fds[0], buf, 0x1000));
+  char buf[2*1024*1024];
+  SYSCHK(read(pipe_fds[0], buf, 2*1024*1024));
@@ -34 +34 @@
-  if (posix_memalign(&data, 0x1000, 0x1000))
+  if (posix_memalign(&data, 2*1024*1024, 2*1024*1024))
@@ -35,0 +36,2 @@
+  if (madvise(data, 2*1024*1024, MADV_HUGEPAGE))
+    errx(1, "madvise()");

$ /tmp/x
read string from child: THIS IS SECRET

I exploited it just to be sure I didn't miss something in the source review of the THP code.

So I hope after all this discussion I could at least provide one useful piece of information, if nothing else.

> > However, you know full well in the second case it is a feature and not
> > a bug, that wp_page_reuse is called instead, and in fact it has to be
> > called or it's a bug (and that's the bug page_count in do_wp_page
> > introduces).
>
> What I was trying to explain below, is I think we agreed that a page
> under active FOLL_LONGTERM pin *can not* be write protected.
>
> Establishing the FOLL_LONGTERM pin (for read or write) must *always*
> break the write protection and the VM *cannot* later establish a new
> write protection on that page while the pin is active.
>
> Indeed, it is complete nonsense to try and write protect a page that
> has active DMA write activity! Changing the CPU page protection bits
> will not stop any DMA! Doing so will inevitably become a security
> problem with an attack similar to what you described.
> > So this is what was done during fork() - fork will no longer write > protect pages under FOLL_LONGTERM to make them COWable, instead it > will copy them at fork time. > > Any other place doing write protect must also follow these same > rules. > > I wasn't aware this could be used to create a security problem, but it > does make sense. write protect really must mean writes to the memory > must stop and that is fundementally incompatible with active DMA. > > Thus write protect of pages under DMA must be forbidden, as a matter > of security. You're thinking at your use case only. Thinking long term GUP pin is read-write DMA is very reductive. There doesn't need to be DMA at all. KVM and a shadow MMU can attach to the RAM in readonly totally fine. And if it writes, it'll write not through the PCI bus, still with the CPU access. In fact Peter did an awesome work by writing the dirty ring for the KVM shadow MMU and some vmx also provides a page modification logging on some CPUs. So we have already all the dirty tracking that protects the shadow pagetable: https://kvmforum2020.sched.com/event/eE4R/kvm-dirty-ring-a-new-approach-to-logging-peter-xu-red-hat So it's completely normal that you could plug that with clear_refs and wrprotecting the linux pagetable while a KVM mapping exists that absolutely must not go out of sync. Nothing at all can go wrong, unless wp_copy_page suddenly makes the secondary MMU go out of sync the moment you wrprotect the page with clear_refs. You don't even need readonly access from DMA for the above to make sense, the above makes perfect sense even with the secondary MMU and primary MMU all writing at the same time and it must not break. Overall a design where the only safety of a secondary MMU from going out of sync comes from the wrprotection not happening looks weak. Ultimately, what do we really gain from all this breakage? 
Where are the do_wp_page benchmarks comparing 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 against b7333b58f358f38d90d78e00c1ee5dec82df10ad ? Link? Definitely there's no benchmark in the git log justifying this sudden breakage on so many levels that even re-opened the old zygote bug as shown above. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-08 22:43 ` Andrea Arcangeli @ 2021-01-09 0:42 ` Jason Gunthorpe 2021-01-09 2:50 ` Andrea Arcangeli 2021-01-13 21:56 ` Jerome Glisse 0 siblings, 2 replies; 96+ messages in thread From: Jason Gunthorpe @ 2021-01-09 0:42 UTC (permalink / raw) To: Andrea Arcangeli Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 08, 2021 at 05:43:56PM -0500, Andrea Arcangeli wrote: > On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote: > > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote: > > > > The majority cannot be converted to notifiers because they are DMA > > > > based. Every one of those is an ABI for something, and does not expect > > > > extra privilege to function. It would be a major breaking change to > > > > have pin_user_pages require some cap. > > > > > > ... what makes them safe is to be transient GUP pin and not long > > > term. > > > > > > Please note the "long term" in the underlined line. > > > > Many of them are long term, though only 50 or so have been marked > > specifically with FOLL_LONGTERM. I don't see how we can make such a > > major ABI break. > > io_uring is one of those indeed and I already flagged it. > > This isn't a black and white issue, kernel memory is also pinned but > it's not in movable pageblocks... How do you tell the VM in GUP to > migrate memory to a non movable pageblock before pinning it? Because > that's what it should do to create less breakage. 
There is already a patch series floating about to do exactly that for FOLL_LONGTERM pins based on the existing code in GUP for CMA migration > For example iommu obviously need to be privileged, if your argument > that it's enough to use the right API to take long term pins > unconstrained, that's not the case. Pins are pins and prevent moving > or freeing the memory, their effect is the same and again worse than > mlock on many levels. The ship sailed on this a decade ago, it is completely infeasible to go back now, it would completely break widely used things like GPU, RDMA and more. > Maybe io_uring could keep not doing mmu notifier, I'd need more > investigation to be sure, but what's the point of keeping it > VM-breaking when it doesn't need to? Is io_uring required to setup the > ring at high frequency? If we want to have a high speed copy_from_user like thing that is not based on page pins but on mmu notifiers, then we should make that infrastructure and the various places that need it should use common code. At least vhost and io_uring are good candidates. Otherwise, we are pretending that they are DMA and using the DMA centric pin_user_pages() interface, which we still have to support and make work. > In any case, the extra flags required in FOLL_LONGTERM should be > implied by FOLL_LONGTERM itself, once it enters the gup code, because > it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0), > let alone having to specify FOLL_FORCE for just a read. But this is > going offtopic. We really should revise this, I've been thinking for a while we need to internalize into gup.c the FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM idiom at least.. > > simply because it is using the CPU to memory copy as its "DMA". > > vmsplice can't find all put_pages that may release the pages when the > pipe is read, or it'd be at least be able to do the unreliable > RLIMIT_MEMLOCK accounting. Yikes! 
So it can't even use pin_user_pages FOLL_LONGTERM properly because that requires unpin_user_pages(), which means finding all the unpin sites too :\ > I'm glad we agree vmsplice is unsafe. The PR for the seccomp filter is > open so if you don't mind, I'll link your review as confirmation. OK > To make another example a single unprivileged pin on the movable zone, > can break memhotunplug unless you use the mmu notifier. Every other > advanced feature falls apart. As above, FOLL_LONGTERM will someday migrate pages out of movable zones. The fact that people keep adding MM features that are incompatible with FOLL_LONGTERM is troublesome. However, the people who want hot-pluggable DIMMs don't get to veto the people who want RDMA, GPU and so on out of the kernel. (or vice versa) It seems we will end up with an MM where some workloads will be incompatible with some MM features. > So again, if an unprivileged syscall allows a very limited number of > pages, maybe it checks also if it got a THP or a gigapage page from > the pin, it sets its own limit, maybe again it's not a big > concern. We also don't do a good job uniformly tracking rlimit/etc. I'd ideally like to see that in the core code too. Someone once tried that a bit but we couldn't get agreement on what the right thing was because different drivers do different things. Sigh. > Any transient GUP pin no matter which fancy API you use to take it, is > enough to open the window for the above attack, not just FOLL_LONGTERM. Yes, that is interesting. We've always known that the FOLL_LONGTERM special machinery is technically needed for O_DIRECT and basically all other cases for coherence, but till now I hadn't heard of a security argument.
It does make sense :( > For those with the reproducer for the bug fixed in > 17839856fd588f4ab6b789f482ed3ffd7c403e1f here's the patch to apply to > reproduce it on v5.11 once again: So this is still at least because vmsplice is buggy to use plain get_user_pages() for its long term usage, and buggy to not use the FOLL_FORCE|FOLL_WRITE idiom for read :\ A small patch to make vmsplice set those flags on its gup would at least robustly close this immediate security problem without whatever side effects caused the revert of the commit forcing that in GUP itself. > You're thinking at your use case only. I'm thinking about the rules to make pin_user_pages(FOLL_LONGTERM) sane and working, yes. It is an API we have that is used widely, and really needs a solid definition. This idea we can just throw it out completely is a no-go to me. There are other similar APIs, like normal GUP, hmm_range_fault, and so on, but these are different things, with different rules. > Thinking long term GUP pin is read-write DMA is very reductive. > > There doesn't need to be DMA at all. > > KVM and a shadow MMU can attach to the RAM in readonly totally > fine. And if it writes, it'll write not through the PCI bus, still > with the CPU access. That is not gup FOLL_LONGTERM, that is mmu notifiers.. mmu notifier users who are using hmm_range_fault() do not ever take any page references when doing their work, that seems like the right approach, for a shadow mmu? > Nothing at all can go wrong, unless wp_copy_page suddenly makes the > secondary MMU go out of sync the moment you wrprotect the page with > clear_refs. To be honest, I've read most of this discussion, and the prior one, between you and Linus carefully, but I still don't understand what clear_refs is about or how KVM's use of mmu notifiers got broken. This is probably because I'm only a little familiar with those areas :\ Is it actually broken or just inefficient?
If wp_copy_page is going more often than it should the secondary mmu should still fully track that? > Overall a design where the only safety of a secondary MMU from going > out of sync comes from the wrprotection not happening looks weak. To be clear, here I am only talking about pin_user_pages. We now have logic to tell if a page is under pin_user_pages(FOLL_LONGTERM) or not, and that is what is driving the copy on fork logic. secondary-mmu drivers using mmu notifier should not trigger this logic and should not restrict write protect. > Ultimately, what do we really gain from all this breakage? Well, the clean definition of pin_user_pages(FOLL_LONGTERM) is very positive for DMA drivers working in that area. Jason ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-09 0:42 ` Jason Gunthorpe @ 2021-01-09 2:50 ` Andrea Arcangeli 2021-01-11 14:30 ` Jason Gunthorpe 2021-01-13 21:56 ` Jerome Glisse 1 sibling, 1 reply; 96+ messages in thread From: Andrea Arcangeli @ 2021-01-09 2:50 UTC (permalink / raw) To: Jason Gunthorpe Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai Hello Jason, On Fri, Jan 08, 2021 at 08:42:55PM -0400, Jason Gunthorpe wrote: > There is already a patch series floating about to do exactly that for > FOLL_LONGTERM pins based on the existing code in GUP for CMA migration Sounds great. > The ship sailed on this a decade ago, it is completely infeasible to > go back now, it would completely break widely used things like GPU, > RDMA and more. For all those that aren't using mmu notifier and that rely solely on page pins, they still require privilege, except they do through /dev/ permissions. Just the fact there's no capability check in the read/write/ioctl doesn't mean those device inodes can be opened any luser: the fact the kernel allows it, doesn't mean the /dev/ permission does too. The same applies to /dev/kvm too, not just PCI device drivers. Device drivers that you need to open in /dev/ before you can take a GUP pin require whole different checks than syscalls like vmsplice and io_uring that are universally available. The very same GUP long term pinning kernel code can be perfectly safe to use without any permission check for a device driver of an iommu in /dev/, but completely unsafe for a syscall. 
> If we want to have a high speed copy_from_user like thing that is not > based on page pins but on mmu notifiers, then we should make that > infrastructure and the various places that need it should use common > code. At least vhost and io_uring are good candidates. Actually the mmu notifier doesn't strictly require pins, it only requires GUP. All users tend to use FOLL_GET just as a safety precaution (I already tried to optimize away the two atomics per GUP, but we were nacked by the KVM maintainer who didn't want to take the risk, I would have, but it's a fair point indeed, obviously it's safer with the pin plus the mmu notifier, two is safer than one). I'm not sure how any copy-user could obviate a secondary MMU mapping, mappings and copies are mutually exclusive. Any copy would be breaking memory coherency in this environment. > Otherwise, we are pretending that they are DMA and using the DMA > centric pin_user_pages() interface, which we still have to support and > make work. vhost and io_uring would be pure software constructs, but there are hardware users of the GUP pin that don't use any DMA. The long term GUP pin is not only about PCI devices doing DMA. KVM never uses any DMA, despite taking terabytes worth of very long term GUP pins. > > In any case, the extra flags required in FOLL_LONGTERM should be > > implied by FOLL_LONGTERM itself, once it enters the gup code, because > > it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0), > > let alone having to specify FOLL_FORCE for just a read. But this is > > going offtopic. > > We really should revise this, I've been thinking for a while we need > to internalize into gup.c the FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM > idiom at least.. 100% agreed. > > > simply because it is using the CPU to memory copy as its "DMA".
> > > > vmsplice can't find all put_pages that may release the pages when the > > pipe is read, or it'd be at least be able to do the unreliable > > RLIMIT_MEMLOCK accounting. > > Yikes! So it can't even use pin_user_pages FOLL_LONGTERM properly > because that requires unpin_user_pages(), which means finding all the > unpin sites too :\ Exactly. > > To make another example a single unprivileged pin on the movable zone, > > can break memhotunplug unless you use the mmu notifier. Every other > > advanced feature falls apart. > > As above FOLL_LONGTERM will someday migrate from movable zones. Something like: 1) migrate from movable zones contextually to GUP 2) be accounted on the compound_order not on the number of GUP (io_uring needs fixing here) 3) maybe account not only in rlimit, but also expose the total worth of GUP pins in page_order units (not pins) to the OOM killer to be added to the rss (will double count though). Maybe 3 is overkill but without it, OOM killer won't even see those GUP pin coming, so if not done it's still kind of unsafe, if done it'll risk double count. Even then a GUP pin, still prevents optimization, it can't converge in the right NUMA node the io ring just to make an example, but that's a secondary performance concern. The primary concern with the mmu notifier in io_uring is the take_all_locks latency. Longlived apps like qemu would be fine with mmu notifier. The main question is also if there's any short-lived latency io_uring usage... that wouldn't fly with take_all_locks. The problem with the mmu notifier as an universal solution, for example is that it can't wait for I/O completion of O_DIRECT since it has no clue where the put_page is to wait for it, otherwise we could avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be completed before paging or anything can unmap the page under I/O from the pagetable. Even if we could reliably identify all the put_page of transient pins reliably, it would need to be always on. 
Currently we go the extra mile to require zero exclusive cachelines when it's unregistered and that makes the registering a latency outlier. > The fact that people keep adding MM features that are incompatible > with FOLL_LONGTERM is troublesome. Ehm in my view it's actually FOLL_LONGTERM without the ability to use the mmu notifier that is troublesome :). It's funny how we look at the two opposite sides of the same coin. I'm sure there will be devices doing that for a very long time, but they don't need to be perfect, the current handling is satisfactory, and we can do a best effort to improve things as described above but it's not critical. > However, the people who want hot-pluggable DIMMs don't get to veto the > people who want RDMA, GPU and so on out of the kernel. (or vice versa) > > It seems we will end up with an MM where some workloads will be > incompatible with some MM features. I see the incompatibility you describe as a problem we have today, in the present, and that will fade with time. Reminds me when we had >4G of RAM and 32bit devices doing DMA. How many 32bit devices are there now? We're not talking here about any random PCI device, we're talking here about very special and very advanced devices that need to have "long term" GUP pins in order to operate, not the usual nvme/gigabit device where GUP pins are never long term. > We also don't do a good job uniformly tracking rlimit/etc. I'd > ideally like to see that in the core code too. Someone once tried that > a bit but we couldn't get agreement on what the right thing was because > different drivers do different things. Sigh. Consolidating would be great I agree. > > Any transient GUP pin no matter which fancy API you use to take it, is > > enough to open the window for the above attack, not just FOLL_LONGTERM. > > Yes, that is interesting.
We've always known that the FOLL_LONGTERM > special machinery is techincally needed for O_DIRECT and basically all > other cases for coherence, but till now I hand't heard of a security > argument. It does make sense :( The security argument is really specific to such case described, and ideally whatever fix we do to close all windows, would cover all O_DIRECT too. > > For those with the reproducer for the bug fixed in > > 17839856fd588f4ab6b789f482ed3ffd7c403e1f here's the patch to apply to > > reproduce it once on v5.11 once again: > > So this is still at least because vmsplice is buggy to use plain > get_user_pages() for it's long term usage, and buggy to not use the > FOLL_FORCE|FOLL_WRITE idiom for read :\ > > A small patch to make vmsplice set those flags on its gup would at > least robustly close this immediate security problem without whatever > side effects caused the revert of commit forcing that in GUP iteself. Exactly, if we fix vmsplice, and we close the biggest window, what remains is so small it shouldn't be practical. We still have to close all windows then. > > You're thinking at your use case only. > > I'm thinking about the rules to make pin_user_pages(FOLL_LONGTERM) > sane and working, yes. It is an API we have that is used widely, and > really needs a solid definition. This idea we can just throw it out > completely is a no-go to me. > > There are other similar APIs, like normal GUP, hmm_range_fault, and so hmm depends on mmu notifier so there's no VM interference there. > on, but these are different things, with different rules. I'm not suggesting to throw out anything. It's like if you got a 32bit device, you did bounce buffers. If you got a CPU without MMU you got to deal with MMU=n. How many Linux VM features you can use MMU=n? Is it mlock accounting required with Linux built with MMU=n? (I'd be shocked if it can actually build but still) You have to live with the limitations the hardware delivers. 
vmsplice and io_uring have no limitation and zero hardware constraint, so they have not a single valid justification, unlike device drivers. In addition, their access cannot be controlled through /dev/ permissions, as regularly happens for all device drivers. > > Thinking long term GUP pin is read-write DMA is very reductive. > > > > There doesn't need to be DMA at all. > > > > KVM and a shadow MMU can attach to the RAM in readonly totally > > fine. And if it writes, it'll write not through the PCI bus, still > > with the CPU access. > > That is not gup FOLL_LONGTERM, that is mmu notifiers.. Correct. Although KVM initially used the equivalent of FOLL_LONGTERM back then. Then KVM became the primary MMU Notifier user of course. The only difference between FOLL_LONGTERM and mmu notifier is whether the hardware is capable of handling it. There is no real difference other than that. > mmu notifier users who are using hmm_range_fault() do not ever take any > page references when doing their work, that seems like the right > approach, for a shadow mmu? They all can do like HMM or you can take the FOLL_GET as long as you remember put_page. Jerome also intended to optimize the KVM fault like that, but as said above, we were nacked on that attempt. Whether there is a pin or not makes zero semantic difference, it's purely an optimization when there is no pin, and it's a bugcheck safety feature if there is the pin. By the time it can make a runtime difference if there is the pin or not, put_page has been called already. > > Nothing at all can go wrong, unless wp_copy_page suddenly makes the > > secondary MMU go out of sync the moment you wrprotect the page with > > clear_refs. > > To be honest, I've read most of this discussion, and the prior one, > between you and Linus carefully, but I still don't understand what > clear_refs is about or how KVM's use of mmu notifiers got broken.
This > is probably because I'm only a little familiar with those areas :\ KVM use of mmu notifier is not related to this. clear_refs simply can wrprotect the page. Of any process. echo .. >/proc/self/clear_refs. Then you check in /proc/self/pagemap looking for soft dirty (or something like that). The point is that if you do echo ... >/proc/self/clear_refs on your pid, that has any FOLL_LONGTERM on its mm, it'll just cause your device driver to go out of sync with the mm. It'll see the old pages, before the spurious COWs. The CPU will use new pages (the spurious COWs). > Is it actually broken or just inefficient? If wp_copy_page is going > more often than it should the secondary mmu should still fully track > that? It's about the DMA going out of sync and losing view of the mm. In addition the TLB flush broke with the mmu_read_lock but that can be fixed somehow. The TLB flush, still only because of the spurious COWs, has now to cope with the fact that there can be spurious wp_page_copy right after wrprotecting a read-write page. Before that couldn't happen, fork() couldn't run since it takes mmap_write_lock, so if the pte was writable and transitioned to non-writable it'd mean it was a exclusive page and it would be guaranteed re-used, so the stale TLB would keep writing in place. The stale TLB is the exact same equivalent of your FOLL_LONGTERM, except it's the window the CPU has on the old page, the FOLL_LONGTERM is the window the PCI device has on the old page. The spurious COW is what makes TLB and PCI device go out of sync reading and writing to the old page, while the CPU moved on to a new page. The issue is really similar. > To be clear, here I am only talking about pin_user_pages. We now have > logic to tell if a page is under pin_user_pages(FOLL_LONGTERM) or not, > and that is what is driving the copy on fork logic. fork() wrprotects and like every other wrprotect, was just falling in the above scenario. 
> secondary-mmu drivers using mmu notifier should not trigger this logic > and should not restrict write protect. That's a great point. I didn't think the mmu notifier will invalidate the secondary MMU and ultimately issue a GUP after the wp_copy_page to keep it in sync. The funny thing that doesn't make sense is that wp_copy_page will only be invoked because the PIN was left by KVM on the page for that extra safety I was talking about earlier. Are we forced to drop all the page pins to be able to wrprotect the memory without being flooded by immediate COWs? So the ultimate breakpoint, is the FOLL_LONGTERM and no mmu notifier to go out of sync on a wrprotect, which can happen if the device is doing a readonly access long term. I quote you earlier: "A long term page pin must use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode) FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode)" You clearly contemplate the existance of a read mode, long term. That is also completely compatible with wrprotection. Why should we pick a model that forbids this to work? What do we get back from it? I only see unnecessary risk and inefficiencies coming back from it. > > Ultimately, what do we really gain from all this breakage? > > Well, the clean definition of pin_user_pages(FOLL_LONGTERM) is very > positive for DMA drivers working in that area. I was referring to page_count in do_wp_page, not pin_user_pages sorry for the confusion. Thanks, Andrea ^ permalink raw reply [flat|nested] 96+ messages in thread
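[Editor's illustration, not part of the original message.] The clear_refs / pagemap interface discussed in the message above can be exercised from user space roughly as follows. This is a minimal sketch, assuming a Linux kernel built with CONFIG_MEM_SOFT_DIRTY; the helper names (pagemap_soft_dirty, clear_soft_dirty, and so on) and the SOFT_DIRTY_DEMO_MAIN macro are hypothetical and chosen only for this example. The bit positions (55 for soft-dirty, 63 for present) and the value "4" written to clear_refs follow the kernel's pagemap and soft-dirty documentation. It does not reproduce the race discussed in the thread; it only shows the write-protect-and-track mechanism that the thread refers to.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit positions as documented for /proc/<pid>/pagemap entries. */
#define PM_SOFT_DIRTY_BIT 55
#define PM_PRESENT_BIT    63

/* Hypothetical helper: true if the pagemap entry has the soft-dirty bit set. */
static bool pagemap_soft_dirty(uint64_t entry)
{
    return (entry >> PM_SOFT_DIRTY_BIT) & 1;
}

/* Hypothetical helper: true if the page is present in RAM. */
static bool pagemap_present(uint64_t entry)
{
    return (entry >> PM_PRESENT_BIT) & 1;
}

/* Byte offset of the 64-bit pagemap entry that describes vaddr. */
static off_t pagemap_offset(uintptr_t vaddr, long page_size)
{
    return (off_t)(vaddr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* Writing "4" to clear_refs clears soft-dirty bits for the whole mm;
 * the kernel does this by write-protecting the affected PTEs. */
static int clear_soft_dirty(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "4", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}

/* Read the pagemap entry for addr; returns 0 on success, -1 on failure. */
static int read_pagemap_entry(const void *addr, uint64_t *entry)
{
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;
    off_t off = pagemap_offset((uintptr_t)addr, sysconf(_SC_PAGESIZE));
    ssize_t n = -1;
    if (lseek(fd, off, SEEK_SET) == off)
        n = read(fd, entry, sizeof(*entry));
    close(fd);
    return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}

#ifdef SOFT_DIRTY_DEMO_MAIN
int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    char *page = aligned_alloc((size_t)page_size, (size_t)page_size);
    uint64_t entry = 0;
    if (!page)
        return 1;

    ((volatile char *)page)[0] = 1;   /* fault the page in */
    if (clear_soft_dirty())           /* clear soft-dirty, write-protect PTEs */
        return 1;
    ((volatile char *)page)[0] = 2;   /* write fault should set soft-dirty again */
    if (read_pagemap_entry(page, &entry))
        return 1;

    printf("present=%d soft_dirty=%d\n",
           pagemap_present(entry), pagemap_soft_dirty(entry));
    free(page);
    return 0;
}
#endif
```

Building with -DSOFT_DIRTY_DEMO_MAIN gives a small standalone program; what it prints depends on the running kernel's configuration, so no expected output is given here.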
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-09 2:50 ` Andrea Arcangeli @ 2021-01-11 14:30 ` Jason Gunthorpe 0 siblings, 0 replies; 96+ messages in thread From: Jason Gunthorpe @ 2021-01-11 14:30 UTC (permalink / raw) To: Andrea Arcangeli Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 08, 2021 at 09:50:08PM -0500, Andrea Arcangeli wrote: > For all those that aren't using mmu notifier and that rely solely on > page pins, they still require privilege, except they do through /dev/ > permissions. It is normal that the dev nodes are a+rw so it doesn't really require privilege in any real sense. > Actually the mmu notifier doesn't strictly require pins, it only > requires GUP. All users tend to use FOLL_GET just as a safety > precaution (I already tried to optimize away the two atomics per GUP, > but we were naked by the KVM maintainer that didn't want to take the > risk, I would have, but it's a fair point indeed, obviously it's safer > with the pin plus the mmu notifier, two is safer than one). I'm not sure what holding the pin will do to reduce risk? If you get into a situation where you are stuffing a page into the SMMU that is not in the CPU's MMU then everything is lost. Holding a pin while carrying a page from the CPU page table to the SMMU just ensures that page isn't freed until it is installed, but once installed you are back to being broken. > I'm not sure how any copy-user could obviate a secondary MMU mapping, > mappings and copies are mutually exclusive. Any copy would be breaking > memory coherency in this environment. Because most places need to copy from user to stable kernel memory before processing data under user control. 
You can't just cast a user controlled pointer to a kstruct and use it - that is very likely a security bug. Still, the general version is something like kmap:

    map = user_map_setup(user_ptr, length)
    kptr = user_map_enter(map)
    [use kptr]
    user_map_leave(map, kptr)

And inside it could use mmu notifiers, or gup, or whatever. user_map_setup() would register the notifier and user_map_enter() would validate the cached page pointer and block cached invalidation until user_map_leave(). > The primary concern with the mmu notifier in io_uring is the > take_all_locks latency. Just enabling mmu_notifier takes a performance hit on the entire process too, it is not such a simple decision.. We'd need benchmarks against a database or scientific application to see how negative the notifier actually becomes. > The problem with the mmu notifier as a universal solution, for > example is that it can't wait for I/O completion of O_DIRECT since it > has no clue where the put_page is to wait for it, otherwise we could > avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be > completed before paging or anything can unmap the page under I/O from > the pagetable. GPU is already doing something like this, waiting in a notifier invalidate callback for DMA to finish before allowing invalidate to complete. It is horrendously complicated and I'm not sure blocking invalidate for a long time is actually much better for the MM.. > I see the incompatibility you describe as problem we have today, in > the present, and that will fade with time. > > Reminds me when we had >4G of RAM and 32bit devices doing DMA. How > many 32bit devices are there now? I'm not so sure anymore. A few years ago OpenCAPI and PCI PRI seemed like good things, but now with experience they carry pretty bad performance hits to use them. Lots of places are skipping them. CXL offers another chance at this, so we'll see again in another 5 years or so if it works out.
It is not an easy problem to solve from a HW perspective. > We're not talking here about any random PCI device, we're talking here > about very special and very advanced devices that need to have "long > term" GUP pins in order to operate, not the usual nvme/gigabit device > where GUP pins are never long term. Beyond RDMA, netdev's XDP uses FOLL_LONGTERM, so do various video devices and lots of things related to virtualization like vfio, vdpa and vhost. I think it is a bit defeatist to say it doesn't matter. If anything as time goes on it seems to be growing, not shrinking currently. > The point is that if you do echo ... >/proc/self/clear_refs on your > pid, that has any FOLL_LONGTERM on its mm, it'll just cause your > device driver to go out of sync with the mm. It'll see the old pages, > before the spurious COWs. The CPU will use new pages (the spurious > COWs). But if you do that then clear_refs isn't going to work the way it thought either - this first needs some explanation for how clear_refs is supposed to work when DMA WRITE is active on the page. I'd certainly say causing a loss of synchrony is not acceptable, so if we keep Linus's version of COW then clear_refs has to not write protect pages under DMA. > > secondary-mmu drivers using mmu notifier should not trigger this logic > > and should not restrict write protect. > > That's a great point. I didn't think the mmu notifier will invalidate > the secondary MMU and ultimately issue a GUP after the wp_copy_page to > keep it in sync. It had better, or mmu notifiers are broken, right? > The funny thing that doesn't make sense is that wp_copy_page will only > be invoked because the PIN was left by KVM on the page for that extra > safety I was talking about earlier. Yes, with the COW change if kvm cares about this inefficiency it should not have the unnecessary pin. > You clearly contemplate the existence of a read mode, long term. That > is also completely compatible with wrprotection.
We talked about a read mode, but we didn't flesh it out. It is not unconditionally compatible with wrprotect - most likely you still can't write protect a page under READ DMA because when you eventually take the COW there will be ambiguous situations that will break the synchrony. Jason ^ permalink raw reply [flat|nested] 96+ messages in thread
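[Editor's illustration, not part of the original message.] The kmap-like interface proposed in the message above (user_map_setup / user_map_enter / user_map_leave) exists only as pseudocode in this thread. The sketch below is a single-threaded user-space toy model of the stated idea, not kernel code: the struct layout, the `lookups` counter, the user_map_invalidate() function and the USER_MAP_DEMO_MAIN macro are all assumptions made for illustration. In particular, a real mmu notifier invalidate callback would wait for the current user to leave; this model simply reports that it would have to wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model only: none of these names are real kernel interfaces. */
struct user_map {
    void *user_ptr;        /* address handed in by "userspace" */
    size_t length;         /* length of the mapped range */
    void *cached;          /* cached translation, NULL when stale */
    unsigned int users;    /* number of active enter() sections */
    unsigned int lookups;  /* how often the cache had to be refilled */
};

/* Stand-in for registering the notifier: start with an empty cache. */
static void user_map_setup(struct user_map *map, void *user_ptr, size_t length)
{
    map->user_ptr = user_ptr;
    map->length = length;
    map->cached = NULL;
    map->users = 0;
    map->lookups = 0;
}

/* Validate (or refill) the cached pointer and mark the map as in use. */
static void *user_map_enter(struct user_map *map)
{
    if (!map->cached) {
        map->cached = map->user_ptr;  /* models the slow GUP-like lookup */
        map->lookups++;
    }
    map->users++;
    return map->cached;
}

/* End of the critical section; invalidation may proceed afterwards. */
static void user_map_leave(struct user_map *map, void *kptr)
{
    (void)kptr;
    if (map->users)
        map->users--;
}

/* Stand-in for the notifier invalidate callback. Returns false where a
 * real implementation would block until user_map_leave() is called. */
static bool user_map_invalidate(struct user_map *map)
{
    if (map->users)
        return false;
    map->cached = NULL;
    return true;
}

#ifdef USER_MAP_DEMO_MAIN
int main(void)
{
    char buf[16] = "hello";
    struct user_map map;

    user_map_setup(&map, buf, sizeof(buf));
    char *kptr = user_map_enter(&map);
    assert(!user_map_invalidate(&map));  /* blocked while in use */
    printf("first byte: %c\n", kptr[0]);
    user_map_leave(&map, kptr);
    assert(user_map_invalidate(&map));   /* allowed once idle */
    return 0;
}
#endif
```

The design point the model tries to capture is that the fast path (enter with a valid cache) takes no page reference at all; only an invalidation forces the next enter to redo the lookup.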
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-09 0:42 ` Jason Gunthorpe 2021-01-09 2:50 ` Andrea Arcangeli @ 2021-01-13 21:56 ` Jerome Glisse 2021-01-13 23:39 ` Jason Gunthorpe 1 sibling, 1 reply; 96+ messages in thread From: Jerome Glisse @ 2021-01-13 21:56 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Fri, Jan 08, 2021 at 08:42:55PM -0400, Jason Gunthorpe wrote: > On Fri, Jan 08, 2021 at 05:43:56PM -0500, Andrea Arcangeli wrote: > > On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote: > > > On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote: > > > > > The majority cannot be converted to notifiers because they are DMA > > > > > based. Every one of those is an ABI for something, and does not expect > > > > > extra privilege to function. It would be a major breaking change to > > > > > have pin_user_pages require some cap. > > > > > > > > ... what makes them safe is to be transient GUP pin and not long > > > > term. > > > > > > > > Please note the "long term" in the underlined line. > > > > > > Many of them are long term, though only 50 or so have been marked > > > specifically with FOLL_LONGTERM. I don't see how we can make such a > > > major ABI break. > > > > io_uring is one of those indeed and I already flagged it. > > > > This isn't a black and white issue, kernel memory is also pinned but > > it's not in movable pageblocks... How do you tell the VM in GUP to > > migrate memory to a non movable pageblock before pinning it? Because > > that's what it should do to create less breakage. 
> > There is already a patch series floating about to do exactly that for > FOLL_LONGTERM pins based on the existing code in GUP for CMA migration > > > For example iommu obviously need to be privileged, if your argument > > that it's enough to use the right API to take long term pins > > unconstrained, that's not the case. Pins are pins and prevent moving > > or freeing the memory, their effect is the same and again worse than > > mlock on many levels. > > The ship sailed on this a decade ago, it is completely infeasible to > go back now, it would completely break widely used things like GPU, > RDMA and more. > I am late to this but GPU should not be used as an excuse for GUP. GUP is a broken model and the way GPUs use GUP is less broken than RDMA. In GPU drivers the GUP contract with userspace is that the data the GPU can access is a snapshot of what the process memory was at the time you asked for the GUP. The process can start using different pages right after. There is no constant coherency contract (i.e. CPU and GPU can be working on different pages). If you want coherency, i.e. always have CPU and GPU work on the same page, then you need to use mmu notifier and avoid pinning pages. Anything that does not abide by mmu notifier is broken and cannot be fixed. Cheers, Jérôme ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-13 21:56 ` Jerome Glisse @ 2021-01-13 23:39 ` Jason Gunthorpe 2021-01-14 2:35 ` Jerome Glisse 0 siblings, 1 reply; 96+ messages in thread From: Jason Gunthorpe @ 2021-01-13 23:39 UTC (permalink / raw) To: Jerome Glisse Cc: Andrea Arcangeli, linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Wed, Jan 13, 2021 at 04:56:38PM -0500, Jerome Glisse wrote: > is a broken model and the way GPUs use GUP is less broken than RDMA. In > GPU drivers the GUP contract with userspace is that the data the GPU can > access is a snapshot of what the process memory was at the time you > asked for the GUP. The process can start using different pages right after. > There is no constant coherency contract (i.e. CPU and GPU can be working > on different pages). Look at the habana labs "totally not a GPU" driver, it doesn't work that way, GPU compute operations do want coherency. The mmu notifier hackery some of the other GPU drivers use to get coherency requires putting the kernel between every single work submission, and has all kinds of wonky issues and limitations - I think it is a net worse approach than GUP, honestly. Jason ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy 2021-01-13 23:39 ` Jason Gunthorpe @ 2021-01-14 2:35 ` Jerome Glisse 0 siblings, 0 replies; 96+ messages in thread From: Jerome Glisse @ 2021-01-14 2:35 UTC (permalink / raw) To: Jason Gunthorpe Cc: Andrea Arcangeli, linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra, Linus Torvalds, Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook, John Hubbard, Leon Romanovsky, Jan Kara, Kirill Tkhai On Wed, Jan 13, 2021 at 07:39:36PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 13, 2021 at 04:56:38PM -0500, Jerome Glisse wrote: > > > is a broken model and the way GPUs use GUP is less broken than RDMA. In > > GPU drivers the GUP contract with userspace is that the data the GPU can > > access is a snapshot of what the process memory was at the time you > > asked for the GUP. The process can start using different pages right after. > > There is no constant coherency contract (i.e. CPU and GPU can be working > > on different pages). > > Look at the habana labs "totally not a GPU" driver, it doesn't work > that way, GPU compute operations do want coherency. > > The mmu notifier hackery some of the other GPU drivers use to get > coherency requires putting the kernel between every single work > submission, and has all kinds of wonky issues and limitations - I > think it is a net worse approach than GUP, honestly. Yes, what GPU drivers do today with GUP is wrong, but it is only used for texture upload/download. So that is a very limited scope (amdkfd being an exception here). Yes also to the fact that waiting on a GPU fence from an mmu notifier callback is bad. We are thinking about how to solve this. But what does matter is that hardware is moving in the right direction and we will no longer need GUP. So GUP is dying out in GPU drivers. Cheers, Jérôme ^ permalink raw reply [flat|nested] 96+ messages in thread
* Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
  [not found] ` <20210109034958.6928-1-hdanton@sina.com>
@ 2021-01-11 14:39 ` Jason Gunthorpe
  0 siblings, 0 replies; 96+ messages in thread
From: Jason Gunthorpe @ 2021-01-11 14:39 UTC
To: Hillf Danton
Cc: linux-mm, linux-kernel, Yu Zhao, Andy Lutomirski, Peter Xu, Jann Horn

On Sat, Jan 09, 2021 at 11:49:58AM +0800, Hillf Danton wrote:
> On Fri, 8 Jan 2021 14:19:45 -0400 Jason Gunthorpe wrote:
> >
> > What I was trying to explain below, is I think we agreed that a page
> > under active FOLL_LONGTERM pin *can not* be write protected.
> >
> > Establishing the FOLL_LONGTERM pin (for read or write) must *always*
> > break the write protection and the VM *cannot* later establish a new
> > write protection on that page while the pin is active.
> >
> > Indeed, it is complete nonsense to try and write protect a page that
> > has active DMA write activity! Changing the CPU page protection bits
> > will not stop any DMA! Doing so will inevitably become a security
> > problem with an attack similar to what you described.
> >
> > So this is what was done during fork() - fork will no longer write
> > protect pages under FOLL_LONGTERM to make them COWable, instead it
> > will copy them at fork time.
>
> Is it, in a step forward, unlikely for DMA write activity to happen
> during page copy at fork?

I'm not sure it matters, it is not that much different than CPU write
activity concurrent to fork().

fork() will capture some point in time - if the application cares that
this data is coherent during fork() then it has to deliberately cause
coherence somehow.

DMA just has fewer options for the application to create the coherency
because of data tearing during the page copy.

Jason
* Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
  2021-01-05 20:39 ` Andrea Arcangeli
  2021-01-05 21:20 ` Yu Zhao
  2021-01-05 21:22 ` Nadav Amit
@ 2021-01-05 21:55 ` Peter Xu
  2 siblings, 0 replies; 96+ messages in thread
From: Peter Xu @ 2021-01-05 21:55 UTC
To: Andrea Arcangeli
Cc: Nadav Amit, linux-mm, lkml, Yu Zhao, Andy Lutomirski,
    Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim,
    Will Deacon, Peter Zijlstra

On Tue, Jan 05, 2021 at 03:39:35PM -0500, Andrea Arcangeli wrote:
> I'd suggest to coordinate with Peter on that, since I wasn't planning
> to work on this if somebody else offered to do it.

Thanks, Andrea.

Nadav, please go ahead with whatever patch(es) you have in mind. Please
let me know if you prefer me to do it, or I'll wait for your new
version.

Thanks,

--
Peter Xu
* Re: [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes
  2020-12-25  9:25 [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Nadav Amit
  2020-12-25  9:25 ` [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect Nadav Amit
  2020-12-25  9:25 ` [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup Nadav Amit
@ 2021-03-02 22:13 ` Peter Xu
  2021-03-02 22:14 ` Nadav Amit
  2 siblings, 1 reply; 96+ messages in thread
From: Peter Xu @ 2021-03-02 22:13 UTC
To: Nadav Amit
Cc: linux-mm, linux-kernel, Nadav Amit, Andrea Arcangeli, Yu Zhao,
    Andy Lutomirski, Pavel Emelyanov, Mike Kravetz, Mike Rapoport,
    Minchan Kim, Will Deacon, Peter Zijlstra

On Fri, Dec 25, 2020 at 01:25:27AM -0800, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
>
> This patch-set went from v1 to RFCv2, as there is still an ongoing
> discussion regarding the way of solving the recently found races due to
> deferred TLB flushes. These patches are only sent for reference for now,
> and can be applied later if no better solution is taken.
>
> In a nutshell, write-protecting PTEs with deferred TLB flushes was mostly
> performed while holding mmap_lock for write. This prevented concurrent
> page-fault handler invocations from mistakenly assuming that a page is
> write-protected when in fact, due to the deferred TLB flush, other CPU
> could still write to the page. Such a write can cause a memory
> corruption if it takes place after the page was copied (in
> cow_user_page()), and before the PTE was flushed (by wp_page_copy()).
>
> However, the userfaultfd and soft-dirty mechanisms did not take
> mmap_lock for write, but only for read, which made such races possible.
> Since commit 09854ba94c6a ("mm: do_wp_page() simplification") these
> races became more likely to take place as non-COW'd pages are more
> likely to be COW'd instead of being reused. Both of the races that
> these patches are intended to resolve were produced on v5.10.
>
> To avoid the performance overhead some alternative solutions that do not
> require to acquire mmap_lock for write were proposed, specifically for
> userfaultfd. So far no better solution that can be backported was
> proposed for the soft-dirty case.
>
> v1->RFCv2:
> - Better (i.e., correct) description of the userfaultfd buggy case [Yu]
> - Patch for the soft-dirty case

Nadav,

Do you plan to post a new version to fix the tlb corrupt issue that this series
wanted to solve?

Thanks,

--
Peter Xu
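The race described in the quoted cover letter can be sketched as a toy,
single-threaded C model. This is only an illustration under stated
assumptions, not kernel code: the struct names and the helpers tlb_fill(),
cpu_store() and run_wp_race() are hypothetical, and the numbered steps merely
stand in for the write-protect, the copy done in cow_user_page(), and the PTE
update and flush done by wp_page_copy().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types; none of these are kernel structures. */
struct page { int data; };
struct pte { struct page *page; bool writable; };
struct tlb_entry { struct page *page; bool writable; bool valid; };

/* Refill one CPU's cached translation from the page table. */
static void tlb_fill(struct tlb_entry *tlb, const struct pte *pte)
{
	tlb->page = pte->page;
	tlb->writable = pte->writable;
	tlb->valid = true;
}

/*
 * A store issued by a CPU. A valid cached translation is used as-is,
 * even if the PTE has changed since. Returns false if the store would
 * take a write fault instead of completing.
 */
static bool cpu_store(struct tlb_entry *tlb, const struct pte *pte, int val)
{
	if (!tlb->valid)
		tlb_fill(tlb, pte);
	if (!tlb->writable)
		return false;
	tlb->page->data = val;
	return true;
}

/* Returns the value visible through the PTE once the COW has finished. */
int run_wp_race(bool flush_immediately)
{
	struct page old_page = { .data = 0 }, new_page = { .data = 0 };
	struct pte pte = { .page = &old_page, .writable = true };
	struct tlb_entry writer_tlb = { 0 };
	bool stored;

	/* The writer CPU has a writable translation cached. */
	tlb_fill(&writer_tlb, &pte);

	/* 1. The PTE is write-protected; the TLB flush may be deferred. */
	pte.writable = false;
	if (flush_immediately)
		writer_tlb.valid = false;

	/* 2. A fault handler sees the protected PTE and copies the page. */
	new_page.data = old_page.data;

	/* 3. The writer CPU stores 42, possibly through a stale entry. */
	stored = cpu_store(&writer_tlb, &pte, 42);

	/* 4. The handler installs the copy and flushes the TLB. */
	pte.page = &new_page;
	pte.writable = true;
	writer_tlb.valid = false;

	/* 5. A store that faulted in step 3 is retried after the COW. */
	if (!stored) {
		stored = cpu_store(&writer_tlb, &pte, 42);
		assert(stored);
	}
	return pte.page->data;
}
```

In this model run_wp_race(false) returns 0: the store lands in the old page
after it was copied, so it is lost once the copy is installed. With
run_wp_race(true) the store faults, is retried against the new page, and the
function returns 42.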
* Re: [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes
  2021-03-02 22:13 ` [RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes Peter Xu
@ 2021-03-02 22:14 ` Nadav Amit
  0 siblings, 0 replies; 96+ messages in thread
From: Nadav Amit @ 2021-03-02 22:14 UTC
To: Peter Xu
Cc: Linux-MM, LKML, Andrea Arcangeli, Yu Zhao, Andy Lutomirski,
    Pavel Emelyanov, Mike Kravetz, Mike Rapoport, Minchan Kim,
    Will Deacon, Peter Zijlstra

> On Mar 2, 2021, at 2:13 PM, Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Dec 25, 2020 at 01:25:27AM -0800, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>>
>> This patch-set went from v1 to RFCv2, as there is still an ongoing
>> discussion regarding the way of solving the recently found races due to
>> deferred TLB flushes. These patches are only sent for reference for now,
>> and can be applied later if no better solution is taken.
>>
>> In a nutshell, write-protecting PTEs with deferred TLB flushes was mostly
>> performed while holding mmap_lock for write. This prevented concurrent
>> page-fault handler invocations from mistakenly assuming that a page is
>> write-protected when in fact, due to the deferred TLB flush, other CPU
>> could still write to the page. Such a write can cause a memory
>> corruption if it takes place after the page was copied (in
>> cow_user_page()), and before the PTE was flushed (by wp_page_copy()).
>>
>> However, the userfaultfd and soft-dirty mechanisms did not take
>> mmap_lock for write, but only for read, which made such races possible.
>> Since commit 09854ba94c6a ("mm: do_wp_page() simplification") these
>> races became more likely to take place as non-COW'd pages are more
>> likely to be COW'd instead of being reused. Both of the races that
>> these patches are intended to resolve were produced on v5.10.
>>
>> To avoid the performance overhead some alternative solutions that do not
>> require to acquire mmap_lock for write were proposed, specifically for
>> userfaultfd. So far no better solution that can be backported was
>> proposed for the soft-dirty case.
>>
>> v1->RFCv2:
>> - Better (i.e., correct) description of the userfaultfd buggy case [Yu]
>> - Patch for the soft-dirty case
>
> Nadav,
>
> Do you plan to post a new version to fix the tlb corrupt issue that this series
> wanted to solve?

Yes, yes. Sorry for that. Will do so later today.

Regards,
Nadav