* [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
@ 2020-11-26 22:23 Peter Xu
  2020-11-27  8:16 ` Mike Rapoport
  ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Peter Xu @ 2020-11-26 22:23 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: peterx, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Mike Rapoport

Faulting around on reads is in most cases helpful for performance, since nearby memory accesses may then avoid another trip through the page fault path. However, it may not always work as expected.

For example, userfaultfd-registered regions may not be the best candidates for pre-faulting around reads.

For missing mode uffds, fault-around does not help: if the page cache existed, then the page should be there already, and if the page cache is not there, there is nothing else we can do either. If the fault-around code is destined to be helpless for userfault-missing vmas, then ideally we can skip it.

For wr-protected mode uffds, erroneously faulting in those pages could lead to threads accessing the pages without the uffd server's awareness. For example, when punching holes on uffd-wp registered shmem regions, we'll first try to unmap all the pages before evicting the page cache, but without locking the pages (please refer to shmem_fallocate(), where unmap_mapping_range() is called before shmem_truncate_range()). When fault-around happens near a hole being punched, we might erroneously fault in the "holes" right before they are punched. That leaves a small window in which the pages are writable again but the page cache has not yet been dropped (NOTE: the uffd-wp protect information is totally lost due to the pre-unmap in shmem_fallocate(), so the pages can be writable within that window). That's severe data loss.

Let's grant userspace full control of the uffd-registered ranges, rather than trying to do the tricks.
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
Note that no file-backed uffd-wp support is upstream yet, so the uffd-wp check is not actually functional. However, since we already have all the necessary uffd-wp concepts upstream, maybe it's better to do it once and for all.

This patch comes from debugging a data loss issue when working on the uffd-wp support on shmem/hugetlbfs. I posted this out for early review and comments, but also because it should already start to benefit missing mode userfaultfd by avoiding fault-around on reads.
---
 include/linux/userfaultfd_k.h |  5 +++++
 mm/memory.c                   | 17 +++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a8e5f3ea9bb2..451d99bb3a1a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -62,6 +62,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_UFFD_WP;
 }
 
+static inline bool vma_registered_userfaultfd(struct vm_area_struct *vma)
+{
+	return userfaultfd_missing(vma) || userfaultfd_wp(vma);
+}
+
 static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
 				      pte_t pte)
 {
diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..ca58ada94c96 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3933,6 +3933,23 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	int off;
 	vm_fault_t ret = 0;
 
+	/*
+	 * Be extremely careful with uffd-armed regions.
+	 *
+	 * For missing mode uffds, fault around does not help because if the
+	 * page cache existed, then the page should be there already.  If the
+	 * page cache is not there, nothing else we can do either.
+	 *
+	 * For wr-protected mode uffds, erroneously faulting in those pages
+	 * could lead to threads accessing the pages without the uffd server's
+	 * awareness, which could ultimately cause silent data corruption.
+	 *
+	 * The idea is that, for uffd-registered regions, userspace should be
+	 * in full control of which pages get faulted in.
+	 */
+	if (unlikely(vma_registered_userfaultfd(vmf->vma)))
+		return 0;
+
 	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
 	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
-- 
2.26.2

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads 2020-11-26 22:23 [PATCH] mm: Don't fault around userfaultfd-registered regions on reads Peter Xu @ 2020-11-27 8:16 ` Mike Rapoport 2020-11-27 13:31 ` Peter Xu 2020-11-27 12:22 ` Matthew Wilcox 2020-11-27 17:00 ` David Hildenbrand 2 siblings, 1 reply; 12+ messages in thread From: Mike Rapoport @ 2020-11-27 8:16 UTC (permalink / raw) To: Peter Xu Cc: linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Mike Rapoport On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote: > Faulting around for reads are in most cases helpful for the performance so that > continuous memory accesses may avoid another trip of page fault. However it > may not always work as expected. > > For example, userfaultfd registered regions may not be the best candidate for > pre-faults around the reads. > > For missing mode uffds, fault around does not help because if the page cache > existed, then the page should be there already. If the page cache is not > there, nothing else we can do, either. If the fault-around code is destined to > be helpless for userfault-missing vmas, then ideally we can skip it. > > For wr-protected mode uffds, errornously fault in those pages around could lead > to threads accessing the pages without uffd server's awareness. For example, > when punching holes on uffd-wp registered shmem regions, we'll first try to > unmap all the pages before evicting the page cache but without locking the > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called > before shmem_truncate_range()). When fault-around happens near a hole being > punched, we might errornously fault in the "holes" right before it will be > punched. 
Then there's a small window before the page cache was finally > dropped, and after the page will be writable again (NOTE: the uffd-wp protect > information is totally lost due to the pre-unmap in shmem_fallocate(), so the > page can be writable within the small window). That's severe data loss. > > Let's grant the userspace full control of the uffd-registered ranges, rather > than trying to do the tricks. > > Cc: Hugh Dickins <hughd@google.com> > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> > Signed-off-by: Peter Xu <peterx@redhat.com> One nit below, except that Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> > --- > > Note that since no file-backed uffd-wp support is there yet upstream, so the > uffd-wp check is actually not really functioning. However since we have all > the necessary uffd-wp concepts already upstream, maybe it's better to do it > once and for all. > > This patch comes from debugging a data loss issue when working on the uffd-wp > support on shmem/hugetlbfs. I posted this out for early review and comments, > but also because it should already start to benefit missing mode userfaultfd to > avoid trying to fault around on reads. > --- > include/linux/userfaultfd_k.h | 5 +++++ > mm/memory.c | 17 +++++++++++++++++ > 2 files changed, 22 insertions(+) > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h > index a8e5f3ea9bb2..451d99bb3a1a 100644 > --- a/include/linux/userfaultfd_k.h > +++ b/include/linux/userfaultfd_k.h > @@ -62,6 +62,11 @@ static inline bool userfaultfd_wp(struct vm_area_struct *vma) > return vma->vm_flags & VM_UFFD_WP; > } > > +static inline bool vma_registered_userfaultfd(struct vm_area_struct *vma) > +{ > + return userfaultfd_missing(vma) || userfaultfd_wp(vma); > +} We have userfaultfd_armed() that does exactly this, don't we? 
> + > static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma, > pte_t pte) > { > diff --git a/mm/memory.c b/mm/memory.c > index eeae590e526a..ca58ada94c96 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3933,6 +3933,23 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf) > int off; > vm_fault_t ret = 0; > > + /* > + * Be extremely careful with uffd-armed regions. > + * > + * For missing mode uffds, fault around does not help because if the > + * page cache existed, then the page should be there already. If the > + * page cache is not there, nothing else we can do either. > + * > + * For wr-protected mode uffds, errornously fault in those pages around > + * could lead to threads accessing the pages without uffd server's > + * awareness, finally it could cause ghostly data corruption. > + * > + * The idea is that, every single page of uffd regions should be > + * governed by the userspace on which page to fault in. > + */ > + if (unlikely(vma_registered_userfaultfd(vmf->vma))) > + return 0; > + > nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT; > mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK; > > -- > 2.26.2 > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads 2020-11-27 8:16 ` Mike Rapoport @ 2020-11-27 13:31 ` Peter Xu 0 siblings, 0 replies; 12+ messages in thread From: Peter Xu @ 2020-11-27 13:31 UTC (permalink / raw) To: Mike Rapoport Cc: linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Mike Rapoport On Fri, Nov 27, 2020 at 10:16:05AM +0200, Mike Rapoport wrote: > On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote: > > Faulting around for reads are in most cases helpful for the performance so that > > continuous memory accesses may avoid another trip of page fault. However it > > may not always work as expected. > > > > For example, userfaultfd registered regions may not be the best candidate for > > pre-faults around the reads. > > > > For missing mode uffds, fault around does not help because if the page cache > > existed, then the page should be there already. If the page cache is not > > there, nothing else we can do, either. If the fault-around code is destined to > > be helpless for userfault-missing vmas, then ideally we can skip it. > > > > For wr-protected mode uffds, errornously fault in those pages around could lead > > to threads accessing the pages without uffd server's awareness. For example, > > when punching holes on uffd-wp registered shmem regions, we'll first try to > > unmap all the pages before evicting the page cache but without locking the > > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called > > before shmem_truncate_range()). When fault-around happens near a hole being > > punched, we might errornously fault in the "holes" right before it will be > > punched. Then there's a small window before the page cache was finally > > dropped, and after the page will be writable again (NOTE: the uffd-wp protect > > information is totally lost due to the pre-unmap in shmem_fallocate(), so the > > page can be writable within the small window). That's severe data loss. 
> > > > Let's grant the userspace full control of the uffd-registered ranges, rather > > than trying to do the tricks. > > > > Cc: Hugh Dickins <hughd@google.com> > > Cc: Andrea Arcangeli <aarcange@redhat.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> > > Signed-off-by: Peter Xu <peterx@redhat.com> > > One nit below, except that > > Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Thanks! > > +static inline bool vma_registered_userfaultfd(struct vm_area_struct *vma) > > +{ > > + return userfaultfd_missing(vma) || userfaultfd_wp(vma); > > +} > > We have userfaultfd_armed() that does exectly this, don't we? Yes, will fix that up. -- Peter Xu ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads 2020-11-26 22:23 [PATCH] mm: Don't fault around userfaultfd-registered regions on reads Peter Xu 2020-11-27 8:16 ` Mike Rapoport @ 2020-11-27 12:22 ` Matthew Wilcox 2020-11-27 13:51 ` Peter Xu 2020-11-28 0:33 ` Andrea Arcangeli 2020-11-27 17:00 ` David Hildenbrand 2 siblings, 2 replies; 12+ messages in thread From: Matthew Wilcox @ 2020-11-27 12:22 UTC (permalink / raw) To: Peter Xu Cc: linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Mike Rapoport On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote: > For missing mode uffds, fault around does not help because if the page cache > existed, then the page should be there already. If the page cache is not > there, nothing else we can do, either. If the fault-around code is destined to > be helpless for userfault-missing vmas, then ideally we can skip it. But it might have been faulted into the cache by another task, so skipping it is bad. > For wr-protected mode uffds, errornously fault in those pages around could lead > to threads accessing the pages without uffd server's awareness. For example, > when punching holes on uffd-wp registered shmem regions, we'll first try to > unmap all the pages before evicting the page cache but without locking the > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called > before shmem_truncate_range()). When fault-around happens near a hole being > punched, we might errornously fault in the "holes" right before it will be > punched. Then there's a small window before the page cache was finally > dropped, and after the page will be writable again (NOTE: the uffd-wp protect > information is totally lost due to the pre-unmap in shmem_fallocate(), so the > page can be writable within the small window). That's severe data loss. Sounds like you have a missing page_mkwrite implementation. 
> This patch comes from debugging a data loss issue when working on the uffd-wp > support on shmem/hugetlbfs. I posted this out for early review and comments, > but also because it should already start to benefit missing mode userfaultfd to > avoid trying to fault around on reads. A measurable difference? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads 2020-11-27 12:22 ` Matthew Wilcox @ 2020-11-27 13:51 ` Peter Xu 2020-11-28 0:33 ` Andrea Arcangeli 1 sibling, 0 replies; 12+ messages in thread From: Peter Xu @ 2020-11-27 13:51 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Mike Rapoport Matthew, Thanks for the review comments. On Fri, Nov 27, 2020 at 12:22:24PM +0000, Matthew Wilcox wrote: > On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote: > > For missing mode uffds, fault around does not help because if the page cache > > existed, then the page should be there already. If the page cache is not > > there, nothing else we can do, either. If the fault-around code is destined to > > be helpless for userfault-missing vmas, then ideally we can skip it. > > But it might have been faulted into the cache by another task, so skipping > it is bad. Is there a real use case like that? I thought about it; at least qemu postcopy doesn't work like that, and I believe CRIU doesn't either, but Mike can correct me. Even if there is such a case, for example if task A tries to copy the pages over to a tmpfs file while task B accesses the pages with uffd missing registered, the ideal solution is still that when a page is copied task A can install the pte for the current task, rather than relying on fault-around on reads, IMHO. I don't know whether there's a way to do it, though. > > > For wr-protected mode uffds, errornously fault in those pages around could lead > > to threads accessing the pages without uffd server's awareness. For example, > > when punching holes on uffd-wp registered shmem regions, we'll first try to > > unmap all the pages before evicting the page cache but without locking the > > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called > > before shmem_truncate_range()). 
When fault-around happens near a hole being > > punched, we might errornously fault in the "holes" right before it will be > > punched. Then there's a small window before the page cache was finally > > dropped, and after the page will be writable again (NOTE: the uffd-wp protect > > information is totally lost due to the pre-unmap in shmem_fallocate(), so the > > page can be writable within the small window). That's severe data loss. > > Sounds like you have a missing page_mkwrite implementation. I think it's a slightly different issue, because shmem may not know whether the page should be allowed to be written or not. AFAIU, uffd-wp is designed and implemented in a way that the final protect information is kept within the ptes, so e.g. vmas do not have a solid understanding of whether a page should be write-protected or not (VM_UFFD_WP in the vma flags is a hint only, and that's also why we won't abuse creating tons of vmas). We tried hard to keep that pte information, majorly _PAGE_UFFD_WP, alive even across swap in/outs and migrations. If the pte is lost, we can't get that information back from the page cache, at least for now. > > > This patch comes from debugging a data loss issue when working on the uffd-wp > > support on shmem/hugetlbfs. I posted this out for early review and comments, > > but also because it should already start to benefit missing mode userfaultfd to > > avoid trying to fault around on reads. > > A measurable difference? Nope, I didn't measure the missing case. It should really depend on whether there's such a use case, imho. If there is, then we may still want that (though uffd-wp might be a different story, as discussed above). Otherwise maybe we should just avoid doing that for all. 
The other thing that led me to this patch (rather than only checking against uffd-wp, in which case I'd just keep that small patch in my own tree until posting the uffd-wp series) is what happens when the community introduces things like "minor faults" for userfaultfd. In that case we'd need to be careful here too, otherwise we may lose some minor faults. From that pov I'm wondering whether it's easier to just forbid kernel-side tricks like this for uffd in general: as I stated previously, IMHO if a memory region is registered with userfaultfd, it may be best to always rely on userspace to resolve page faults, so that we don't have to debug things like the uffd-wp data corruption as userfaultfd evolves, because that'll be a non-trivial task, at least for me... Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads 2020-11-27 12:22 ` Matthew Wilcox 2020-11-27 13:51 ` Peter Xu @ 2020-11-28 0:33 ` Andrea Arcangeli 2020-11-28 15:29 ` Peter Xu 1 sibling, 1 reply; 12+ messages in thread From: Andrea Arcangeli @ 2020-11-28 0:33 UTC (permalink / raw) To: Matthew Wilcox Cc: Peter Xu, linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Mike Rapoport Hello, On Fri, Nov 27, 2020 at 12:22:24PM +0000, Matthew Wilcox wrote: > On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote: > > For wr-protected mode uffds, errornously fault in those pages around could lead > > to threads accessing the pages without uffd server's awareness. For example, > > when punching holes on uffd-wp registered shmem regions, we'll first try to > > unmap all the pages before evicting the page cache but without locking the > > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called > > before shmem_truncate_range()). When fault-around happens near a hole being > > punched, we might errornously fault in the "holes" right before it will be > > punched. Then there's a small window before the page cache was finally > > dropped, and after the page will be writable again (NOTE: the uffd-wp protect > > information is totally lost due to the pre-unmap in shmem_fallocate(), so the > > page can be writable within the small window). That's severe data loss. > > Sounds like you have a missing page_mkwrite implementation. If the real fault happened through the pagetable (which got dropped by the hole punching), a "missing type" userfault would be delivered to userland (because the pte would be none). Userland would invoke UFFDIO_COPY with the UFFDIO_COPY_MODE_WP flag. 
Such flag would then map the filled shmem page (not necessarily all zero and not necessarily the old content before the hole punch) with _PAGE_RW not set and _PAGE_UFFD_WP set, so the next write would also trigger a wrprotect userfault (this is what the uffd-wp app expects). filemap_map_pages doesn't notify userland when it fills a pte and it will map again the page read-write. However filemap_map_pages isn't capable to fill a hole and to undo the hole punch, all it can do are minor faults to re-fill the ptes from a not-yet-truncated inode page. Now it would be ok if filemap_map_pages re-filled the ptes, after they were zapped the first time by fallocate, and then the fallocate would zap them again before truncating the page, but I don't see a second pte zap... there's just the below single call of unmap_mapping_range: if ((u64)unmap_end > (u64)unmap_start) unmap_mapping_range(mapping, unmap_start, 1 + unmap_end - unmap_start, 0); shmem_truncate_range(inode, offset, offset + len - 1); So looking at filemap_map_pages in shmem, I'm really wondering if the non-uffd case is correct or not. Do we end up with ptes pointing to non pagecache, so then the virtual mapping is out of sync also with read/write syscalls that will then allocate another shmem page out of sync of the ptes? If a real page fault happened instead of filemap_map_pages, the shmem_fault() would block during fallocate punch hole by checking inode->i_private, as the comment says: * So refrain from * faulting pages into the hole while it's being punched. It's not immediately clear where filemap_map_pages refrains from faulting pages into the hole while it's being punched... given it's ignoring inode->i_private. So I'm not exactly sure how shmem can safely faulted in through filemap_map_pages, without going through shmem_fault... I suppose shmem simply is unsafe to use filemap_map_pages and it'd require a specific shmem_map_pages? 
If only filemap_map_pages refrained from re-faulting ptes of a shmem page that is about to be truncated (whose original ptes had _PAGE_RW unset and _PAGE_UFFD_WP set), there would be no problem with the uffd interaction. So a proper shmem_map_pages could co-exist with uffd; the userfaultfd_armed check would be only an optimization, but it wouldn't be required to avoid userland memory corruption?

From 8c6fb1b7dde309f0c8b5666a8e13557ae46369b4 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Fri, 27 Nov 2020 19:12:44 -0500
Subject: [PATCH 1/1] shmem: stop faulting in pages without checking
 inode->i_private

Per the shmem_fault comment, shmem needs to "refrain from faulting pages into the hole while it's being punched", and to do so it must check inode->i_private, which filemap_map_pages won't, so it's unsafe to use in shmem because it can leave ptes pointing to non-pagecache pages in shmem-backed vmas.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/shmem.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8e2b35ba93ad..f6f29b3e67c6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3942,7 +3942,6 @@ static const struct super_operations shmem_ops = {
 
 static const struct vm_operations_struct shmem_vm_ops = {
 	.fault = shmem_fault,
-	.map_pages = filemap_map_pages,
 #ifdef CONFIG_NUMA
 	.set_policy = shmem_set_policy,
 	.get_policy = shmem_get_policy,

Thanks,
Andrea

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads 2020-11-28 0:33 ` Andrea Arcangeli @ 2020-11-28 15:29 ` Peter Xu 2020-12-01 19:08 ` Andrea Arcangeli 2020-12-01 21:31 ` Hugh Dickins 0 siblings, 2 replies; 12+ messages in thread From: Peter Xu @ 2020-11-28 15:29 UTC (permalink / raw) To: Andrea Arcangeli Cc: Matthew Wilcox, linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Mike Rapoport On Fri, Nov 27, 2020 at 07:33:39PM -0500, Andrea Arcangeli wrote: > Hello, Hi, Andrea! > > On Fri, Nov 27, 2020 at 12:22:24PM +0000, Matthew Wilcox wrote: > > On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote: > > > For wr-protected mode uffds, errornously fault in those pages around could lead > > > to threads accessing the pages without uffd server's awareness. For example, > > > when punching holes on uffd-wp registered shmem regions, we'll first try to > > > unmap all the pages before evicting the page cache but without locking the > > > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called > > > before shmem_truncate_range()). When fault-around happens near a hole being > > > punched, we might errornously fault in the "holes" right before it will be > > > punched. Then there's a small window before the page cache was finally > > > dropped, and after the page will be writable again (NOTE: the uffd-wp protect > > > information is totally lost due to the pre-unmap in shmem_fallocate(), so the > > > page can be writable within the small window). That's severe data loss. > > > > Sounds like you have a missing page_mkwrite implementation. > > If the real fault happened through the pagetable (which got dropped by > the hole punching), a "missing type" userfault would be delivered to > userland (because the pte would be none). Userland would invoke > UFFDIO_COPY with the UFFDIO_COPY_MODE_WP flag. 
Such flag would then > map the filled shmem page (not necessarily all zero and not > necessarily the old content before the hole punch) with _PAGE_RW not > set and _PAGE_UFFD_WP set, so the next write would also trigger a > wrprotect userfault (this is what the uffd-wp app expects). > > filemap_map_pages doesn't notify userland when it fills a pte and it > will map again the page read-write. Yes. A trivial supplementary detail is that filemap_map_pages() will only set it read-only since alloc_set_pte() will only set write bit if it's a write. In our case it's read fault-around so without it. However it'll be quickly marked as writable as long as the thread quickly writes to it again. > > However filemap_map_pages isn't capable to fill a hole and to undo the > hole punch, all it can do are minor faults to re-fill the ptes from a > not-yet-truncated inode page. > > Now it would be ok if filemap_map_pages re-filled the ptes, after they > were zapped the first time by fallocate, and then the fallocate would > zap them again before truncating the page, but I don't see a second > pte zap... there's just the below single call of unmap_mapping_range: > > if ((u64)unmap_end > (u64)unmap_start) > unmap_mapping_range(mapping, unmap_start, > 1 + unmap_end - unmap_start, 0); > shmem_truncate_range(inode, offset, offset + len - 1); IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call truncate_inode_page() then onto truncate_cleanup_page(). Since we're at it, some more context: this is actually where I started to notice the issue, that we'll try to pre-unmap the whole region first before shmem_truncate_range(). The thing is shmem_truncate_range() will zap the ptes too, in an even safer way (with page lock held). 
So before I came up with the current patch, I also tried the below patch, and it also fixes the data corruption issue:

-----8<-----
diff --git a/mm/shmem.c b/mm/shmem.c
index efa19e33e470..b275f401fe1f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	inode_lock(inode);
 
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
-		struct address_space *mapping = file->f_mapping;
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
@@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		inode->i_private = &shmem_falloc;
 		spin_unlock(&inode->i_lock);
 
-		if ((u64)unmap_end > (u64)unmap_start)
-			unmap_mapping_range(mapping, unmap_start,
-					    1 + unmap_end - unmap_start, 0);
 		shmem_truncate_range(inode, offset, offset + len - 1);
 		/* No need to unmap again: hole-punching leaves COWed pages */
-----8<-----

This code has existed since the first day of the shmem fallocate code, so I was wondering why we did it that way. One reason could be performance: the explicit pre-unmap via unmap_mapping_range() walks the page tables once rather than once per page, so when the hole is huge it could benefit us, since truncate_cleanup_page() can then avoid those per-page walks because page_mapped() is more likely to be zero:

	if (page_mapped(page)) {
		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
		unmap_mapping_pages(mapping, page->index, nr, false);
	}

It would be good if Hugh could confirm whether my understanding is correct, though..

As a summary, that's why I didn't try to remove the optimization (which fixes the issue too) but instead tried to disable fault-around for uffd, which sounds like a better thing to do.

> > So looking at filemap_map_pages in shmem, I'm really wondering if the > non-uffd case is correct or not. 
> > Do we end up with ptes pointing to non pagecache, so then the virtual > mapping is out of sync also with read/write syscalls that will then > allocate another shmem page out of sync of the ptes? > > If a real page fault happened instead of filemap_map_pages, the > shmem_fault() would block during fallocate punch hole by checking > inode->i_private, as the comment says: > > * So refrain from > * faulting pages into the hole while it's being punched. > > It's not immediately clear where filemap_map_pages refrains from > faulting pages into the hole while it's being punched... given it's > ignoring inode->i_private. > > So I'm not exactly sure how shmem can safely faulted in through > filemap_map_pages, without going through shmem_fault... I suppose > shmem simply is unsafe to use filemap_map_pages and it'd require > a specific shmem_map_pages? > > If only filemap_map_pages was refraining re-faulting ptes of a shmem > page that is about to be truncated (whose original ptes had _PAGE_RW > unset and _PAGE_UFFD_WP set) there would be no problem with the uffd > interaction. So a proper shmem_map_pages could co-exist with uffd, the > userfaultfd_armed check would be only an optimization but it wouldn't > be required to avoid userland memory corruption? So far I still think it's necessary to have this patch, or an equivalent, to avoid fault-around. As discussed above, the current map_pages of shmem (which is the general filemap_map_pages) should be safe without uffd as long as we remove the ptes correctly and with the page lock held (that is exactly what we do right now in shmem_truncate_range). But since we have that optimization on pre-unmap, uffd-wp is not safe any more, because of the race that this patch tries to avoid. And if my understanding above is correct, we may still want to keep the shmem fault-around logic; it just may not be suitable for uffd-wp, or uffd in general. Thanks! 
> > From 8c6fb1b7dde309f0c8b5666a8e13557ae46369b4 Mon Sep 17 00:00:00 2001 > From: Andrea Arcangeli <aarcange@redhat.com> > Date: Fri, 27 Nov 2020 19:12:44 -0500 > Subject: [PATCH 1/1] shmem: stop faulting in pages without checking > inode->i_private > > Per shmem_fault comment shmem need to "refrain from faulting pages > into the hole while it's being punched" and to do so it must check > inode->i_private, which filemap_map_pages won't so it's unsafe to use > in shmem because it can leave ptes pointing to non-pagecache pages in > shmem backed vmas. > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > --- > mm/shmem.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/mm/shmem.c b/mm/shmem.c > index 8e2b35ba93ad..f6f29b3e67c6 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -3942,7 +3942,6 @@ static const struct super_operations shmem_ops = { > > static const struct vm_operations_struct shmem_vm_ops = { > .fault = shmem_fault, > - .map_pages = filemap_map_pages, > #ifdef CONFIG_NUMA > .set_policy = shmem_set_policy, > .get_policy = shmem_get_policy, > > > Thanks, > Andrea > -- Peter Xu ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
  2020-11-28 15:29 ` Peter Xu
@ 2020-12-01 19:08 ` Andrea Arcangeli
  2020-12-01 21:31 ` Hugh Dickins
  1 sibling, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2020-12-01 19:08 UTC (permalink / raw)
To: Peter Xu
Cc: Matthew Wilcox, linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Mike Rapoport

Hi Peter,

On Sat, Nov 28, 2020 at 10:29:03AM -0500, Peter Xu wrote:
> Yes. A trivial supplementary detail is that filemap_map_pages() will only set it read-only since alloc_set_pte() will only set the write bit if it's a write. In our case it's a read fault-around so without it. However it'll be quickly marked as writable as long as the thread quickly writes to it again.

The problem here is that all the information injected into the kernel through the UFFDIO_WRITEPROTECT syscalls is wiped by the quoted unmap_mapping_range. We can later discuss how we actually check _PAGE_UFFD_WP, but that's not the first thing to solve here. Here we need to find a way to retain _PAGE_UFFD_WP until the page is actually locked and is atomically disappearing from the xarray index, as far as further shmem page faults are concerned.

Adding that information to the xarray isn't ideal because it's mm-specific and not a property of the "pagecache", but keeping that information in the pagetables as we do for anon memory also causes issues, as shown below, because shmem pagetables are supposed to be zapped at any time and re-created on the fly based on the xarray. By contrast, that never happens with anon memory.

For example, when a shmem page is swapped out, the swap entry is written into the xarray, not into the pagetable; how are we going to let _PAGE_UFFD_WP survive swapping of a shmem page if we store _PAGE_UFFD_WP in the pte? Did you test swap?
So here I'm afraid there's something more fundamental in how to have _PAGE_UFFD_WP survive swapping and a random pte zap; we need more infrastructure in unmap_mapping_range to retain that information so that swapping works too. So I don't think we can evaluate this patch without seeing the full patchset and how the rest is being handled. Still, it's great you found the source of the corruption for the current shmem uffd-wp codebase so we can move forward!

> > However filemap_map_pages isn't capable to fill a hole and to undo the hole punch, all it can do are minor faults to re-fill the ptes from a not-yet-truncated inode page.
> >
> > Now it would be ok if filemap_map_pages re-filled the ptes, after they were zapped the first time by fallocate, and then the fallocate would zap them again before truncating the page, but I don't see a second pte zap... there's just the below single call of unmap_mapping_range:
> >
> > 	if ((u64)unmap_end > (u64)unmap_start)
> > 		unmap_mapping_range(mapping, unmap_start,
> > 				    1 + unmap_end - unmap_start, 0);
> > 	shmem_truncate_range(inode, offset, offset + len - 1);
>
> IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call truncate_inode_page() then onto truncate_cleanup_page().

I wrongly assumed there was only one invalidate, and I guessed that was what caused the problem, since it ran outside the page lock and so wasn't "atomic" with respect to the page truncation in the xarray.

The comment "refrain from faulting pages into the hole while it's being punched" and the "inode->i_private waitqueue" logic are just there to slowly throttle the faulting so the truncation is faster. It's still not entirely clear why it's needed, since shmem_getpage_gfp uses find_lock_entry and the page had to be locked. I didn't look at the history to see the order in which those changes were added, to check if it's still needed.
Anyway, it was already suspect that "unlikely(inode->i_private)" would be enough as an out-of-order check to fully serialize things, but for a performance "throttling" logic it's fine and non-concerning (worst case it is not needed, and the page lock waitqueue would have serialized the faults against the slow invalidate loops already).

> Since we're at it, some more context: this is actually where I started to notice the issue, that we'll try to pre-unmap the whole region first before shmem_truncate_range(). The thing is shmem_truncate_range() will zap the ptes too, in an even safer way (with the page lock held). So before I came up with the current patch, I also tried the patch below, and it also fixes the data corruption issue:
>
> -----8<-----
> diff --git a/mm/shmem.c b/mm/shmem.c
> index efa19e33e470..b275f401fe1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> 	inode_lock(inode);
>
> 	if (mode & FALLOC_FL_PUNCH_HOLE) {
> -		struct address_space *mapping = file->f_mapping;
> 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
> 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
> 		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
> @@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> 		inode->i_private = &shmem_falloc;
> 		spin_unlock(&inode->i_lock);
>
> -		if ((u64)unmap_end > (u64)unmap_start)
> -			unmap_mapping_range(mapping, unmap_start,
> -					    1 + unmap_end - unmap_start, 0);
> 		shmem_truncate_range(inode, offset, offset + len - 1);
> 		/* No need to unmap again: hole-punching leaves COWed pages */
> -----8<-----
>
> The above code has existed since the first day of the shmem fallocate code, so I was wondering why we did that.
> One reason could be performance: this pre-unmap via an explicit unmap_mapping_range() will walk the page tables once rather than walking for every single page, so when the hole is huge it could benefit us, since truncate_cleanup_page() is able to avoid those per-page walks because page_mapped() is then more likely to be zero:
>
> 	if (page_mapped(page)) {
> 		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
> 		unmap_mapping_pages(mapping, page->index, nr, false);
> 	}
>
> It would be good if Hugh can help confirm whether my understanding is correct, though..
>
> As a summary, that's why I didn't try to remove the optimization (which fixes the issue too) but instead tried to disable fault-around for uffd, which sounds like a better thing to do.

I'm afraid the real problem is the above invalidate though, and not the invalidate itself but the fact that uffd-wp shmem cannot cope with it; we need to find a way to have the _PAGE_UFFD_WP information survive even without applying your above patch.

> So far I still think it's necessary to have this patch or an equivalent to avoid fault-around.
>
> As discussed above, the current map_pages of shmem (which is the generic filemap_map_pages) should be safe without uffd as long as we remove the ptes correctly, with the page lock held (that should be exactly what we do right now in shmem_truncate_range).
>
> But since we have that pre-unmap optimization, uffd-wp is not safe any more, because of the race that this patch tries to avoid.
>
> And if my understanding above is correct, we may still want to keep the shmem fault-around logic; it just may not be suitable for uffd-wp, or for uffd in general.

Since there's the second invalidate in truncate_inode_page under the page lock, I agree that means filemap_map_pages is perfectly safe for shmem. However, it also means filemap_map_pages is a 100% bypass as far as userfaultfd is concerned.
That applies to all modes, including missing faults, not just wp faults. In fact, no matter the workload, given the current semantics of uffd on file-backed mappings, it's always guaranteed there cannot be even a single extra userfault, regardless of whether filemap_map_pages is active or inactive in shmem, hugetlbfs or any other filesystem. So it's not intuitive why it shouldn't be suitable for uffd, given its presence is also completely unnoticeable to userland. Despite the fault-around name, it's not a real fault.

If David later sends the work to handle minor faults (so that the huge final dirty bitmap can be delivered to the destination node while the guest already runs on the destination), the above will change, because the uffd semantics will be extended to cover minor faults. But current upstream only triggers a userfault on a file hole, so major faults only, and filemap_map_pages by definition cannot fill a file hole. So again, it's not clear why we should even care whether filemap_map_pages is active or not when uffd is armed, no matter if in wp or missing mode.

More interesting is how to retain _PAGE_UFFD_WP during swapping, which will get rid of all the pagetables too: that looks like a much wider loss of info than the above race, and it looks like the same issue. I didn't see the full patchset yet, so I cannot tell if you solved swapping some other way already, but until that is clear the above looks like a minor concern, and whatever solution works to retain the _PAGE_UFFD_WP information during shmem swapping should also solve the above, without extra changes to filemap_map_pages and the fault-around logic when the uffd is armed.

What I'm really saying is that there's no point in applying this patch until we see the full patchset of shmem uffd-wp, and then it's possible to evaluate whether there are no other losses of the _PAGE_UFFD_WP bit.
Anon memory is completely different: it's impossible to lose _PAGE_UFFD_WP there, since the swap entry format contains it; for shmem the pte is zeroed instead during swap.

Thanks!
Andrea

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
  2020-11-28 15:29 ` Peter Xu
  2020-12-01 19:08 ` Andrea Arcangeli
@ 2020-12-01 21:31 ` Hugh Dickins
  2020-12-01 23:42 ` Andrea Arcangeli
  1 sibling, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2020-12-01 21:31 UTC (permalink / raw)
To: Peter Xu
Cc: Andrea Arcangeli, Matthew Wilcox, linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Mike Rapoport

On Sat, 28 Nov 2020, Peter Xu wrote:
> On Fri, Nov 27, 2020 at 07:33:39PM -0500, Andrea Arcangeli wrote:
> >
> > Now it would be ok if filemap_map_pages re-filled the ptes, after they were zapped the first time by fallocate, and then the fallocate would zap them again before truncating the page, but I don't see a second pte zap... there's just the below single call of unmap_mapping_range:
> >
> > 	if ((u64)unmap_end > (u64)unmap_start)
> > 		unmap_mapping_range(mapping, unmap_start,
> > 				    1 + unmap_end - unmap_start, 0);
> > 	shmem_truncate_range(inode, offset, offset + len - 1);
>
> IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call truncate_inode_page() then onto truncate_cleanup_page().

Correct.

> Since we're at it, some more context: this is actually where I started to notice the issue, that we'll try to pre-unmap the whole region first before shmem_truncate_range(). The thing is shmem_truncate_range() will zap the ptes too, in an even safer way (with the page lock held).
> So before I came up with the current patch, I also tried the patch below, and it also fixes the data corruption issue:
>
> -----8<-----
> diff --git a/mm/shmem.c b/mm/shmem.c
> index efa19e33e470..b275f401fe1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> 	inode_lock(inode);
>
> 	if (mode & FALLOC_FL_PUNCH_HOLE) {
> -		struct address_space *mapping = file->f_mapping;
> 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
> 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
> 		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
> @@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> 		inode->i_private = &shmem_falloc;
> 		spin_unlock(&inode->i_lock);
>
> -		if ((u64)unmap_end > (u64)unmap_start)
> -			unmap_mapping_range(mapping, unmap_start,
> -					    1 + unmap_end - unmap_start, 0);
> 		shmem_truncate_range(inode, offset, offset + len - 1);
> 		/* No need to unmap again: hole-punching leaves COWed pages */
> -----8<-----
>
> The above code has existed since the first day of the shmem fallocate code, so I was wondering why we did that. One reason could be performance: this pre-unmap via an explicit unmap_mapping_range() will walk the page tables once rather than walking for every single page, so when the hole is huge it could benefit us, since truncate_cleanup_page() is able to avoid those per-page walks because page_mapped() is then more likely to be zero:
>
> 	if (page_mapped(page)) {
> 		pgoff_t nr = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
> 		unmap_mapping_pages(mapping, page->index, nr, false);
> 	}
>
> It would be good if Hugh can help confirm whether my understanding is correct, though..

Correct. Code duplicated from mm/truncate.c, but with more comments over there, in truncate_pagecache() and in truncate_pagecache_range().
> As a summary, that's why I didn't try to remove the optimization (which fixes the issue too) but instead tried to disable fault-around for uffd, which sounds like a better thing to do.

> > So looking at filemap_map_pages in shmem, I'm really wondering if the non-uffd case is correct or not.
> >
> > Do we end up with ptes pointing to non pagecache, so then the virtual mapping is out of sync also with read/write syscalls that will then allocate another shmem page out of sync of the ptes?

No, as you point out, the second pte zap, under page lock, keeps it safe.

> > If a real page fault happened instead of filemap_map_pages, the shmem_fault() would block during fallocate punch hole by checking inode->i_private, as the comment says:
> >
> >	 * So refrain from
> >	 * faulting pages into the hole while it's being punched.
> >
> > It's not immediately clear where filemap_map_pages refrains from faulting pages into the hole while it's being punched... given it's ignoring inode->i_private.

Please don't ever rely on that i_private business for correctness: as more of the comment around "So refrain from" explains, it was added to avoid a livelock with the trinity fuzzer, and I'd gladly delete it all, if I could ever be confident that enough has changed in the intervening years that it's no longer necessary.

It tends to prevent shmem faults racing hole punches in the same area (without quite guaranteeing no race at all). But without the livelock issue, I'd just have gone on letting them race. "Punch hole" == "Lose data" - though I can imagine that UFFD would like more control over that. Since map_pages is driven by faulting, it should already be throttled too.

But Andrea in the next mail goes on to see other issues with UFFD_WP in relation to shmem swap, so this thread is probably dead now.
If there's a bit to spare for UFFD_WP in the anonymous swap entry, then that bit should be usable for shmem too: but a shmem page (and its swap entry) can be shared between many different users, so I don't know whether that will make sense.

Hugh

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-01 21:31 ` Hugh Dickins
@ 2020-12-01 23:42 ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2020-12-01 23:42 UTC (permalink / raw)
To: Hugh Dickins
Cc: Peter Xu, Matthew Wilcox, linux-kernel, linux-mm, Andrew Morton, Mike Rapoport

Hi Hugh,

On Tue, Dec 01, 2020 at 01:31:21PM -0800, Hugh Dickins wrote:
> Please don't ever rely on that i_private business for correctness: as

The out-of-order and lockless "if (inode->i_private)" branch didn't inspire much confidence in terms of being able to rely on it for locking correctness, indeed.

> more of the comment around "So refrain from" explains, it was added to avoid a livelock with the trinity fuzzer, and I'd gladly delete it all, if I could ever be confident that enough has changed in the intervening years that it's no longer necessary.

I was wondering if it's still needed, so now I checked the old code, and it also did an unconditional lock_page in shmem_fault, so I assume it's still necessary.

> It tends to prevent shmem faults racing hole punches in the same area (without quite guaranteeing no race at all). But without the livelock issue, I'd just have gone on letting them race. "Punch hole" == "Lose data" - though I can imagine that UFFD would like more control over that. Since map_pages is driven by faulting, it should already be throttled too.

Yes, I see now it just needs to "eventually" stop the shmem_fault activity; it doesn't need to catch those faults already in flight, so it cannot be relied upon as the form of serialization to zap the pagetables while truncating the page.

> But Andrea in next mail goes on to see other issues with UFFD_WP in relation to shmem swap, so this thread is probably dead now.
> If there's a bit to spare for UFFD_WP in the anonymous swap entry, then that bit should be usable for shmem too: but a shmem page (and its swap entry) can be shared between many different users, so I don't know whether that will make sense.

An UFFD_WP software bit exists both for the present pte and for the swap entry:

	#define _PAGE_UFFD_WP     (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
	#define _PAGE_SWP_UFFD_WP _PAGE_USER

It works similarly to soft dirty: if the page is swapped, _PAGE_UFFD_WP is translated to _PAGE_SWP_UFFD_WP, and the other way around during swapin.

The problem is that the bit represents information that is specific to an mm and a single mapping. If you punch a hole in a shmem file, map the shmem file in two processes A and B, and register the mapped range in a uffd opened by process A using VM_UFFD_MISSING, only process A will generate userfaults if you access the virtual address that corresponds to the hole in the file; process B will still fill the hole normally and it won't trigger userfaults in process A. The uffd context is attached to an mm, and the notification only has effect on that mm, the mm that created the context. Only the UFFDIO_ operations that resolve the fault (like UFFDIO_COPY) have effects visible to all file sharers.

So VM_UFFD_WP shall work the same, and the _PAGE_SWP_UFFD_WP information of a swapped-out shmem page shouldn't go into the xarray, because doing so would share it with all "mm"s sharing the mapping. Currently uffd-wp only works on anon memory, so this is a new challenge in extending uffd-wp to shmem, it seems.

If any pagetable of an anonymous memory mapping is zapped, then it's not just the _PAGE_SWP_UFFD_WP bit that is lost, the whole data is lost too, so it'd break more than just uffd-wp. With shmem, by contrast, the ptes can be zapped at any time by memory pressure alone, even before the final ->writepage swap-out is invoked.
It's always possible to make uffd-wp work without those bits (anon memory would also work without those bits), but the reason those bits exist is to avoid a flood of UFFDIO_WRITEPROTECT ioctl false-negatives after fork or after swapping (in the anon case). In the shmem case, if we did it without a bit flag to track which page was wrprotected in which vma, the uffd handler would be required to mark each pte writable with a syscall every time a new pte has been established on the range and there's a write to the page.

One possibility could be to store the bit set by UFFDIO_WRITEPROTECT in a vmalloc'd bitmap associated with the shmem vma at uffd registration time, so it can be checked during shmem_fault() if VM_UFFD_WP is set on the vma? The vma_split and vma_merge handling would be the hard part.

Another possibility would be to change how pte invalidation works for vmas with VM_UFFD_WP set: the pte and the _PAGE_UFFD_WP would need to survive all invalidates; only the final truncation under the page lock would really be allowed to clear the whole pte, destroy the _PAGE_UFFD_WP information in it, and then free the pgtables on those ranges.

Once we find a way to make that bit survive the invalidates, we can check it in shmem_fault to decide whether to invoke handle_userfault if the bit is set and FAULT_FLAG_WRITE is set in vmf->flags, or to map the page wrprotected at that vaddr if the bit is set and FAULT_FLAG_WRITE is not set.

Thanks,
Andrea

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
  2020-11-26 22:23 [PATCH] mm: Don't fault around userfaultfd-registered regions on reads Peter Xu
  2020-11-27  8:16 ` Mike Rapoport
  2020-11-27 12:22 ` Matthew Wilcox
@ 2020-11-27 17:00 ` David Hildenbrand
  2020-11-27 18:28 ` Peter Xu
  2 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2020-11-27 17:00 UTC (permalink / raw)
To: Peter Xu, linux-kernel, linux-mm
Cc: Andrew Morton, Hugh Dickins, Andrea Arcangeli, Mike Rapoport

On 26.11.20 23:23, Peter Xu wrote:
> Faulting around for reads are in most cases helpful for the performance so that continuous memory accesses may avoid another trip of page fault. However it may not always work as expected.
>
> For example, userfaultfd registered regions may not be the best candidate for pre-faults around the reads.

Are we getting uffd faults even though no one actually accessed it? So in the case where I would track what some part of my program actually reads, I would get wrong notifications?

-- 
Thanks,

David / dhildenb

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
  2020-11-27 17:00 ` David Hildenbrand
@ 2020-11-27 18:28 ` Peter Xu
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2020-11-27 18:28 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Mike Rapoport

On Fri, Nov 27, 2020 at 06:00:44PM +0100, David Hildenbrand wrote:
> On 26.11.20 23:23, Peter Xu wrote:
> > Faulting around for reads are in most cases helpful for the performance so that continuous memory accesses may avoid another trip of page fault. However it may not always work as expected.
> >
> > For example, userfaultfd registered regions may not be the best candidate for pre-faults around the reads.
>
> Are we getting uffd faults even though no-one actually accessed it?

Userfaultfd should only notify if someone actually accessed it.

> So in case I would track what some part of my program actually reads, I would get wrong notifications?

For shmem, we can't track that right now, afaict. The message is only generated if the page cache is not there, and tracking page reads when there is no page cache wouldn't help, because the pages are all zeros.

Thanks,

-- 
Peter Xu

^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-12-01 23:42 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-26 22:23 [PATCH] mm: Don't fault around userfaultfd-registered regions on reads Peter Xu
2020-11-27  8:16 ` Mike Rapoport
2020-11-27 13:31   ` Peter Xu
2020-11-27 12:22 ` Matthew Wilcox
2020-11-27 13:51   ` Peter Xu
2020-11-28  0:33     ` Andrea Arcangeli
2020-11-28 15:29       ` Peter Xu
2020-12-01 19:08         ` Andrea Arcangeli
2020-12-01 21:31         ` Hugh Dickins
2020-12-01 23:42           ` Andrea Arcangeli
2020-11-27 17:00 ` David Hildenbrand
2020-11-27 18:28   ` Peter Xu