linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
@ 2020-11-30 23:06 Peter Xu
  2020-12-01  9:30 ` David Hildenbrand
  2020-12-01 12:59 ` Matthew Wilcox
  0 siblings, 2 replies; 18+ messages in thread
From: Peter Xu @ 2020-11-30 23:06 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, peterx, Hugh Dickins, Andrea Arcangeli,
	Mike Rapoport, Matthew Wilcox

Faulting around on reads is in most cases helpful for performance, so that
accesses to nearby memory may avoid another trip through the page fault
handler.  However, it may not always work as expected.

For example, userfaultfd-registered regions may not be the best candidates for
pre-faulting around reads.

For missing mode uffds, fault around does not help because if the page cache
existed, then the page should be there already.  If the page cache is not
there, there is nothing else we can do either.  If the fault-around code is
destined to be helpless for userfault-missing vmas, then ideally we can skip it.

For wr-protected mode uffds, erroneously faulting in those pages could lead to
threads accessing the pages without the uffd server's awareness.  For example,
when punching holes in uffd-wp registered shmem regions, we first try to unmap
all the pages before evicting the page cache, but without locking the pages
(please refer to shmem_fallocate(), where unmap_mapping_range() is called
before shmem_truncate_range()).  When fault-around happens near a hole being
punched, we might erroneously fault in the "holes" right before they are
punched.  That leaves a small window after the pages become writable again and
before the page cache is finally dropped (NOTE: the uffd-wp protection
information is totally lost due to the pre-unmap in shmem_fallocate(), so the
pages can be written within that window).  That's severe data loss.

Let's grant userspace full control of the uffd-registered ranges, rather than
trying to apply such tricks.

Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---

v2:
- use userfaultfd_armed() directly [Mike]

Note that since there is no file-backed uffd-wp support upstream yet, the
uffd-wp check is not really functional for now.  However, since all the
necessary uffd-wp concepts are already upstream, it may be better to do it once
and for all.

This patch comes from debugging a data loss issue while working on the uffd-wp
support for shmem/hugetlbfs.  I'm posting it for early review and comments, but
also because it should already benefit missing mode userfaultfd by avoiding
fault-around on reads.
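
(For context on the registration side: a minimal userspace sketch of arming a
range for both missing and write-protect tracking could look like the below.
It is illustrative only - error handling is trimmed, the helper name is made
up, and, per the note above, WP registration of file-backed memory is not
supported upstream yet.)

#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Arm [addr, addr+len) for MISSING + WP events; returns the uffd or -1. */
static int uffd_register_range(void *addr, size_t len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP,
	};
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) == -1 ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
		return -1;
	return uffd;	/* fault events are then read(2) from this fd */
}
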
---
 mm/memory.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index eeae590e526a..59b2be22565e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3933,6 +3933,23 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
 	int off;
 	vm_fault_t ret = 0;
 
+	/*
+	 * Be extremely careful with uffd-armed regions.
+	 *
+	 * For missing mode uffds, fault around does not help because if the
+	 * page cache existed, then the page should be there already.  If the
+	 * page cache is not there, there is nothing else we can do either.
+	 *
+	 * For wr-protected mode uffds, erroneously faulting in those pages
+	 * could lead to threads accessing them without the uffd server's
+	 * awareness, which could eventually cause silent data corruption.
+	 *
+	 * The idea is that userspace should govern, for every single page of
+	 * an uffd region, which pages get faulted in.
+	 */
+	if (unlikely(userfaultfd_armed(vmf->vma)))
+		return 0;
+
 	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
 	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
 
-- 
2.26.2



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-11-30 23:06 [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads Peter Xu
@ 2020-12-01  9:30 ` David Hildenbrand
  2020-12-01 12:59 ` Matthew Wilcox
  1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-12-01  9:30 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Andrew Morton, Hugh Dickins, Andrea Arcangeli, Mike Rapoport,
	Matthew Wilcox

On 01.12.20 00:06, Peter Xu wrote:
> Faulting around for reads are in most cases helpful for the performance so that
> continuous memory accesses may avoid another trip of page fault.  However it
> may not always work as expected.
> 
> For example, userfaultfd registered regions may not be the best candidate for
> pre-faults around the reads.
> 
> For missing mode uffds, fault around does not help because if the page cache
> existed, then the page should be there already.  If the page cache is not
> there, nothing else we can do, either.  If the fault-around code is destined to
> be helpless for userfault-missing vmas, then ideally we can skip it.
> 
> For wr-protected mode uffds, errornously fault in those pages around could lead
> to threads accessing the pages without uffd server's awareness.  For example,
> when punching holes on uffd-wp registered shmem regions, we'll first try to
> unmap all the pages before evicting the page cache but without locking the
> page (please refer to shmem_fallocate(), where unmap_mapping_range() is called
> before shmem_truncate_range()).  When fault-around happens near a hole being
> punched, we might errornously fault in the "holes" right before it will be
> punched.  Then there's a small window before the page cache was finally
> dropped, and after the page will be writable again (NOTE: the uffd-wp protect
> information is totally lost due to the pre-unmap in shmem_fallocate(), so the
> page can be writable within the small window).  That's severe data loss.
> 
> Let's grant the userspace full control of the uffd-registered ranges, rather
> than trying to do the tricks.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> 
> v2:
> - use userfaultfd_armed() directly [Mike]
> 
> Note that since no file-backed uffd-wp support is there yet upstream, so the
> uffd-wp check is actually not really functioning.  However since we have all
> the necessary uffd-wp concepts already upstream, maybe it's better to do it
> once and for all.
> 
> This patch comes from debugging a data loss issue when working on the uffd-wp
> support on shmem/hugetlbfs.  I posted this out for early review and comments,
> but also because it should already start to benefit missing mode userfaultfd to
> avoid trying to fault around on reads.
> ---
>  mm/memory.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index eeae590e526a..59b2be22565e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3933,6 +3933,23 @@ static vm_fault_t do_fault_around(struct vm_fault *vmf)
>  	int off;
>  	vm_fault_t ret = 0;
>  
> +	/*
> +	 * Be extremely careful with uffd-armed regions.
> +	 *
> +	 * For missing mode uffds, fault around does not help because if the
> +	 * page cache existed, then the page should be there already.  If the
> +	 * page cache is not there, nothing else we can do either.
> +	 *
> +	 * For wr-protected mode uffds, errornously fault in those pages around
> +	 * could lead to threads accessing the pages without uffd server's
> +	 * awareness, finally it could cause ghostly data corruption.
> +	 *
> +	 * The idea is that, every single page of uffd regions should be
> +	 * governed by the userspace on which page to fault in.
> +	 */
> +	if (unlikely(userfaultfd_armed(vmf->vma)))
> +		return 0;
> +
>  	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
>  	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
>  
> 

Thanks for the clarifying comment.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-11-30 23:06 [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads Peter Xu
  2020-12-01  9:30 ` David Hildenbrand
@ 2020-12-01 12:59 ` Matthew Wilcox
  2020-12-01 22:30   ` Peter Xu
  1 sibling, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2020-12-01 12:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Mike Rapoport

On Mon, Nov 30, 2020 at 06:06:03PM -0500, Peter Xu wrote:
> Faulting around for reads are in most cases helpful for the performance so that
> continuous memory accesses may avoid another trip of page fault.  However it
> may not always work as expected.
> 
> For example, userfaultfd registered regions may not be the best candidate for
> pre-faults around the reads.
> 
> For missing mode uffds, fault around does not help because if the page cache
> existed, then the page should be there already.  If the page cache is not
> there, nothing else we can do, either.  If the fault-around code is destined to
> be helpless for userfault-missing vmas, then ideally we can skip it.

This sounds like you're thinking of a file which has exactly one user.
If there are multiple processes mapping the same file, then no, there's
no reason to expect a page to be already present in the page table,
just because it's present in the page cache.

> For wr-protected mode uffds, errornously fault in those pages around could lead
> to threads accessing the pages without uffd server's awareness.  For example,
> when punching holes on uffd-wp registered shmem regions, we'll first try to
> unmap all the pages before evicting the page cache but without locking the
> page (please refer to shmem_fallocate(), where unmap_mapping_range() is called
> before shmem_truncate_range()).  When fault-around happens near a hole being
> punched, we might errornously fault in the "holes" right before it will be
> punched.  Then there's a small window before the page cache was finally
> dropped, and after the page will be writable again (NOTE: the uffd-wp protect
> information is totally lost due to the pre-unmap in shmem_fallocate(), so the
> page can be writable within the small window).  That's severe data loss.

This still doesn't make sense.  If the page is Uptodate in the page
cache, then userspace gets to access it.  If you don't want the page to
be accessible, ClearPageUptodate().  read() can also access it if it's
marked Uptodate.  A write fault on a page will call the filesystem's
page_mkwrite() and you can block it there.
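
(For reference, the hook being referred to is the .page_mkwrite member of
struct vm_operations_struct.  A minimal sketch of such a hook, modelled loosely
on filemap_page_mkwrite(), is below; the "block the write" policy itself is
left as a placeholder comment, since what exactly to wait for is the open
question in this thread.)

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/* Sketch only: intercept a write fault on a shared file page.  Returning
 * VM_FAULT_LOCKED lets the write proceed with the page locked; returning
 * VM_FAULT_NOPAGE makes the fault path bail out and retry the access. */
static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vmf->vma->vm_file);

	lock_page(page);
	if (page->mapping != inode->i_mapping) {
		/* page was truncated under us: refuse the write for now */
		unlock_page(page);
		return VM_FAULT_NOPAGE;
	}
	/* a real implementation could block/wait here before allowing it */
	return VM_FAULT_LOCKED;
}

static const struct vm_operations_struct example_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= example_page_mkwrite,
};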



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-01 12:59 ` Matthew Wilcox
@ 2020-12-01 22:30   ` Peter Xu
  2020-12-02  0:02     ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2020-12-01 22:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Mike Rapoport, David Hildenbrand

On Tue, Dec 01, 2020 at 12:59:27PM +0000, Matthew Wilcox wrote:
> On Mon, Nov 30, 2020 at 06:06:03PM -0500, Peter Xu wrote:
> > Faulting around for reads are in most cases helpful for the performance so that
> > continuous memory accesses may avoid another trip of page fault.  However it
> > may not always work as expected.
> > 
> > For example, userfaultfd registered regions may not be the best candidate for
> > pre-faults around the reads.
> > 
> > For missing mode uffds, fault around does not help because if the page cache
> > existed, then the page should be there already.  If the page cache is not
> > there, nothing else we can do, either.  If the fault-around code is destined to
> > be helpless for userfault-missing vmas, then ideally we can skip it.
> 
> This sounds like you're thinking of a file which has exactly one user.
> If there are multiple processes mapping the same file, then no, there's
> no reason to expect a page to be already present in the page table,
> just because it's present in the page cache.
> 
> > For wr-protected mode uffds, errornously fault in those pages around could lead
> > to threads accessing the pages without uffd server's awareness.  For example,
> > when punching holes on uffd-wp registered shmem regions, we'll first try to
> > unmap all the pages before evicting the page cache but without locking the
> > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called
> > before shmem_truncate_range()).  When fault-around happens near a hole being
> > punched, we might errornously fault in the "holes" right before it will be
> > punched.  Then there's a small window before the page cache was finally
> > dropped, and after the page will be writable again (NOTE: the uffd-wp protect
> > information is totally lost due to the pre-unmap in shmem_fallocate(), so the
> > page can be writable within the small window).  That's severe data loss.
> 
> This still doesn't make sense.  If the page is Uptodate in the page
> cache, then userspace gets to access it.  If you don't want the page to
> be accessible, ClearPageUptodate().  read() can also access it if it's
> marked Uptodate.  A write fault on a page will call the filesystem's
> page_mkwrite() and you can block it there.

I still don't think page_mkwrite() could help here... Though Andrea pointed out
a more important issue against the swap cache (in the v1 thread [1]).  Indeed,
if we figure those out, maybe we'll also rethink this patch and it could become
optional; meanwhile, solving that seems to be required to allow shmem swap
in/out with uffd-wp, which I haven't tested yet.  As Hugh pointed out, purely
reusing _PAGE_SWP_UFFD_WP in the swap cache may not work trivially, since
uffd-wp is per-pte rather than per-page, so I probably need to think a bit more
about how to do that...

I don't know whether a patch like this could still be good in the future.  For
now, let's drop this patch until we solve all the rest of the puzzle.

My thanks to all the reviewers, and sorry for the noise!

NAK myself.

[1] https://lore.kernel.org/lkml/alpine.LSU.2.11.2012011250070.1582@eggly.anvils/T/#mef0716b38f4f5fc07b7542f2c11a07535ea31aad

-- 
Peter Xu



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-01 22:30   ` Peter Xu
@ 2020-12-02  0:02     ` Andrea Arcangeli
  2020-12-02 22:37       ` Hugh Dickins
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2020-12-02  0:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Matthew Wilcox, linux-kernel, linux-mm, Andrew Morton,
	Hugh Dickins, Mike Rapoport, David Hildenbrand

On Tue, Dec 01, 2020 at 05:30:33PM -0500, Peter Xu wrote:
> On Tue, Dec 01, 2020 at 12:59:27PM +0000, Matthew Wilcox wrote:
> > On Mon, Nov 30, 2020 at 06:06:03PM -0500, Peter Xu wrote:
> > > Faulting around for reads are in most cases helpful for the performance so that
> > > continuous memory accesses may avoid another trip of page fault.  However it
> > > may not always work as expected.
> > > 
> > > For example, userfaultfd registered regions may not be the best candidate for
> > > pre-faults around the reads.
> > > 
> > > For missing mode uffds, fault around does not help because if the page cache
> > > existed, then the page should be there already.  If the page cache is not
> > > there, nothing else we can do, either.  If the fault-around code is destined to
> > > be helpless for userfault-missing vmas, then ideally we can skip it.
> > 
> > This sounds like you're thinking of a file which has exactly one user.
> > If there are multiple processes mapping the same file, then no, there's
> > no reason to expect a page to be already present in the page table,
> > just because it's present in the page cache.
> > 
> > > For wr-protected mode uffds, errornously fault in those pages around could lead
> > > to threads accessing the pages without uffd server's awareness.  For example,
> > > when punching holes on uffd-wp registered shmem regions, we'll first try to
> > > unmap all the pages before evicting the page cache but without locking the
> > > page (please refer to shmem_fallocate(), where unmap_mapping_range() is called
> > > before shmem_truncate_range()).  When fault-around happens near a hole being
> > > punched, we might errornously fault in the "holes" right before it will be
> > > punched.  Then there's a small window before the page cache was finally
> > > dropped, and after the page will be writable again (NOTE: the uffd-wp protect
> > > information is totally lost due to the pre-unmap in shmem_fallocate(), so the
> > > page can be writable within the small window).  That's severe data loss.
> > 
> > This still doesn't make sense.  If the page is Uptodate in the page
> > cache, then userspace gets to access it.  If you don't want the page to
> > be accessible, ClearPageUptodate().  read() can also access it if it's
> > marked Uptodate.  A write fault on a page will call the filesystem's
> > page_mkwrite() and you can block it there.
> 
> I still don't think the page_mkwrite() could help here... Though Andrea pointed

I tend to agree that page_mkwrite won't help: there's no I/O, no dirty memory
pressure, and not even VM_FAULT_MISSING can work on real filesystems yet.  The
uffd context is associated with certain virtual addresses in the "mm";
read/write syscalls shouldn't notice any difference, just as all the other mms
shouldn't notice anything either.

It should be enough to check the bit in shmem_fault, invoked via ->fault, for
this purpose; the problem is that we need the bit to survive the invalidate.

> out an more important issue against swap cache (in the v1 thread [1]).  Indeed
> if we have those figured out maybe we'll also rethink this patch then it could
> become optional; while that seems to be required to allow shmem swap in/out
> with uffd-wp which I haven't yet tested.  As Hugh pointed out, purely reuse the
> _PAGE_SWP_UFFD_WP in swap cache may not work trivially since uffd-wp is per-pte
> rather than per-page, so I probably need to think a bit more on how to do
> that...
> 
> I don't know whether a patch like this could still be good in the future.  For
> now, let's drop this patch until we solve all the rest of the puzzle.
>
> My thanks to all the reviewers, and sorry for the noise!

Thanks to you Peter! No noise here, it's great progress to have found
the next piece of the puzzle.

Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit survive
the pte invalidates, in a way that remains associated with a certain vaddr in a
single mm (so that mm can shoot itself in the foot if it wants, but it can't
interfere with all the other mms sharing the shmem file), would be welcome...

Andrea



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-02  0:02     ` Andrea Arcangeli
@ 2020-12-02 22:37       ` Hugh Dickins
  2020-12-02 23:41         ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Hugh Dickins @ 2020-12-02 22:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Xu, Matthew Wilcox, linux-kernel, linux-mm, Andrew Morton,
	Hugh Dickins, Mike Rapoport, David Hildenbrand

On Tue, 1 Dec 2020, Andrea Arcangeli wrote:
> 
> Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
> survive the pte invalidates in a way that remains associated to a
> certain vaddr in a single mm (so it can shoot itself in the foot if it
> wants, but it can't interfere with all other mm sharing the shmem
> file) would be welcome...

I think it has to be a new variety of swap-like non_swap_entry() pte,
see include/linux/swapops.h.  Anything else would be more troublesome.

Search for non_swap_entry and for migration_entry, to find places that 
might need to learn about this new variety.

IIUC you only need a single value, no need to carve out another whole
swp_type: could probably be swp_offset 0 of any swp_type other than 0.

Note that fork's copy_page_range() does not "copy ptes where a page
fault will fill them correctly", so would in effect put a pte_none
into the child where the parent has this uffd_wp entry.  I don't know
anything about uffd versus fork, whether that would pose a problem.
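
(To make the suggestion a bit more concrete, a recognizer for such an entry
could be sketched roughly as below.  The helper name and the exact (type,
offset) pair are placeholders for whatever ends up being reserved; the only
constraint stated above is offset 0 with a non-zero type.)

#include <linux/pgtable.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/* Placeholder encoding: swp_offset 0 of a non-zero swp_type. */
#define UFFD_WP_MARKER_TYPE	1
#define UFFD_WP_MARKER_OFFSET	0

/* Hypothetical helper: does this non-present, non-none pte carry only
 * the "this vaddr was uffd-wp protected" information? */
static inline bool pte_is_uffd_wp_marker(pte_t pte)
{
	swp_entry_t entry;

	if (pte_none(pte) || pte_present(pte))
		return false;
	entry = pte_to_swp_entry(pte);
	return swp_type(entry) == UFFD_WP_MARKER_TYPE &&
	       swp_offset(entry) == UFFD_WP_MARKER_OFFSET;
}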

Hugh


* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-02 22:37       ` Hugh Dickins
@ 2020-12-02 23:41         ` Peter Xu
  2020-12-03  5:36           ` Hugh Dickins
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2020-12-02 23:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote:
> On Tue, 1 Dec 2020, Andrea Arcangeli wrote:
> > 
> > Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
> > survive the pte invalidates in a way that remains associated to a
> > certain vaddr in a single mm (so it can shoot itself in the foot if it
> > wants, but it can't interfere with all other mm sharing the shmem
> > file) would be welcome...
> 
> I think it has to be a new variety of swap-like non_swap_entry() pte,
> see include/linux/swapops.h.  Anything else would be more troublesome.
> 
> Search for non_swap_entry and for migration_entry, to find places that 
> might need to learn about this new variety.
> 
> IIUC you only need a single value, no need to carve out another whole
> swp_type: could probably be swp_offset 0 of any swp_type other than 0.
> 
> Note that fork's copy_page_range() does not "copy ptes where a page
> fault will fill them correctly", so would in effect put a pte_none
> into the child where the parent has this uffd_wp entry.  I don't know
> anything about uffd versus fork, whether that would pose a problem.

Thanks for the idea, Hugh!

I thought about something similar today, but instead of swap entries I was
thinking about always filling in the pte with the value "_PAGE_PROTNONE |
_PAGE_UFFD_WP" when e.g. we'd like to zap a page with shmem+uffd-wp.  I feel
like the fundamental idea is similar - we somehow keep the uffd-wp information
in the pte even when it's zapped/swapped-out, so that as long as the shmem
access still traps into the fault handler, we can operate on that pte and read
that information out, e.g. recover the pte into a normal pte (with the
swap/page cache and vma/addr information we'll be able to) and then retry the
fault.

Thanks,

-- 
Peter Xu



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-02 23:41         ` Peter Xu
@ 2020-12-03  5:36           ` Hugh Dickins
  2020-12-03 18:02             ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Hugh Dickins @ 2020-12-03  5:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Andrea Arcangeli, Matthew Wilcox, linux-kernel,
	linux-mm, Andrew Morton, Mike Rapoport, David Hildenbrand

On Wed, 2 Dec 2020, Peter Xu wrote:
> On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote:
> > On Tue, 1 Dec 2020, Andrea Arcangeli wrote:
> > > 
> > > Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
> > > survive the pte invalidates in a way that remains associated to a
> > > certain vaddr in a single mm (so it can shoot itself in the foot if it
> > > wants, but it can't interfere with all other mm sharing the shmem
> > > file) would be welcome...
> > 
> > I think it has to be a new variety of swap-like non_swap_entry() pte,
> > see include/linux/swapops.h.  Anything else would be more troublesome.
> > 
> > Search for non_swap_entry and for migration_entry, to find places that 
> > might need to learn about this new variety.
> > 
> > IIUC you only need a single value, no need to carve out another whole
> > swp_type: could probably be swp_offset 0 of any swp_type other than 0.
> > 
> > Note that fork's copy_page_range() does not "copy ptes where a page
> > fault will fill them correctly", so would in effect put a pte_none
> > into the child where the parent has this uffd_wp entry.  I don't know
> > anything about uffd versus fork, whether that would pose a problem.
> 
> Thanks for the idea, Hugh!
> 
> I thought about something similar today, but instead of swap entries, I was
> thinking about constantly filling in a pte with a value of "_PAGE_PROTNONE |
> _PAGE_UFFD_WP" when e.g. we'd like to zap a page with shmem+uffd-wp. I feel
> like the fundamental idea is similar - we can somehow keep the pte with uffd-wp
> information even if zapped/swapped-out, so as long as the shmem access will
> fruther trap into the fault handler, then we can operate on that pte and read
> that information out, like recover that pte into a normal pte (with swap/page
> cache, and vma/addr information, we'll be able to) and then we can retry the
> fault.

Yes, I think that should work too: I can't predict which way would cause
less trouble.

We usually tend to keep away from protnone games, because NUMA balancing
use of protnone is already confusing enough.

But those ptes will be pte_present(), so you must provide a pfn, and I
think if you use the zero_pfn, vm_normal_page() will return false on it,
and avoid special casing (and reference counting) it in various places.

Hugh


* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-03  5:36           ` Hugh Dickins
@ 2020-12-03 18:02             ` Peter Xu
  2020-12-03 19:44               ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2020-12-03 18:02 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Wed, Dec 02, 2020 at 09:36:45PM -0800, Hugh Dickins wrote:
> On Wed, 2 Dec 2020, Peter Xu wrote:
> > On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote:
> > > On Tue, 1 Dec 2020, Andrea Arcangeli wrote:
> > > > 
> > > > Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
> > > > survive the pte invalidates in a way that remains associated to a
> > > > certain vaddr in a single mm (so it can shoot itself in the foot if it
> > > > wants, but it can't interfere with all other mm sharing the shmem
> > > > file) would be welcome...
> > > 
> > > I think it has to be a new variety of swap-like non_swap_entry() pte,
> > > see include/linux/swapops.h.  Anything else would be more troublesome.
> > > 
> > > Search for non_swap_entry and for migration_entry, to find places that 
> > > might need to learn about this new variety.
> > > 
> > > IIUC you only need a single value, no need to carve out another whole
> > > swp_type: could probably be swp_offset 0 of any swp_type other than 0.
> > > 
> > > Note that fork's copy_page_range() does not "copy ptes where a page
> > > fault will fill them correctly", so would in effect put a pte_none
> > > into the child where the parent has this uffd_wp entry.  I don't know
> > > anything about uffd versus fork, whether that would pose a problem.
> > 
> > Thanks for the idea, Hugh!
> > 
> > I thought about something similar today, but instead of swap entries, I was
> > thinking about constantly filling in a pte with a value of "_PAGE_PROTNONE |
> > _PAGE_UFFD_WP" when e.g. we'd like to zap a page with shmem+uffd-wp. I feel
> > like the fundamental idea is similar - we can somehow keep the pte with uffd-wp
> > information even if zapped/swapped-out, so as long as the shmem access will
> > fruther trap into the fault handler, then we can operate on that pte and read
> > that information out, like recover that pte into a normal pte (with swap/page
> > cache, and vma/addr information, we'll be able to) and then we can retry the
> > fault.
> 
> Yes, I think that should work too: I can't predict which way would cause
> less trouble.
> 
> We usually tend to keep away from protnone games, because NUMA balancing
> use of protnone is already confusing enough.

Yes, it is tricky.  However, it gives me the feeling that NUMA balancing and
its protnone trick provide a general solution for this kind of problem, where
we want to trap per-mm, per-pte page accesses.

With that, I currently slightly prefer to try the protnone way first, because
using a swp entry could be a bit misleading at first glance - note that when
this happens, we could have two states for this pte:

  1. Page in the shmem page cache
  2. Page in the shmem swap cache (so the page cache slot holds a swap entry
     value)

And actually there's a third state that should never happen as long as
userspace does the unprotect properly, but I guess we'd better take care of it
too:

  3. Page is not cached at all (page missing; logically this should not happen
     because when the page cache is evicted all ptes should have been removed,
     however since this pte is not linked to the page in any way, it could get
     left behind?  Then we should simply retry the fault after recovering the
     tricky pte into an all-zero pte)

It'll be a bit odd imho to use a swp entry to represent all these states; for
example, we could see that the pte is a swp-like entry while in reality the
shmem page sits right in the page cache...

So I'm thinking about whether we could first decouple the pte_protnone() idea
from NUMA balancing (currently they're closely tied together), let NUMA
balancing depend on protnone, and then uffd-wp+shmem can also depend on
protnone.

> 
> But those ptes will be pte_present(), so you must provide a pfn,

Could I ask why?

> and I think if you use the zero_pfn, vm_normal_page() will return false on it,
> and avoid special casing (and reference counting) it in various places.

I'll keep this in mind, thanks.

Meanwhile, this reminded me of another option - besides _PAGE_PROTNONE, can we
use _PAGE_SPECIAL?  At least its naming already tells us it's a special page.
We can also leverage the existing pte_special() checks here and there, and then
mimic what we do with pte_devmap(), maybe?

-- 
Peter Xu



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-03 18:02             ` Peter Xu
@ 2020-12-03 19:44               ` Andrea Arcangeli
  2020-12-04  2:30                 ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2020-12-03 19:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Thu, Dec 03, 2020 at 01:02:34PM -0500, Peter Xu wrote:
> On Wed, Dec 02, 2020 at 09:36:45PM -0800, Hugh Dickins wrote:
> > On Wed, 2 Dec 2020, Peter Xu wrote:
> > > On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote:
> > > > On Tue, 1 Dec 2020, Andrea Arcangeli wrote:
> > > > > 
> > > > > Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
> > > > > survive the pte invalidates in a way that remains associated to a
> > > > > certain vaddr in a single mm (so it can shoot itself in the foot if it
> > > > > wants, but it can't interfere with all other mm sharing the shmem
> > > > > file) would be welcome...
> > > > 
> > > > I think it has to be a new variety of swap-like non_swap_entry() pte,
> > > > see include/linux/swapops.h.  Anything else would be more troublesome.

Agreed.  Solving it by changing the unmapping of the ptes is some trouble too,
but less than adding new bitmaps to the vma that would have to be handled in
vma_merge/vma_split.

> > But those ptes will be pte_present(), so you must provide a pfn,
> 
> Could I ask why?

_PAGE_PROTNONE exists for only one reason: so that pte_present returns true
and the page is as good as mapped.  The pfn is real and mapped, and everything
is up and running fine, except that _PAGE_PRESENT is not set in the pte.
_PAGE_PROTNONE doesn't unmap the pte, it only triggers faults on a mapped pte.

When we set _PAGE_PROTNONE and clear _PAGE_PRESENT atomically, nothing changes
at all for pte_present and all the other common VM code, except that now you
get page faults.

So NUMA hinting faults use that, and it's a perfect fit there, because you want
to change nothing but still be notified on access.

Here instead you need to really unmap the page and lose any pfn or page
reference and everything along with it; just one bit needs to survive the
unmap: the _PAGE_UFFD_WP bit.

I tend to agree this needs to work more like the migration entry, as Hugh
suggested, or via an entirely new mechanism that keeps "vma specific state"
alive for filebacked mappings that get zapped but still have an intact vma.

The vma removal in the munmap() syscall is then the only point where the pte is
allowed to be cleared for good, and only then can the pte be freed.

Not even MADV_DONTNEED should be allowed to zero out the pte; it can drop
everything but that single bit.

> Meanwhile, this reminded me another option - besides _PAGE_PROTNONE, can we use
> _PAGE_SPECIAL?  That sounds better at least from its naming that it tells it's
> a special page already. We can also leverage existing pte_special() checks here
> and there, then mimic what we do with pte_devmap(), maybe?

That's also for mapped pages, VM_PFNMAP or similar, I think.

From memory, the !pte_present case for filebacked vmas only exists as a
migration entry, so I think we'll just add a branch to that case so that it can
disambiguate the migration entry from the _PAGE_UFFD_WP bit.

So we can reserve one bit in the migration entry format that is always enforced
to be zero when it is a real migration entry.

When that bit is set on a non-present pte of a filebacked vma, it will mean
_PAGE_UFFD_WP is set for that vaddr in that mm.

Then we need to pass a parameter to all pte zapping operations to tell whether
the unmap event is happening because the vma has been truncated, or whether
it's happening for any other reason.

If it's happening for any other reason (page truncation, MADV_DONTNEED, memory
pressure leading to swap/writepage), we just convert any present pte with
_PAGE_UFFD_WP set into the non-migration, non-present pte with the reserved
migration entry bit set.

If the present pte has no _PAGE_UFFD_WP, then it'll be zapped as usual,
regardless of whether VM_UFFD_WP is set on the vma.

If the pte zapping is instead invoked because of a vma truncation, meaning it's
the last unmap operation on that vaddr, the pte zap logic will be told to
ignore the _PAGE_UFFD_WP in any present pte, so the ptes will be zeroed out and
later freed as needed.
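
(A very rough sketch of that zap-side conversion, just to pin the idea down.
Everything except the generic pte/swap helpers and userfaultfd_wp() is
hypothetical - in particular the marker entry itself and how callers learn
that the vma is going away - and a real zap obviously also has to release the
page, flush TLBs, etc.)

#include <linux/mm.h>
#include <linux/swapops.h>
#include <linux/userfaultfd_k.h>

/* Hypothetical: however the reserved "uffd-wp was set here" entry
 * ends up being encoded. */
extern swp_entry_t make_uffd_wp_marker_entry(void);

static void zap_uffd_wp_pte(struct vm_area_struct *vma, unsigned long addr,
			    pte_t *ptep, bool vma_going_away)
{
	struct mm_struct *mm = vma->vm_mm;

	if (vma_going_away || !userfaultfd_wp(vma) || !pte_uffd_wp(*ptep)) {
		/* ordinary zap: the wr-protect state may be dropped */
		pte_clear(mm, addr, ptep);
		return;
	}
	/* keep only the per-vaddr wr-protect information alive */
	set_pte_at(mm, addr, ptep,
		   swp_entry_to_pte(make_uffd_wp_marker_entry()));
}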

Thanks,
Andrea



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-03 19:44               ` Andrea Arcangeli
@ 2020-12-04  2:30                 ` Peter Xu
  2020-12-04  4:10                   ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2020-12-04  2:30 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins
  Cc: Hugh Dickins, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Thu, Dec 03, 2020 at 02:44:14PM -0500, Andrea Arcangeli wrote:
> On Thu, Dec 03, 2020 at 01:02:34PM -0500, Peter Xu wrote:
> > On Wed, Dec 02, 2020 at 09:36:45PM -0800, Hugh Dickins wrote:
> > > On Wed, 2 Dec 2020, Peter Xu wrote:
> > > > On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote:
> > > > > On Tue, 1 Dec 2020, Andrea Arcangeli wrote:
> > > > > > 
> > > > > > Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit
> > > > > > survive the pte invalidates in a way that remains associated to a
> > > > > > certain vaddr in a single mm (so it can shoot itself in the foot if it
> > > > > > wants, but it can't interfere with all other mm sharing the shmem
> > > > > > file) would be welcome...
> > > > > 
> > > > > I think it has to be a new variety of swap-like non_swap_entry() pte,
> > > > > see include/linux/swapops.h.  Anything else would be more troublesome.
> 
> Agreed. Solving it by changing the unmapping of the ptes is also some
> trouble but less troublesome than adding new bitmaps to the vma to
> handle in vma_merge/vma_split.
> 
> > > But those ptes will be pte_present(), so you must provide a pfn,
> > 
> > Could I ask why?
> 
> _PAGE_PROTNONE exists only for one reason, so pte_present returns true
> and the page is as good as mapped, the pfn is real and mapped,
> everything is up and running fine except _PAGE_PRESENT is not set in
> the pte. _PAGE_PROTNONE doesn't unmap the pte, it only triggers faults
> on a mapped pte.
> 
> When we set _PAGE_PROTNONE and clear _PAGE_PRESENT atomically, nothing
> changes at all for all pte_present and all other VM common code,
> except now you get page faults.
> 
> So numa hinting faults use that and it's a perfect fit for that,
> because you just want to change nothing, but still be notified on
> access.
> 
> Here instead you need to really unmap the page and lose any pfn or
> page reference and everything along with it, just one bit need to
> survive the unmap: the _PAGE_UFFD_WP bit.
> 
> I tend to agree this needs to work more similarly to the migration
> entry like Hugh suggested or an entirely new mechanism to keep "vma
> specific state" alive, for filebacked mappings that get zapped but
> that have still a vma intact.

Thanks Andrea & Hugh. I think I get the point now.

I guess I missed the fact that !pte_present() means it's a swap entry by
definition...  So my previous thoughts on using either _PAGE_PROTNONE or
_PAGE_SPECIAL could be a bit silly, because without pte_present() they'll just
be misunderstood as swp entries, and then all the _PAGE_* bits won't make any
sense any more, since they could mean some strange (type, offset) tuple of a
swp entry.

So yes, I think we need a swp entry, with some special format.

> 
> The vma removal in munmap() syscall, is then the only point where the
> pte is only allowed to be cleared for good and only then the pte can
> be freed.
> 
> Not even MADV_DONTNEED should be allowed to zero out the pte, it can
> drop everything but that single bit.

Right.  Ideally I wouldn't expect any sane userspace that uses uffd-wp with
shmem to MADV_DONTNEED a page right under wr-protection, because it really
shouldn't do that...  However, going by the semantics of MADV_DONTNEED, it is
indeed the same case as the shmem fallocate code where there's a pre-unmap, so
we should be prepared for that too.

> 
> > Meanwhile, this reminded me another option - besides _PAGE_PROTNONE, can we use
> > _PAGE_SPECIAL?  That sounds better at least from its naming that it tells it's
> > a special page already. We can also leverage existing pte_special() checks here
> > and there, then mimic what we do with pte_devmap(), maybe?
> 
> That's also for mapped pages VM_PFNMAP or similar I think.
> 
> By memory the !pte_present case for filebacked vmas only exists as
> migration entry so I think we'll just add a branch to that case so
> that it can disambiguate the migration entry from the _PAGE_UFFDP_WP
> bit.
> 
> So we can reserve one bit in the migration entry that is always
> enforced zero when it is a migration entry.
> 
> When that bit is set on a non-present page in a filebacked vma, it
> will mean _UFFD_PAGE_WP is set for that vaddr in that mm.

I'm just afraid there's no space left for a migration entry, because migration
entries fill in the pfn information in the swp offset field rather than a real
offset (please refer to make_migration_entry())?  I assume the PFN can use any
bit.  Or did I miss anything?

I went back to see the original proposal from Hugh:

  IIUC you only need a single value, no need to carve out another whole
  swp_type: could probably be swp_offset 0 of any swp_type other than 0.

Hugh/Andrea, sorry if this is a stupid swap question: could you help explain
why swp_offset=0 won't be used by any swap device?  I believe it's correct,
it's just that I failed to figure out the reason myself. :(

Besides the above, to reuse _PAGE_SWP_UFFD_WP (as it's already there), maybe
we can also define the special swp entry as:

  swp_entry(0, _PAGE_SWP_UFFD_WP)

Then, as long as we check this against the vma so that we know it's not
anonymous memory, it's the special pte.  Combined as:

    vma_is_anonymous(vma)==false && swp_entry(0, _PAGE_SWP_UFFD_WP);

> 
> Then we need to pass a parameter in all pte zapping operations to tell
> if the unmap event is happening because the vma has been truncated, or
> if it's happening for any other reason.
> 
> If it's happening for any other reasons (page truncate, MADV_DONTNEED,
> memory pressure to swap/writepage) we just convert any present pte
> with _UFFD_PAGE_WP set, to the non-migration entry non-present pte
> with the reserved migration entry bit set.
> 
> If the present pte has no _UFFD_PAGE_WP then it'll be zapped as usual
> regardless of VM_UFFD_WP in the vma or not.
> 
> If the pte zapping is instead invoked because of a vma truncation, and
> it means it's the last unmap operation on that vaddr, the pte zap
> logic will be told to ignore the _UFFD_PAGE_WP in any present pte so
> the ptes will be zeroed out and later freed as needed.

Agreed.

Thanks,

-- 
Peter Xu



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-04  2:30                 ` Peter Xu
@ 2020-12-04  4:10                   ` Andrea Arcangeli
  2020-12-04  5:59                     ` Hugh Dickins
  2020-12-04 18:12                     ` Andrea Arcangeli
  0 siblings, 2 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2020-12-04  4:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

Hi Peter,

On Thu, Dec 03, 2020 at 09:30:51PM -0500, Peter Xu wrote:
> I'm just afraid there's no space left for a migration entry, because migration
> entries fills in the pfn information into swp offset field rather than a real
> offset (please refer to make_migration_entry())?  I assume PFN can use any bit.
> Or did I miss anything?
> 
> I went back to see the original proposal from Hugh:
> 
>   IIUC you only need a single value, no need to carve out another whole
>   swp_type: could probably be swp_offset 0 of any swp_type other than 0.
> 
> Hugh/Andrea, sorry if this is a stupid swap question: could you help explain
> why swp_offset=0 won't be used by any swap device?  I believe it's correct,
> it's just that I failed to figure out the reason myself. :(
> 

Hugh may want to review if I got it wrong, but there are basically three
ways.

swp_type would mean adding one more reserved value in addition to
SWP_MIGRATION_READ and SWP_MIGRATION_WRITE (kind of increasing
SWP_MIGRATION_NUM to 3).

swp_offset = 0 works in combination with SWP_MIGRATION_WRITE and
SWP_MIGRATION_READ if we enforce that pfn 0 is never used by the kernel (I'd
feel safer with pfn value -1UL truncated to the bits of the swp offset, since
the swp_entry format is common code).

The bit I was suggesting is just one more pte bit like _PAGE_SWP_UFFD_WP, one
that cannot ever be set in any swp entry today.  I assume it can't be
_PAGE_SWP_UFFD_WP itself, since that one can already be set, but you may want
to verify it...

It'd be set on the pte (not in the swap entry); then it doesn't matter much
what's inside the swp_entry anymore.  The pte value would be generated with
this:

 pte_swp_uffd_wp_unmap(swp_entry_to_pte(swp_entry(SWP_MIGRATION_READ, 0)))

(maybe SWP_MIGRATION_READ could also be 0, and then it would be enough to set
just that single bit in the pte and nothing else, all other bits zero)

We never store a raw swp entry in the pte (the raw swp entry is stored in the
xarray; it's the index of the swapcache).

To solve our unmap issue we only deal with pte storage (no xarray index
storage).  This is why it can also live in the arch-specific pte representation
of the swp entry; it doesn't need to be a special value defined in the common
swp entry code.

Since the swap entry to pte conversion is arch dependent, such a bit needs to
be defined by each arch (reserving an offset or type value in the swp entry
would solve it in common code instead).

#define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)

All bits below PROTNONE are available for software use, and we already use bit
1 (soft dirty) and bit 2 (uffd_wp).  The protnone bit 8 itself (the global bit)
must not be set, or the pte will look protnone and pte_present will be true.
Bit 7 is PSE, so it's also not available, because pte_present checks that too.

It appears you can pick any of bits 3, 4, 5 and 6 at your own choice, and it
doesn't look like we're running out of those yet (if we were, there would be a
bigger incentive to encode it as part of the swp entry format).  Example:

#define _PAGE_SWP_UFFD_WP_UNMAP	_PAGE_PWT

If that bit is set and pte_present is false, then everything else in that pte
is meaningless, and it means an uffd-wrprotected pte_none.
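
(In other words, the recognizer for this variant would be roughly the sketch
below; _PAGE_SWP_UFFD_WP_UNMAP is the hypothetical bit defined above, not an
existing flag.)

/* Sketch only: a zapped pte that carries nothing but the hypothetical
 * "was uffd-wp protected" software bit. */
static inline bool pte_uffd_wp_unmapped(pte_t pte)
{
	return !pte_present(pte) && !pte_none(pte) &&
	       (pte_val(pte) & _PAGE_SWP_UFFD_WP_UNMAP);
}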

So in the migration-entry/swapin page fault path, you could go one step back
and check the pte for such a bit; if it's set, it's not a migration entry.

If there's a read access, it should fill the page in with shmem_fault, keep the
pte wrprotected, and then set _PAGE_UFFD_WP on the pte.  If there's a write
access, it should invoke handle_userfault.

If for any reason the swp_entry reservation turns out to be simpler, that's ok
too; you'll see a huge lot more detail once you try to implement it, so you'll
be better able to judge later.  I'm greatly simplifying everything, but this is
no simple feat...

Thanks,
Andrea



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-04  4:10                   ` Andrea Arcangeli
@ 2020-12-04  5:59                     ` Hugh Dickins
  2020-12-04 16:50                       ` Peter Xu
  2020-12-04 18:12                     ` Andrea Arcangeli
  1 sibling, 1 reply; 18+ messages in thread
From: Hugh Dickins @ 2020-12-04  5:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Peter Xu, Hugh Dickins, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Thu, 3 Dec 2020, Andrea Arcangeli wrote:
> On Thu, Dec 03, 2020 at 09:30:51PM -0500, Peter Xu wrote:
> > I'm just afraid there's no space left for a migration entry, because migration
> > entries fills in the pfn information into swp offset field rather than a real
> > offset (please refer to make_migration_entry())?  I assume PFN can use any bit.
> > Or did I miss anything?
> > 
> > I went back to see the original proposal from Hugh:
> > 
> >   IIUC you only need a single value, no need to carve out another whole
> >   swp_type: could probably be swp_offset 0 of any swp_type other than 0.
> > 
> > Hugh/Andrea, sorry if this is a stupid swap question: could you help explain
> > why swp_offset=0 won't be used by any swap device?  I believe it's correct,
> > it's just that I failed to figure out the reason myself. :(
> > 

It's because swp_offset 0 is the offset of the swap header, and if we ever
used that when allocating swap, then the swap header would get overwritten, and
that swap area would become unrecognizable next time.

But I said it would be usable for UFFD with any swp_type other than 0,
because a swap entry of type 0, offset 0 is simply 0, which looks just
like no swap entry at all, and there are (or were: I might not be
up-to-date) benign races where a swap entry might get passed down but
then found to be 0, and that was understandable and permitted (yes,
I still see the "if (!entry.val) goto out;" in __swap_info_get()).

And that might be related to pte_none() being 0 on most architectures
(not s390 IIRC): we need to distinguish none from swap.  Though that
all gets complicated by the way the swp_entry is munged before being
put into a pte, and the x86 swap munging got more complicated when
L1TF was revealed (and accompanied by prot none munging too) -
search git log of v4.19 for x86/speculation/l1tf if you need to.
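
(In terms of the generic swp_entry_t encoding in include/linux/swapops.h,
before any arch-specific pte munging, the point is simply the following; the
snippet is purely illustrative.)

#include <linux/swapops.h>

/* Illustration only: type 0 / offset 0 encodes to the raw value 0, which
 * is indistinguishable from "no swap entry at all", while offset 0 of any
 * non-zero type yields a distinct, reservable value. */
static inline bool uffd_wp_marker_candidate_ok(void)
{
	swp_entry_t none_like = swp_entry(0, 0);	/* none_like.val == 0 */
	swp_entry_t candidate = swp_entry(1, 0);	/* candidate.val != 0 */

	return candidate.val != none_like.val;
}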

> 
> Hugh may want to review if I got it wrong, but there's basically three
> ways.
> 
> swp_type would mean adding one more reserved value in addition of
> SWP_MIGRATION_READ and SWP_MIGRATION_WRITE (kind of increasing
> SWP_MIGRATION_NUM to 3).

I'm not very keen on actually using any of the SWP_MIGRATION defines,
partly because in principle UFFD should not depend on CONFIG_MIGRATION,
partly because the uffd_wp entry would not behave anything like a
migration entry (whose pfn should always indicate a locked page).

swp_offset 0 of swp_type 1 perhaps?

> 
> swp_offset = 0 works in combination of SWP_MIGRATION_WRITE and
> SWP_MIGRATION_READ if we enforce pfn 0 is never used by the kernel
> (I'd feel safer with pfn value -1UL truncated to the bits of the swp
> offset, since the swp_entry format is common code).
> 
> The bit I was suggesting is just one more bit like _PAGE_SWP_UFFD_WP
> from the pte, one that cannot ever be set in any swp entry today. I
> assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
> you may want to verify it...

I don't see why you would need another bit for this.

The code that checks non-present non-none entries in page table,
for whether they are actually swap or migration entries or whatever,
would now also check for swp_offset 0 of swp_type 1 and go off to
the UFFD WP processing if so.

I didn't pay much attention to what's below; it seemed over-complicated.
And I don't think Peter's PROT_NONE alternative was unworkable, but it would
have to be more careful about the pfn and L1TF than shown.  And I am more
comfortable focusing on the swap-like direction than thinking in two
directions at once - never my strength!

Hugh

> 
> It'd be set on the pte (not in the swap entry), then it doesn't matter
> much what's inside the swp_entry anymore. The pte value would be
> generated with this:
> 
>  pte_swp_uffd_wp_unmap(swp_entry_to_pte(swp_entry(SWP_MIGRATION_READ, 0)))
> 
> (maybe SWP_MIGRATION_READ could also be 0 and then it can be just
> enough to set that single bit in the pte and nothing else, all other
> bits zero)
> 
> We never store a raw swp entry in the pte (the raw swp entry is stored
> in the xarray, it's the index of the swapcache).
> 
> To solve our unmap issue we only deal with pte storage (no xarray
> index storage). This is why it can also be in the arch specific pte
> representation of the swp entry, it doesn't need to be a special value
> defined in the swp entry common code.
> 
> Being the swap entry to pte conversion arch dependent, such bit needs
> to be defined by each arch (reserving a offset or type value in swp
> entry would solve it in the common code).
> 
> #define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)
> 
> All bits below PROTNONE are available for software use and we use bit
> 1 (soft dirty) 2 (uffd_wp). protnone bit 8 itself (global bit) must
> not be set or it'll look protnone and pte_present will be true. Bit 7
> is PSE so it's also not available because pte_present checks that
> too.
> 
> It appears you can pick between bit 3 4 5 6 at your own choice and it
> doesn't look like we're running out of those yet (if we were there
> would be a bigger incentive to encode it as part of the swp entry
> format). Example:
> 
> #define _PAGE_SWP_UFFD_WP_UNMAP	_PAGE_PWT
> 
> If that bit it set and pte_present is false, then everything else in
> that that pte is meaningless and it means uffd wrprotected
> pte_none.
> 
> So in the migration-entry/swapin page fault path, you could go one
> step back and check the pte for such bit, if it's set it's not a
> migration entry.
> 
> If there's a read access it should fill the page mark with
> shmem_fault, keep the pte wrprotected and then set _PAGE_UFFD_WP on
> the pte. If there's a write access it should invoke handle_userfault.
> 
> If there's any reason where the swp_entry reservation is simpler
> that's ok too, you'll see an huge lot of more details once you try to
> implement it so you'll be better able to judje later. I'm greatly
> simplifying everything but this is not simple feat...
> 
> Thanks,
> Andrea


* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-04  5:59                     ` Hugh Dickins
@ 2020-12-04 16:50                       ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2020-12-04 16:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Thu, Dec 03, 2020 at 09:59:50PM -0800, Hugh Dickins wrote:
> On Thu, 3 Dec 2020, Andrea Arcangeli wrote:
> > On Thu, Dec 03, 2020 at 09:30:51PM -0500, Peter Xu wrote:
> > > I'm just afraid there's no space left for a migration entry, because migration
> > > entries fills in the pfn information into swp offset field rather than a real
> > > offset (please refer to make_migration_entry())?  I assume PFN can use any bit.
> > > Or did I miss anything?
> > > 
> > > I went back to see the original proposal from Hugh:
> > > 
> > >   IIUC you only need a single value, no need to carve out another whole
> > >   swp_type: could probably be swp_offset 0 of any swp_type other than 0.
> > > 
> > > Hugh/Andrea, sorry if this is a stupid swap question: could you help explain
> > > why swp_offset=0 won't be used by any swap device?  I believe it's correct,
> > > it's just that I failed to figure out the reason myself. :(
> > > 
> 
> It's because swp_offset 0 is the offset of the swap header, and if we
> ever used that when allocating swap, then the swap header would get
> overwritten, and that swap area become unrecognizable next time.
> 
> But I said it would be usable for UFFD with any swp_type other than 0,
> because a swap entry of type 0, offset 0 is simply 0, which looks just
> like no swap entry at all, and there are (or were: I might not be
> up-to-date) benign races where a swap entry might get passed down but
> then found to be 0, and that was understandable and permitted (yes,
> I still see the "if (!entry.val) goto out;" in __swap_info_get()).
> 
> And that might be related to pte_none() being 0 on most architectures
> (not s390 IIRC): we need to distinguish none from swap.  Though that
> all gets complicated by the way the swp_entry is munged before being
> put into a pte, and the x86 swap munging got more complicated when
> L1TF was revealed (and accompanied by prot none munging too) -
> search git log of v4.19 for x86/speculation/l1tf if you need to.

My thanks to both of you for explaining the details.

> 
> > 
> > Hugh may want to review if I got it wrong, but there's basically three
> > ways.
> > 
> > swp_type would mean adding one more reserved value in addition of
> > SWP_MIGRATION_READ and SWP_MIGRATION_WRITE (kind of increasing
> > SWP_MIGRATION_NUM to 3).
> 
> I'm not very keen on actually using any of the SWP_MIGRATION defines,
> partly because in principle UFFD should not depend on CONFIG_MIGRATION,
> partly because the uffd_wp entry would not behave anything like a
> migration entry (whose pfn should always indicate a locked page).
> 
> swp_offset 0 of swp_type 1 perhaps?
> 
> > 
> > swp_offset = 0 works in combination of SWP_MIGRATION_WRITE and
> > SWP_MIGRATION_READ if we enforce pfn 0 is never used by the kernel
> > (I'd feel safer with pfn value -1UL truncated to the bits of the swp
> > offset, since the swp_entry format is common code).
> > 
> > The bit I was suggesting is just one more bit like _PAGE_SWP_UFFD_WP
> > from the pte, one that cannot ever be set in any swp entry today. I
> > assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
> > you may want to verify it...
> 
> I don't see why you would need another bit for this.
> 
> The code that checks non-present non-none entries in page table,
> for whether they are actually swap or migration entries or whatever,
> would now also check for swp_offset 0 of swp_type 1 and go off to
> the UFFD WP processing if so.
> 
> I didn't pay much attention to below, it seemed over-complicated.
> And I don't think Peter's PROT_NONE alternative was unworkable,
> but would have to be more careful about pfn and L1TF than shown.
> And I am more comfortable to focus on the swap-like direction,
> than think in two directions at once - never my strength!

Yes, I think both of them may work, but I'll follow your advice and use swap
entries, assuming that's easier and cleaner than _PAGE_PROTNONE.  For example,
it makes more sense for pte_present() to return false for such an uffd-wp
reserved pte.  It also means I won't make _PAGE_PROTNONE even more complicated.

So I guess I'll start with type==1 && offset==0.

(PS: I still think "swp_entry(0, _UFFD_SWP_UFFD_WP) && !vma_is_anonymous(vma)"
 could also be a good candidate compared to "swp_entry(1, 0)", considering
 type==1 here is kind of randomly chosen from all the numbers other than 0;
 but maybe that's not extremely important - the major logic should be the same)
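
In code, what I have in mind for the type==1 && offset==0 variant is roughly
the following sketch (helper names invented here, nothing settled, and the
hook points still need to be figured out):

#include <linux/mm.h>
#include <linux/swapops.h>

/* Would a non-present, non-none pte be the reserved uffd-wp marker? */
static bool pte_is_uffd_wp_marker(pte_t pte)
{
	swp_entry_t entry;

	if (pte_present(pte) || pte_none(pte))
		return false;

	entry = pte_to_swp_entry(pte);
	/* tentatively reserved value: type 1, offset 0 */
	return swp_type(entry) == 1 && swp_offset(entry) == 0;
}

/* Installing the marker would then be roughly: */
static void set_uffd_wp_marker(struct mm_struct *mm, unsigned long addr,
			       pte_t *ptep)
{
	set_pte_at(mm, addr, ptep, swp_entry_to_pte(swp_entry(1, 0)));
}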

Thanks!

-- 
Peter Xu



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-04  4:10                   ` Andrea Arcangeli
  2020-12-04  5:59                     ` Hugh Dickins
@ 2020-12-04 18:12                     ` Andrea Arcangeli
  2020-12-04 19:23                       ` Peter Xu
  1 sibling, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2020-12-04 18:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Thu, Dec 03, 2020 at 11:10:18PM -0500, Andrea Arcangeli wrote:
> from the pte, one that cannot ever be set in any swp entry today. I
> assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
> you may want to verify it...

I thought more about the above, and I think the already existing
pte_swp_mkuffd_wp will just be enough without having to reserve an
extra bitflag if we encode it as a non migration entry.

The check:

if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma && !is_migration_entry)

should be enough to disambiguate it. When setting it, it'd be enough
to set the pte to the value _PAGE_SWP_UFFD_WP.

Although if you prefer to check for:

if (!pte_present && !pte_none && swp_type == 1 && swp_offset == 0 && not_anonymous_vma && !is_migration_entry)

that would do as well.

It's up to you; my preference is just to reuse _PAGE_SWP_UFFD_WP since it
already has to exist, and all the pte_swp_*uffd* methods are already
available, or uffd-wp couldn't work in the first place.
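
Spelled out, the first check above would look roughly like this (only a
sketch, the helper names are invented here, and it assumes an arch that
actually implements the swap pte uffd-wp bit, e.g. x86):

#include <linux/mm.h>
#include <linux/swapops.h>

static bool pte_is_uffd_wp_special(struct vm_area_struct *vma, pte_t pte)
{
	swp_entry_t entry;

	if (pte_present(pte) || pte_none(pte))
		return false;
	if (!pte_swp_uffd_wp(pte))
		return false;
	if (vma_is_anonymous(vma))
		return false;

	entry = pte_to_swp_entry(pte);
	return !is_migration_entry(entry);
}

/* and setting it, as said above, would just be: */
static void set_uffd_wp_special(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep)
{
	/* on x86 this is literally __pte(_PAGE_SWP_UFFD_WP) */
	set_pte_at(mm, addr, ptep, pte_swp_mkuffd_wp(__pte(0)));
}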

Thanks,
Andrea



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-04 18:12                     ` Andrea Arcangeli
@ 2020-12-04 19:23                       ` Peter Xu
  2020-12-04 19:37                         ` Andrea Arcangeli
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2020-12-04 19:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

Hi, Andrea,

On Fri, Dec 04, 2020 at 01:12:56PM -0500, Andrea Arcangeli wrote:
> On Thu, Dec 03, 2020 at 11:10:18PM -0500, Andrea Arcangeli wrote:
> > from the pte, one that cannot ever be set in any swp entry today. I
> > assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but
> > you may want to verify it...
> 
> I thought more about the above, and I think the already existing
> pte_swp_mkuffd_wp will just be enough without having to reserve an
> extra bitflag if we encode it as a non migration entry.
> 
> The check:
> 
> if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma && !is_migration_entry)

[1]

> 
> should be enough to disambiguate it. When setting it, it'd be enough
> to set the pte to the value _PAGE_SWP_UFFD_WP.
> 
> Although if you prefer to check for:
> 
> if (!pte_present && !pte_none && swp_type == 1 && swp_offset == 0 && not_anonymous_vma && !is_migration_entry)

[2]

> 
> that would do as well.
> 
> It's up to you; my preference is just to reuse _PAGE_SWP_UFFD_WP since it
> already has to exist, and all the pte_swp_*uffd* methods are already
> available, or uffd-wp couldn't work in the first place.

Yes, I had the same thought - it would be nice if this special pte could
still be related to _PAGE_SWP_UFFD_WP.

To me, [2] above looks exactly the same as what Hugh suggested, i.e. checking
against swp_type==1 && swp_offset==0 (see the sketch after the list below), since:

  - do_swap_page() basically already means "!pte_present && !pte_none"

  - "not_anonymous_vma" seems optional if uffd-wp+shmem will be the first user
    of such a swp entry

  - "!is_migration_entry" seems optional since if swp_type==1, it will never be
    a migration entry
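
Put in do_swap_page() terms, the whole thing would reduce to something like
this sketch (just an illustration, the helper name is made up):

#include <linux/mm.h>
#include <linux/swapops.h>

/* do_swap_page() already guarantees !pte_present() && !pte_none() */
static bool vmf_is_uffd_wp_marker(struct vm_fault *vmf)
{
	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

	return swp_type(entry) == 1 && swp_offset(entry) == 0;
}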

As for [1] above, that's the alternative I asked about besides the current
type==1 && offset==0 proposal.  Quoting one of the previous emails:

> So I guess I'll start with type==1 && offset==0.
> 
> (PS: I still think "swp_entry(0, _UFFD_SWP_UFFD_WP) && !vma_is_anonymous(vma)"
>  could also be a good candidate compared to "swp_entry(1, 0)", considering
>  type==1 here is kind of randomly chosen from all the numbers other than 0;
>  but maybe that's not extremely important - the major logic should be the same)

If we see [1]:

  if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma && !is_migration_entry)

Then it's fundamentally the same as:

  swp_entry(0, _UFFD_SWP_UFFD_WP) && !vma_is_anonymous(vma)

Reasons similar to above.

Thanks!

-- 
Peter Xu



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-04 19:23                       ` Peter Xu
@ 2020-12-04 19:37                         ` Andrea Arcangeli
  2020-12-04 20:21                           ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Arcangeli @ 2020-12-04 19:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Fri, Dec 04, 2020 at 02:23:29PM -0500, Peter Xu wrote:
> If we see [1]:
> 
>   if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma && !is_migration_entry)
> 
> Then it's fundamentally the same as:
> 
>   swp_entry(0, _UFFD_SWP_UFFD_WP) && !vma_is_anonymous(vma)

Yes, conceptually it's the same, but in practice you end up with a different
value in the raw pte if you set swp_offset to _UFFD_SWP_UFFD_WP.

Setting swp_offset to _UFFD_SWP_UFFD_WP is just confusing; if you reserve a
magic value in the swap entry, it's better to use the magic type 1 offset 0
than to even touch _UFFD_SWP_UFFD_WP.

pte_swp_uffd_wp() and _UFFD_SWP_UFFD_WP both operate on the raw pte value.

swp_entry(0, _UFFD_SWP_UFFD_WP), by contrast, is not the raw pte value: it
first has to be converted to a raw pte with swp_entry_to_pte(), so the
_UFFD_SWP_UFFD_WP value gets shifted left.

If we use _UFFD_SWP_UFFD_WP, it looks much cleaner to keep it in the pte, not
in the swp entry, since then you can use the already existing methods that
take a pte_t (not a swp_entry_t) as input.
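
To illustrate the difference (only a sketch; _PAGE_SWP_UFFD_WP is presumably
the x86 name of the bit called _UFFD_SWP_UFFD_WP above, and this assumes an
arch that implements the swap pte uffd-wp bit):

#include <linux/mm.h>
#include <linux/swapops.h>

static void __maybe_unused uffd_wp_marker_flavours(void)
{
	/* raw-pte flavour: the bit sits where pte_swp_uffd_wp() expects it */
	pte_t raw_marker = pte_swp_mkuffd_wp(__pte(0));

	/*
	 * swp_entry flavour: swp_entry_to_pte() munges the entry into the
	 * arch swap-pte format, so a value stored in swp_offset no longer
	 * lines up with the raw _PAGE_SWP_UFFD_WP bit.
	 */
	pte_t swp_marker = swp_entry_to_pte(swp_entry(0, _PAGE_SWP_UFFD_WP));

	(void)raw_marker;
	(void)swp_marker;
}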

Thanks,
Andrea



* Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
  2020-12-04 19:37                         ` Andrea Arcangeli
@ 2020-12-04 20:21                           ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2020-12-04 20:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Matthew Wilcox, linux-kernel, linux-mm,
	Andrew Morton, Mike Rapoport, David Hildenbrand

On Fri, Dec 04, 2020 at 02:37:17PM -0500, Andrea Arcangeli wrote:
> If we use _UFFD_SWP_UFFD_WP, it looks much cleaner to keep it in the pte, not
> in the swp entry, since then you can use the already existing methods that
> take a pte_t (not a swp_entry_t) as input.

Ah, I see now.  Yes it looks nicer if we don't even need swp_entry_t knowledge
to recognize the special pte.  So I'll try all these ideas and update.  Thanks,

-- 
Peter Xu



end of thread, other threads:[~2020-12-04 20:23 UTC | newest]

Thread overview: 18+ messages
2020-11-30 23:06 [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads Peter Xu
2020-12-01  9:30 ` David Hildenbrand
2020-12-01 12:59 ` Matthew Wilcox
2020-12-01 22:30   ` Peter Xu
2020-12-02  0:02     ` Andrea Arcangeli
2020-12-02 22:37       ` Hugh Dickins
2020-12-02 23:41         ` Peter Xu
2020-12-03  5:36           ` Hugh Dickins
2020-12-03 18:02             ` Peter Xu
2020-12-03 19:44               ` Andrea Arcangeli
2020-12-04  2:30                 ` Peter Xu
2020-12-04  4:10                   ` Andrea Arcangeli
2020-12-04  5:59                     ` Hugh Dickins
2020-12-04 16:50                       ` Peter Xu
2020-12-04 18:12                     ` Andrea Arcangeli
2020-12-04 19:23                       ` Peter Xu
2020-12-04 19:37                         ` Andrea Arcangeli
2020-12-04 20:21                           ` Peter Xu
