All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
@ 2021-09-22 17:51 Peter Xu
  2021-09-22 18:21 ` David Hildenbrand
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Peter Xu @ 2021-09-22 17:51 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, peterx, Andrea Arcangeli, Axel Rasmussen,
	Hugh Dickins, Nadav Amit

We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged
scanning right after we detected a uffd-wp armed pte (either present, or swap).

It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP
enabled VMAs.  Checking against the vma flag would be more efficient, and good
enough.  To be explicit, we could still be able to merge some thps for
VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed
ptes, however that's not a major target for thp collapse anyways.

This mostly reverts commit e1e267c7928fe387e5e1cffeafb0de2d0473663a, but
instead we do the same check at vma level, so it's not a bugfix.

This also paves the way for file-backed uffd-wp support, as the VM_UFFD_WP flag
will work for file-backed too.

After this patch, the error for khugepaged for these regions will switch from
SCAN_PTE_UFFD_WP to SCAN_VMA_CHECK.

Since uffd minor mode should not allow thp as well, do the same thing for minor
mode to stop early on trying to collapse pages in khugepaged.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---

Axel: as I asked in the other thread, please help check whether minor mode will
work properly with shmem thp enabled.  If not, I feel like this patch could be
part of that effort at last, but it's also possible that I missed something.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/trace/events/huge_memory.h |  1 -
 mm/khugepaged.c                    | 26 +++-----------------------
 2 files changed, 3 insertions(+), 24 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4fdb14a81108..53532f5925c3 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -15,7 +15,6 @@
 	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
 	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
-	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
 	EM( SCAN_PAGE_RO,		"no_writable_page")		\
 	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
 	EM( SCAN_PAGE_NULL,		"page_null")			\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 045cc579f724..3afe66d48db0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -31,7 +31,6 @@ enum scan_result {
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
 	SCAN_PTE_NON_PRESENT,
-	SCAN_PTE_UFFD_WP,
 	SCAN_PAGE_RO,
 	SCAN_LACK_REFERENCED_PAGE,
 	SCAN_PAGE_NULL,
@@ -467,6 +466,9 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
 		return false;
 	if (vma_is_temporary_stack(vma))
 		return false;
+	/* Don't allow thp merging for wp/minor enabled uffd regions */
+	if (userfaultfd_wp(vma) || userfaultfd_minor(vma))
+		return false;
 	return !(vm_flags & VM_NO_KHUGEPAGED);
 }
 
@@ -1246,15 +1248,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		pte_t pteval = *_pte;
 		if (is_swap_pte(pteval)) {
 			if (++unmapped <= khugepaged_max_ptes_swap) {
-				/*
-				 * Always be strict with uffd-wp
-				 * enabled swap entries.  Please see
-				 * comment below for pte_uffd_wp().
-				 */
-				if (pte_swp_uffd_wp(pteval)) {
-					result = SCAN_PTE_UFFD_WP;
-					goto out_unmap;
-				}
 				continue;
 			} else {
 				result = SCAN_EXCEED_SWAP_PTE;
@@ -1270,19 +1263,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 				goto out_unmap;
 			}
 		}
-		if (pte_uffd_wp(pteval)) {
-			/*
-			 * Don't collapse the page if any of the small
-			 * PTEs are armed with uffd write protection.
-			 * Here we can also mark the new huge pmd as
-			 * write protected if any of the small ones is
-			 * marked but that could bring unknown
-			 * userfault messages that falls outside of
-			 * the registered range.  So, just be simple.
-			 */
-			result = SCAN_PTE_UFFD_WP;
-			goto out_unmap;
-		}
 		if (pte_write(pteval))
 			writable = true;
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 17:51 [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently Peter Xu
@ 2021-09-22 18:21 ` David Hildenbrand
  2021-09-22 18:58   ` Peter Xu
  2021-09-22 19:33 ` Peter Xu
  2021-09-22 20:49   ` Axel Rasmussen
  2 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2021-09-22 18:21 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Andrew Morton, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Nadav Amit

On 22.09.21 19:51, Peter Xu wrote:
> We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged
> scanning right after we detected a uffd-wp armed pte (either present, or swap).
> 
> It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP
> enabled VMAs.  Checking against the vma flag would be more efficient, and good
> enough.  To be explicit, we could still be able to merge some thps for
> VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed
> ptes, however that's not a major target for thp collapse anyways.
> 

Hm, are we sure there are no users that could benefit from the current 
handling?

I'm thinking about long-term uffd-wp users that effectively end up 
wp-ing on only a small fraction of a gigantic vma, or always wp complete 
blocks in a certain granularity in the range of THP.

Databases come to mind ...

In the past, I played with the idea of using uffd-wp to protect access 
to logically unplugged memory regions part of virtio-mem devices in QEMU 
-- which would exactly do something as described above. But I'll most 
probably be using ordinary uffd once any users that might read such 
logically unplugged memory have been "fixed".

The change itself looks sane to me AFAIKT.

> This mostly reverts commit e1e267c7928fe387e5e1cffeafb0de2d0473663a, but
> instead we do the same check at vma level, so it's not a bugfix.
> 
> This also paves the way for file-backed uffd-wp support, as the VM_UFFD_WP flag
> will work for file-backed too.
> 
> After this patch, the error for khugepaged for these regions will switch from
> SCAN_PTE_UFFD_WP to SCAN_VMA_CHECK.
> 
> Since uffd minor mode should not allow thp as well, do the same thing for minor
> mode to stop early on trying to collapse pages in khugepaged.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> 
> Axel: as I asked in the other thread, please help check whether minor mode will
> work properly with shmem thp enabled.  If not, I feel like this patch could be
> part of that effort at last, but it's also possible that I missed something.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   include/trace/events/huge_memory.h |  1 -
>   mm/khugepaged.c                    | 26 +++-----------------------
>   2 files changed, 3 insertions(+), 24 deletions(-)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4fdb14a81108..53532f5925c3 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -15,7 +15,6 @@
>   	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
>   	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
>   	EM( SCAN_PTE_NON_PRESENT,	"pte_non_present")		\
> -	EM( SCAN_PTE_UFFD_WP,		"pte_uffd_wp")			\
>   	EM( SCAN_PAGE_RO,		"no_writable_page")		\
>   	EM( SCAN_LACK_REFERENCED_PAGE,	"lack_referenced_page")		\
>   	EM( SCAN_PAGE_NULL,		"page_null")			\
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 045cc579f724..3afe66d48db0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -31,7 +31,6 @@ enum scan_result {
>   	SCAN_EXCEED_SWAP_PTE,
>   	SCAN_EXCEED_SHARED_PTE,
>   	SCAN_PTE_NON_PRESENT,
> -	SCAN_PTE_UFFD_WP,
>   	SCAN_PAGE_RO,
>   	SCAN_LACK_REFERENCED_PAGE,
>   	SCAN_PAGE_NULL,
> @@ -467,6 +466,9 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
>   		return false;
>   	if (vma_is_temporary_stack(vma))
>   		return false;
> +	/* Don't allow thp merging for wp/minor enabled uffd regions */
> +	if (userfaultfd_wp(vma) || userfaultfd_minor(vma))
> +		return false;
>   	return !(vm_flags & VM_NO_KHUGEPAGED);
>   }
>   
> @@ -1246,15 +1248,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		pte_t pteval = *_pte;
>   		if (is_swap_pte(pteval)) {
>   			if (++unmapped <= khugepaged_max_ptes_swap) {
> -				/*
> -				 * Always be strict with uffd-wp
> -				 * enabled swap entries.  Please see
> -				 * comment below for pte_uffd_wp().
> -				 */
> -				if (pte_swp_uffd_wp(pteval)) {
> -					result = SCAN_PTE_UFFD_WP;
> -					goto out_unmap;
> -				}
>   				continue;
>   			} else {
>   				result = SCAN_EXCEED_SWAP_PTE;
> @@ -1270,19 +1263,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   				goto out_unmap;
>   			}
>   		}
> -		if (pte_uffd_wp(pteval)) {
> -			/*
> -			 * Don't collapse the page if any of the small
> -			 * PTEs are armed with uffd write protection.
> -			 * Here we can also mark the new huge pmd as
> -			 * write protected if any of the small ones is
> -			 * marked but that could bring unknown
> -			 * userfault messages that falls outside of
> -			 * the registered range.  So, just be simple.
> -			 */
> -			result = SCAN_PTE_UFFD_WP;
> -			goto out_unmap;
> -		}
>   		if (pte_write(pteval))
>   			writable = true;
>   
> 


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 18:21 ` David Hildenbrand
@ 2021-09-22 18:58   ` Peter Xu
  2021-09-22 19:29       ` Yang Shi
  2021-09-24 10:05     ` David Hildenbrand
  0 siblings, 2 replies; 21+ messages in thread
From: Peter Xu @ 2021-09-22 18:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Andrew Morton, Andrea Arcangeli,
	Axel Rasmussen, Hugh Dickins, Nadav Amit

On Wed, Sep 22, 2021 at 08:21:40PM +0200, David Hildenbrand wrote:
> On 22.09.21 19:51, Peter Xu wrote:
> > We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged
> > scanning right after we detected a uffd-wp armed pte (either present, or swap).
> > 
> > It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP
> > enabled VMAs.  Checking against the vma flag would be more efficient, and good
> > enough.  To be explicit, we could still be able to merge some thps for
> > VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed
> > ptes, however that's not a major target for thp collapse anyways.
> > 
> 
> Hm, are we sure there are no users that could benefit from the current
> handling?
> 
> I'm thinking about long-term uffd-wp users that effectively end up wp-ing on
> only a small fraction of a gigantic vma, or always wp complete blocks in a
> certain granularity in the range of THP.

Yes, that's a good question.

> 
> Databases come to mind ...

One thing to mention is that this patch didn't forbid thp being used within a
uffd-wp-ed range.  E.g., we still allow thp exist, we can uffd-wp a thp and
it'll split only until when the thp is written.

While what this patch does is it stops khugepaged from proactively merging
those small pages into thps as long as VM_UFFD_WP|VM_UFFD_MINOR is set.  It may
still affect some user, but it's not a complete disable on thp.

> 
> In the past, I played with the idea of using uffd-wp to protect access to
> logically unplugged memory regions part of virtio-mem devices in QEMU --
> which would exactly do something as described above. But I'll most probably
> be using ordinary uffd once any users that might read such logically
> unplugged memory have been "fixed".

Yes, even if you'd like to keep using uffd-wp that sounds like a very
reasonable scenario.

> 
> The change itself looks sane to me AFAIKT.

So one major motivation of this patch of mine is to prepare for shmem, because
the old commit obviously only covered anonymous.

But after a 2nd thought, I just noticed shmem shouldn't have a problem with
khugepaged merging at all!

The thing is, when khugepaged is merging a shmem thp, unlike anonymous, it'll
not merge the ptes into a pmd, but it'll simply zap the ptes.  It means all
uffd-wp tracking information won't be lost even if merging happened, those info
will still be kept inside pgtables using (the upcoming) pte markers.

When faulted, we'll just do small page mappings while it won't stop the shmem
thp from being mapped hugely in other mm, afaict.

With that in mind, indeed I see this patch less necessary to be merged; so for
sparsely wr-protected vmas like virtio-mem we can still keep some of the ranges
mergeable, that sounds a good thing to keep it as-is.

NACK myself for now: let's not lose that good property of both thp+uffd-wp so
easily, and I'll think more of it.

(To Axel: my question to minor mode handling thp still stands, I think..)

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 18:58   ` Peter Xu
@ 2021-09-22 19:29       ` Yang Shi
  2021-09-24 10:05     ` David Hildenbrand
  1 sibling, 0 replies; 21+ messages in thread
From: Yang Shi @ 2021-09-22 19:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linux MM,
	Andrew Morton, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Nadav Amit

On Wed, Sep 22, 2021 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 22, 2021 at 08:21:40PM +0200, David Hildenbrand wrote:
> > On 22.09.21 19:51, Peter Xu wrote:
> > > We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged
> > > scanning right after we detected a uffd-wp armed pte (either present, or swap).
> > >
> > > It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP
> > > enabled VMAs.  Checking against the vma flag would be more efficient, and good
> > > enough.  To be explicit, we could still be able to merge some thps for
> > > VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed
> > > ptes, however that's not a major target for thp collapse anyways.
> > >
> >
> > Hm, are we sure there are no users that could benefit from the current
> > handling?
> >
> > I'm thinking about long-term uffd-wp users that effectively end up wp-ing on
> > only a small fraction of a gigantic vma, or always wp complete blocks in a
> > certain granularity in the range of THP.
>
> Yes, that's a good question.
>
> >
> > Databases come to mind ...
>
> One thing to mention is that this patch didn't forbid thp being used within a
> uffd-wp-ed range.  E.g., we still allow thp exist, we can uffd-wp a thp and
> it'll split only until when the thp is written.
>
> While what this patch does is it stops khugepaged from proactively merging
> those small pages into thps as long as VM_UFFD_WP|VM_UFFD_MINOR is set.  It may
> still affect some user, but it's not a complete disable on thp.
>
> >
> > In the past, I played with the idea of using uffd-wp to protect access to
> > logically unplugged memory regions part of virtio-mem devices in QEMU --
> > which would exactly do something as described above. But I'll most probably
> > be using ordinary uffd once any users that might read such logically
> > unplugged memory have been "fixed".
>
> Yes, even if you'd like to keep using uffd-wp that sounds like a very
> reasonable scenario.
>
> >
> > The change itself looks sane to me AFAIKT.
>
> So one major motivation of this patch of mine is to prepare for shmem, because
> the old commit obviously only covered anonymous.
>
> But after a 2nd thought, I just noticed shmem shouldn't have a problem with
> khugepaged merging at all!
>
> The thing is, when khugepaged is merging a shmem thp, unlike anonymous, it'll
> not merge the ptes into a pmd, but it'll simply zap the ptes.  It means all
> uffd-wp tracking information won't be lost even if merging happened, those info
> will still be kept inside pgtables using (the upcoming) pte markers.

khugepqged does remove the pgtables. Please check out
retract_page_tables(). The pmd will be cleared and the ptes will be
freed otherwise the collapsed THP won't get PMD mapped by later
access.

>
> When faulted, we'll just do small page mappings while it won't stop the shmem
> thp from being mapped hugely in other mm, afaict.
>
> With that in mind, indeed I see this patch less necessary to be merged; so for
> sparsely wr-protected vmas like virtio-mem we can still keep some of the ranges
> mergeable, that sounds a good thing to keep it as-is.
>
> NACK myself for now: let's not lose that good property of both thp+uffd-wp so
> easily, and I'll think more of it.
>
> (To Axel: my question to minor mode handling thp still stands, I think..)
>
> Thanks,
>
> --
> Peter Xu
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
@ 2021-09-22 19:29       ` Yang Shi
  0 siblings, 0 replies; 21+ messages in thread
From: Yang Shi @ 2021-09-22 19:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linux MM,
	Andrew Morton, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Nadav Amit

On Wed, Sep 22, 2021 at 11:58 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 22, 2021 at 08:21:40PM +0200, David Hildenbrand wrote:
> > On 22.09.21 19:51, Peter Xu wrote:
> > > We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged
> > > scanning right after we detected a uffd-wp armed pte (either present, or swap).
> > >
> > > It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP
> > > enabled VMAs.  Checking against the vma flag would be more efficient, and good
> > > enough.  To be explicit, we could still be able to merge some thps for
> > > VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed
> > > ptes, however that's not a major target for thp collapse anyways.
> > >
> >
> > Hm, are we sure there are no users that could benefit from the current
> > handling?
> >
> > I'm thinking about long-term uffd-wp users that effectively end up wp-ing on
> > only a small fraction of a gigantic vma, or always wp complete blocks in a
> > certain granularity in the range of THP.
>
> Yes, that's a good question.
>
> >
> > Databases come to mind ...
>
> One thing to mention is that this patch didn't forbid thp being used within a
> uffd-wp-ed range.  E.g., we still allow thp exist, we can uffd-wp a thp and
> it'll split only until when the thp is written.
>
> While what this patch does is it stops khugepaged from proactively merging
> those small pages into thps as long as VM_UFFD_WP|VM_UFFD_MINOR is set.  It may
> still affect some user, but it's not a complete disable on thp.
>
> >
> > In the past, I played with the idea of using uffd-wp to protect access to
> > logically unplugged memory regions part of virtio-mem devices in QEMU --
> > which would exactly do something as described above. But I'll most probably
> > be using ordinary uffd once any users that might read such logically
> > unplugged memory have been "fixed".
>
> Yes, even if you'd like to keep using uffd-wp that sounds like a very
> reasonable scenario.
>
> >
> > The change itself looks sane to me AFAIKT.
>
> So one major motivation of this patch of mine is to prepare for shmem, because
> the old commit obviously only covered anonymous.
>
> But after a 2nd thought, I just noticed shmem shouldn't have a problem with
> khugepaged merging at all!
>
> The thing is, when khugepaged is merging a shmem thp, unlike anonymous, it'll
> not merge the ptes into a pmd, but it'll simply zap the ptes.  It means all
> uffd-wp tracking information won't be lost even if merging happened, those info
> will still be kept inside pgtables using (the upcoming) pte markers.

khugepqged does remove the pgtables. Please check out
retract_page_tables(). The pmd will be cleared and the ptes will be
freed otherwise the collapsed THP won't get PMD mapped by later
access.

>
> When faulted, we'll just do small page mappings while it won't stop the shmem
> thp from being mapped hugely in other mm, afaict.
>
> With that in mind, indeed I see this patch less necessary to be merged; so for
> sparsely wr-protected vmas like virtio-mem we can still keep some of the ranges
> mergeable, that sounds a good thing to keep it as-is.
>
> NACK myself for now: let's not lose that good property of both thp+uffd-wp so
> easily, and I'll think more of it.
>
> (To Axel: my question to minor mode handling thp still stands, I think..)
>
> Thanks,
>
> --
> Peter Xu
>
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 17:51 [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently Peter Xu
  2021-09-22 18:21 ` David Hildenbrand
@ 2021-09-22 19:33 ` Peter Xu
  2021-09-22 20:49   ` Axel Rasmussen
  2 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-09-22 19:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Nadav Amit

On Wed, Sep 22, 2021 at 01:51:56PM -0400, Peter Xu wrote:
> Axel: as I asked in the other thread, please help check whether minor mode will
> work properly with shmem thp enabled.  If not, I feel like this patch could be
> part of that effort at last, but it's also possible that I missed something.

Hmm, this seems to be a false-alarm too.

UFFDIO_CONTINUE is fine with thp because shmem_getpage() only returns small
pages.

Khugepaged is fine too on merging small shmem pages into thps, the same reason
as uffd-wp: it zaps ptes only, so previous pte_none() ptes will keep the same,
it makes sure all old pte_none ptes will not continue until a UFFDIO_CONTINUE.

The only last problem is when khugepaged merged small ptes into a thp, then it
could zap ptes even if they existed before.  So for minor mode fault, the uffd
service thread needs to be prepared for false positive messages.

That shouldn't be a problem either, because file-backed memory pgtables are
unstable and prone to lost, so uffd minor fault handler should always be
prepared for false positives anyway..

In summary: please feel free to ignore the above note, and sorry for the noise.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 19:29       ` Yang Shi
  (?)
@ 2021-09-22 20:04       ` Peter Xu
  2021-09-22 20:23         ` Peter Xu
  -1 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-09-22 20:04 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linux MM,
	Andrew Morton, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Nadav Amit

On Wed, Sep 22, 2021 at 12:29:35PM -0700, Yang Shi wrote:
> khugepqged does remove the pgtables. Please check out
> retract_page_tables(). The pmd will be cleared and the ptes will be
> freed otherwise the collapsed THP won't get PMD mapped by later
> access.

Indeed.

I should probably still properly disable khugepaged for at least VM_SHARED &&
VM_UFFD_WP, then I'd keep the anonymous && minor mode behavior untouched.

The other problem is even if current mm/vma doesn't have UFFD_WP registered,
some other mm/vma could have UFFD_WP enabled there that mapped the same file.
Checking that up within retract_page_tables() on all VMAs seems to be a bit too
late.

Checking it early may not trivially work too - I can walk the vma interval tree
at the entry of khugepaged_scan_file(), making sure no vma has UFFD_WP set.
However I don't see how it'll stop some of the vma from having UFFD_WP
registered later after that point but before retract_page_tables().

I'll need to think about it, but thanks for the input, Yang.  That's a very
important point.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 20:04       ` Peter Xu
@ 2021-09-22 20:23         ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-09-22 20:23 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linux MM,
	Andrew Morton, Andrea Arcangeli, Axel Rasmussen, Hugh Dickins,
	Nadav Amit

On Wed, Sep 22, 2021 at 04:04:44PM -0400, Peter Xu wrote:
> On Wed, Sep 22, 2021 at 12:29:35PM -0700, Yang Shi wrote:
> > khugepqged does remove the pgtables. Please check out
> > retract_page_tables(). The pmd will be cleared and the ptes will be
> > freed otherwise the collapsed THP won't get PMD mapped by later
> > access.
> 
> Indeed.
> 
> I should probably still properly disable khugepaged for at least VM_SHARED &&
> VM_UFFD_WP, then I'd keep the anonymous && minor mode behavior untouched.
> 
> The other problem is even if current mm/vma doesn't have UFFD_WP registered,
> some other mm/vma could have UFFD_WP enabled there that mapped the same file.
> Checking that up within retract_page_tables() on all VMAs seems to be a bit too
> late.
> 
> Checking it early may not trivially work too - I can walk the vma interval tree
> at the entry of khugepaged_scan_file(), making sure no vma has UFFD_WP set.
> However I don't see how it'll stop some of the vma from having UFFD_WP
> registered later after that point but before retract_page_tables().
> 
> I'll need to think about it, but thanks for the input, Yang.  That's a very
> important point.

Perhaps I need something like this:

---8<---
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 045cc579f724..c63e957336d1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1583,6 +1583,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
                pmd = mm_find_pmd(mm, addr);
                if (!pmd)
                        continue;
+               /*
+                * When a vma is registered with uffd-wp, we can't recycle the
+                * pmd pgtable because there can be pte markers installed.
+                * Skip it only, so the rest mm/vma can still have the same
+                * file mapped hugely, however it'll always mapped in small
+                * page size for uffd-wp registered ranges.
+                */
+               if (userfaultfd_wp(vma))
+                       continue;
                /*
                 * We need exclusive mmap_lock to retract page table.
                 *
---8<---

I won't post a v2 because then that patch will be shmem-only and uffd-wp-only.
I'll keep it with the upcoming series I'm going to post to support shmem.

Thanks,

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 17:51 [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently Peter Xu
@ 2021-09-22 20:49   ` Axel Rasmussen
  2021-09-22 19:33 ` Peter Xu
  2021-09-22 20:49   ` Axel Rasmussen
  2 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-09-22 20:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: LKML, Linux MM, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Nadav Amit

On Wed, Sep 22, 2021 at 10:52 AM Peter Xu <peterx@redhat.com> wrote:
>
> We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged
> scanning right after we detected a uffd-wp armed pte (either present, or swap).
>
> It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP
> enabled VMAs.  Checking against the vma flag would be more efficient, and good
> enough.  To be explicit, we could still be able to merge some thps for
> VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed
> ptes, however that's not a major target for thp collapse anyways.
>
> This mostly reverts commit e1e267c7928fe387e5e1cffeafb0de2d0473663a, but
> instead we do the same check at vma level, so it's not a bugfix.
>
> This also paves the way for file-backed uffd-wp support, as the VM_UFFD_WP flag
> will work for file-backed too.
>
> After this patch, the error for khugepaged for these regions will switch from
> SCAN_PTE_UFFD_WP to SCAN_VMA_CHECK.
>
> Since uffd minor mode should not allow thp as well, do the same thing for minor
> mode to stop early on trying to collapse pages in khugepaged.
>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>
> Axel: as I asked in the other thread, please help check whether minor mode will
> work properly with shmem thp enabled.  If not, I feel like this patch could be
> part of that effort at last, but it's also possible that I missed something.

Sorry for missing the other thread.

Unfortunately, I think shmem THP *doesn't* really work with minor
faults, and what's worse, just checking the VMA flag isn't enough.

First, let me note the guarantee UFFD minor faults are trying to
provide: for a given mapping, any minor fault (that is, pte_none() but
a page is present in the page cache) must result in a minor userfault
event. Furthermore, the only way the fault may be resolved (i.e., a
PTE installed) is via a UFFDIO_CONTINUE ioctl from userspace.

A typical use case for minor faults is, we have two mappings (i.e.,
two VMAs), both pointing to the same underlying physical memory. It's
typical for both to have MAP_SHARED. It's typical for one of these
mappings to be fully faulted in (i.e., all of its PTEs exist), while
the other one has some missing PTEs. The problem is, khugepaged might
scan *either* of the two mappings. Say it picks the fully-faulted VMA:
even if we set khugepaged_max_ptes_none to zero, it will still go
ahead and collapse these pages - because *this* VMA has no missing
PTEs.

Why is this a problem? When we collapse, we install a PMD, for *all*
VMAs which reference these pages. In other words, we might install
PTEs for the other, minor-fault-registered mapping, and therefore
userfaults will never trigger for some of those regions, even though
userspace never UFFDIO_CONTINUE-ed them.

I *think* the right place to check for this and solve it is in
retract_page_tables(), and I have a patch which does this. I've been
hesitant to send it though, as due to a lack of time and the
complexity involved I haven't been able to write a clear reproducer
program, which my patch clearly fixes. :/

>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/trace/events/huge_memory.h |  1 -
>  mm/khugepaged.c                    | 26 +++-----------------------
>  2 files changed, 3 insertions(+), 24 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4fdb14a81108..53532f5925c3 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -15,7 +15,6 @@
>         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
>         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
>         EM( SCAN_PTE_NON_PRESENT,       "pte_non_present")              \
> -       EM( SCAN_PTE_UFFD_WP,           "pte_uffd_wp")                  \
>         EM( SCAN_PAGE_RO,               "no_writable_page")             \
>         EM( SCAN_LACK_REFERENCED_PAGE,  "lack_referenced_page")         \
>         EM( SCAN_PAGE_NULL,             "page_null")                    \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 045cc579f724..3afe66d48db0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -31,7 +31,6 @@ enum scan_result {
>         SCAN_EXCEED_SWAP_PTE,
>         SCAN_EXCEED_SHARED_PTE,
>         SCAN_PTE_NON_PRESENT,
> -       SCAN_PTE_UFFD_WP,
>         SCAN_PAGE_RO,
>         SCAN_LACK_REFERENCED_PAGE,
>         SCAN_PAGE_NULL,
> @@ -467,6 +466,9 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
>                 return false;
>         if (vma_is_temporary_stack(vma))
>                 return false;
> +       /* Don't allow thp merging for wp/minor enabled uffd regions */
> +       if (userfaultfd_wp(vma) || userfaultfd_minor(vma))
> +               return false;
>         return !(vm_flags & VM_NO_KHUGEPAGED);
>  }
>
> @@ -1246,15 +1248,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                 pte_t pteval = *_pte;
>                 if (is_swap_pte(pteval)) {
>                         if (++unmapped <= khugepaged_max_ptes_swap) {
> -                               /*
> -                                * Always be strict with uffd-wp
> -                                * enabled swap entries.  Please see
> -                                * comment below for pte_uffd_wp().
> -                                */
> -                               if (pte_swp_uffd_wp(pteval)) {
> -                                       result = SCAN_PTE_UFFD_WP;
> -                                       goto out_unmap;
> -                               }
>                                 continue;
>                         } else {
>                                 result = SCAN_EXCEED_SWAP_PTE;
> @@ -1270,19 +1263,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                                 goto out_unmap;
>                         }
>                 }
> -               if (pte_uffd_wp(pteval)) {
> -                       /*
> -                        * Don't collapse the page if any of the small
> -                        * PTEs are armed with uffd write protection.
> -                        * Here we can also mark the new huge pmd as
> -                        * write protected if any of the small ones is
> -                        * marked but that could bring unknown
> -                        * userfault messages that falls outside of
> -                        * the registered range.  So, just be simple.
> -                        */
> -                       result = SCAN_PTE_UFFD_WP;
> -                       goto out_unmap;
> -               }
>                 if (pte_write(pteval))
>                         writable = true;
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
@ 2021-09-22 20:49   ` Axel Rasmussen
  0 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-09-22 20:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: LKML, Linux MM, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Nadav Amit

On Wed, Sep 22, 2021 at 10:52 AM Peter Xu <peterx@redhat.com> wrote:
>
> We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged
> scanning right after we detected a uffd-wp armed pte (either present, or swap).
>
> It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP
> enabled VMAs.  Checking against the vma flag would be more efficient, and good
> enough.  To be explicit, we could still be able to merge some thps for
> VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed
> ptes, however that's not a major target for thp collapse anyways.
>
> This mostly reverts commit e1e267c7928fe387e5e1cffeafb0de2d0473663a, but
> instead we do the same check at vma level, so it's not a bugfix.
>
> This also paves the way for file-backed uffd-wp support, as the VM_UFFD_WP flag
> will work for file-backed too.
>
> After this patch, the error for khugepaged for these regions will switch from
> SCAN_PTE_UFFD_WP to SCAN_VMA_CHECK.
>
> Since uffd minor mode should not allow thp as well, do the same thing for minor
> mode to stop early on trying to collapse pages in khugepaged.
>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>
> Axel: as I asked in the other thread, please help check whether minor mode will
> work properly with shmem thp enabled.  If not, I feel like this patch could be
> part of that effort at last, but it's also possible that I missed something.

Sorry for missing the other thread.

Unfortunately, I think shmem THP *doesn't* really work with minor
faults, and what's worse, just checking the VMA flag isn't enough.

First, let me note the guarantee UFFD minor faults are trying to
provide: for a given mapping, any minor fault (that is, pte_none() but
a page is present in the page cache) must result in a minor userfault
event. Furthermore, the only way the fault may be resolved (i.e., a
PTE installed) is via a UFFDIO_CONTINUE ioctl from userspace.

A typical use case for minor faults is, we have two mappings (i.e.,
two VMAs), both pointing to the same underlying physical memory. It's
typical for both to have MAP_SHARED. It's typical for one of these
mappings to be fully faulted in (i.e., all of its PTEs exist), while
the other one has some missing PTEs. The problem is, khugepaged might
scan *either* of the two mappings. Say it picks the fully-faulted VMA:
even if we set khugepaged_max_ptes_none to zero, it will still go
ahead and collapse these pages - because *this* VMA has no missing
PTEs.

Why is this a problem? When we collapse, we install a PMD, for *all*
VMAs which reference these pages. In other words, we might install
PTEs for the other, minor-fault-registered mapping, and therefore
userfaults will never trigger for some of those regions, even though
userspace never UFFDIO_CONTINUE-ed them.

I *think* the right place to check for this and solve it is in
retract_page_tables(), and I have a patch which does this. I've been
hesitant to send it though, as due to a lack of time and the
complexity involved I haven't been able to write a clear reproducer
program, which my patch clearly fixes. :/

>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/trace/events/huge_memory.h |  1 -
>  mm/khugepaged.c                    | 26 +++-----------------------
>  2 files changed, 3 insertions(+), 24 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 4fdb14a81108..53532f5925c3 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -15,7 +15,6 @@
>         EM( SCAN_EXCEED_SWAP_PTE,       "exceed_swap_pte")              \
>         EM( SCAN_EXCEED_SHARED_PTE,     "exceed_shared_pte")            \
>         EM( SCAN_PTE_NON_PRESENT,       "pte_non_present")              \
> -       EM( SCAN_PTE_UFFD_WP,           "pte_uffd_wp")                  \
>         EM( SCAN_PAGE_RO,               "no_writable_page")             \
>         EM( SCAN_LACK_REFERENCED_PAGE,  "lack_referenced_page")         \
>         EM( SCAN_PAGE_NULL,             "page_null")                    \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 045cc579f724..3afe66d48db0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -31,7 +31,6 @@ enum scan_result {
>         SCAN_EXCEED_SWAP_PTE,
>         SCAN_EXCEED_SHARED_PTE,
>         SCAN_PTE_NON_PRESENT,
> -       SCAN_PTE_UFFD_WP,
>         SCAN_PAGE_RO,
>         SCAN_LACK_REFERENCED_PAGE,
>         SCAN_PAGE_NULL,
> @@ -467,6 +466,9 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
>                 return false;
>         if (vma_is_temporary_stack(vma))
>                 return false;
> +       /* Don't allow thp merging for wp/minor enabled uffd regions */
> +       if (userfaultfd_wp(vma) || userfaultfd_minor(vma))
> +               return false;
>         return !(vm_flags & VM_NO_KHUGEPAGED);
>  }
>
> @@ -1246,15 +1248,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                 pte_t pteval = *_pte;
>                 if (is_swap_pte(pteval)) {
>                         if (++unmapped <= khugepaged_max_ptes_swap) {
> -                               /*
> -                                * Always be strict with uffd-wp
> -                                * enabled swap entries.  Please see
> -                                * comment below for pte_uffd_wp().
> -                                */
> -                               if (pte_swp_uffd_wp(pteval)) {
> -                                       result = SCAN_PTE_UFFD_WP;
> -                                       goto out_unmap;
> -                               }
>                                 continue;
>                         } else {
>                                 result = SCAN_EXCEED_SWAP_PTE;
> @@ -1270,19 +1263,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>                                 goto out_unmap;
>                         }
>                 }
> -               if (pte_uffd_wp(pteval)) {
> -                       /*
> -                        * Don't collapse the page if any of the small
> -                        * PTEs are armed with uffd write protection.
> -                        * Here we can also mark the new huge pmd as
> -                        * write protected if any of the small ones is
> -                        * marked but that could bring unknown
> -                        * userfault messages that falls outside of
> -                        * the registered range.  So, just be simple.
> -                        */
> -                       result = SCAN_PTE_UFFD_WP;
> -                       goto out_unmap;
> -               }
>                 if (pte_write(pteval))
>                         writable = true;
>
> --
> 2.31.1
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 20:49   ` Axel Rasmussen
  (?)
@ 2021-09-22 21:20   ` Peter Xu
  2021-09-22 23:18       ` Hugh Dickins
  -1 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-09-22 21:20 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: LKML, Linux MM, Andrew Morton, Andrea Arcangeli, Hugh Dickins,
	Nadav Amit

On Wed, Sep 22, 2021 at 01:49:42PM -0700, Axel Rasmussen wrote:
> Sorry for missing the other thread.

Heh, you didn't miss the other thread; I just posted both of the emails in a
few hours. :)

> 
> Unfortunately, I think shmem THP *doesn't* really work with minor
> faults, and what's worse, just checking the VMA flag isn't enough.

As I replied to myself (sorry to have done that), now I think minor mode is
fine, but let's see what else I've missed, which is possible...  Please see
below.

> 
> First, let me note the guarantee UFFD minor faults are trying to
> provide: for a given mapping, any minor fault (that is, pte_none() but
> a page is present in the page cache) must result in a minor userfault
> event. Furthermore, the only way the fault may be resolved (i.e., a
> PTE installed) is via a UFFDIO_CONTINUE ioctl from userspace.

Yes.

> 
> A typical use case for minor faults is, we have two mappings (i.e.,
> two VMAs), both pointing to the same underlying physical memory. It's
> typical for both to have MAP_SHARED. It's typical for one of these
> mappings to be fully faulted in (i.e., all of its PTEs exist), while
> the other one has some missing PTEs. The problem is, khugepaged might
> scan *either* of the two mappings. Say it picks the fully-faulted VMA:
> even if we set khugepaged_max_ptes_none to zero, it will still go
> ahead and collapse these pages - because *this* VMA has no missing
> PTEs.

Yes.

> 
> Why is this a problem? When we collapse, we install a PMD, for *all*
> VMAs which reference these pages. In other words, we might install
> PTEs for the other, minor-fault-registered mapping, and therefore
> userfaults will never trigger for some of those regions, even though
> userspace never UFFDIO_CONTINUE-ed them.

Nop - we don't install PMD for file-backed, do we?

Please see khugepaged_scan_pmd() - that one installs PMDs indeed, but it's
anonymous-only code.

Then please also see khugepaged_scan_file() - that one handles file-backed
(aka, shmem), and it does _not_ install pmd, afaict.  The installation is lazy.

Not installing pmd means uffd-minor can still trap any further faults just like
before, afaiu.

There's a very trivial detail that the pmd missing case will have a very slight
code path change when the next page fault happens: in __handle_mm_fault() we'll
first try to go into create_huge_pmd() once, however since shmem didn't provide
huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like
before when faulting on a small pte.  The next UFFDIO_CONTINUE will allocate
that missing pmd again, however it'll install a 4K page only.

> 
> I *think* the right place to check for this and solve it is in
> retract_page_tables(), and I have a patch which does this. I've been
> hesitant to send it though, as due to a lack of time and the
> complexity involved I haven't been able to write a clear reproducer
> program, which my patch clearly fixes. :/

Yes retract_page_tables() could drop pmd pgtable for minor fault, but IMHO it's
fine too as mentioned above.

Minor mode should only care about trapping the page fault when the next access
comes.  retract_page_tables() will wipe the pmd pgtable page, that's not fine
for uffd-wp, but IMHO that's still very fine for minor mode as it will keep
trapping the old missing ptes; the difference is it'll just generate even more
traps (rather than on the pte holes only, now it'll generate one message for
each 4k over the merged 2M).

As I mentioned in the other thread, I think that'll cause false positive minor
fault messages, but IMHO that's fine, and minor fault userspace should always
need to handle that.

I fully agree with you that a reproducer would be very nice to try.  So if my
understanding is correct, the reproducer won't really fail on minor mode in any
way, but it'll just need to be prepared to receive more messages than it should.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 21:20   ` Peter Xu
@ 2021-09-22 23:18       ` Hugh Dickins
  0 siblings, 0 replies; 21+ messages in thread
From: Hugh Dickins @ 2021-09-22 23:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Axel Rasmussen, LKML, Linux MM, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Nadav Amit

On Wed, 22 Sep 2021, Peter Xu wrote:
> 
> Not installing pmd means uffd-minor can still trap any further faults just like
> before, afaiu.
> 
> There's a very trivial detail that the pmd missing case will have a very slight
> code path change when the next page fault happens: in __handle_mm_fault() we'll
> first try to go into create_huge_pmd() once, however since shmem didn't provide
> huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like
> before when faulting on a small pte.  The next UFFDIO_CONTINUE will allocate
> that missing pmd again, however it'll install a 4K page only.

I think you're mistaken there.

I can't tell you much about ->huge_fault(), something introduced for
DAX I believe; but shmem has managed pmd mappings without it, since
before ->huge_fault() was ever added.

Look for the call to do_set_pmd() in finish_fault(): I think you'll
find that is the way shmem's huge pmds get in.

Earlier in the thread you suggested "shmem_getpage() only returns
small pages": but it can very well return PageTransCompound pages,
head or tail, which arrive at this do_set_pmd().

Hugh


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
@ 2021-09-22 23:18       ` Hugh Dickins
  0 siblings, 0 replies; 21+ messages in thread
From: Hugh Dickins @ 2021-09-22 23:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Axel Rasmussen, LKML, Linux MM, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Nadav Amit

On Wed, 22 Sep 2021, Peter Xu wrote:
> 
> Not installing pmd means uffd-minor can still trap any further faults just like
> before, afaiu.
> 
> There's a very trivial detail that the pmd missing case will have a very slight
> code path change when the next page fault happens: in __handle_mm_fault() we'll
> first try to go into create_huge_pmd() once, however since shmem didn't provide
> huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like
> before when faulting on a small pte.  The next UFFDIO_CONTINUE will allocate
> that missing pmd again, however it'll install a 4K page only.

I think you're mistaken there.

I can't tell you much about ->huge_fault(), something introduced for
DAX I believe; but shmem has managed pmd mappings without it, since
before ->huge_fault() was ever added.

Look for the call to do_set_pmd() in finish_fault(): I think you'll
find that is the way shmem's huge pmds get in.

Earlier in the thread you suggested "shmem_getpage() only returns
small pages": but it can very well return PageTransCompound pages,
head or tail, which arrive at this do_set_pmd().

Hugh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 23:18       ` Hugh Dickins
  (?)
@ 2021-09-22 23:44       ` Peter Xu
  2021-09-23  1:22           ` Hugh Dickins
  -1 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-09-22 23:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Axel Rasmussen, LKML, Linux MM, Andrew Morton, Andrea Arcangeli,
	Nadav Amit

On Wed, Sep 22, 2021 at 04:18:09PM -0700, Hugh Dickins wrote:
> On Wed, 22 Sep 2021, Peter Xu wrote:
> > 
> > Not installing pmd means uffd-minor can still trap any further faults just like
> > before, afaiu.
> > 
> > There's a very trivial detail that the pmd missing case will have a very slight
> > code path change when the next page fault happens: in __handle_mm_fault() we'll
> > first try to go into create_huge_pmd() once, however since shmem didn't provide
> > huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like
> > before when faulting on a small pte.  The next UFFDIO_CONTINUE will allocate
> > that missing pmd again, however it'll install a 4K page only.
> 
> I think you're mistaken there.
> 
> I can't tell you much about ->huge_fault(), something introduced for
> DAX I believe; but shmem has managed pmd mappings without it, since
> before ->huge_fault() was ever added.

Right, I wanted to express we didn't go into there, hence no way to allocate
pmd there.

> 
> Look for the call to do_set_pmd() in finish_fault(): I think you'll
> find that is the way shmem's huge pmds get in.
> 
> Earlier in the thread you suggested "shmem_getpage() only returns
> small pages": but it can very well return PageTransCompound pages,
> head or tail, which arrive at this do_set_pmd().

But note that uffd-minor will trap the shmem fault() even if pmd_none:

	page = pagecache_get_page(mapping, index,
					FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);

	if (page && vma && userfaultfd_minor(vma)) {
		if (!xa_is_value(page)) {
			unlock_page(page);
			put_page(page);
		}
		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
		return 0;
	}

That's why I think it'll be fine, because it should only be UFFDIO_CONTINUE
that installs the pte (alongside with allocating the pmd).

Or did I miss something?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 23:44       ` Peter Xu
@ 2021-09-23  1:22           ` Hugh Dickins
  0 siblings, 0 replies; 21+ messages in thread
From: Hugh Dickins @ 2021-09-23  1:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Axel Rasmussen, LKML, Linux MM, Andrew Morton,
	Andrea Arcangeli, Nadav Amit

On Wed, 22 Sep 2021, Peter Xu wrote:
> On Wed, Sep 22, 2021 at 04:18:09PM -0700, Hugh Dickins wrote:
> > On Wed, 22 Sep 2021, Peter Xu wrote:
> > > 
> > > Not installing pmd means uffd-minor can still trap any further faults just like
> > > before, afaiu.
> > > 
> > > There's a very trivial detail that the pmd missing case will have a very slight
> > > code path change when the next page fault happens: in __handle_mm_fault() we'll
> > > first try to go into create_huge_pmd() once, however since shmem didn't provide
> > > huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like
> > > before when faulting on a small pte.  The next UFFDIO_CONTINUE will allocate
> > > that missing pmd again, however it'll install a 4K page only.
> > 
> > I think you're mistaken there.
> > 
> > I can't tell you much about ->huge_fault(), something introduced for
> > DAX I believe; but shmem has managed pmd mappings without it, since
> > before ->huge_fault() was ever added.
> 
> Right, I wanted to express we didn't go into there, hence no way to allocate
> pmd there.
> 
> > 
> > Look for the call to do_set_pmd() in finish_fault(): I think you'll
> > find that is the way shmem's huge pmds get in.
> > 
> > Earlier in the thread you suggested "shmem_getpage() only returns
> > small pages": but it can very well return PageTransCompound pages,
> > head or tail, which arrive at this do_set_pmd().
> 
> But note that uffd-minor will trap the shmem fault() even if pmd_none:
> 
> 	page = pagecache_get_page(mapping, index,
> 					FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
> 
> 	if (page && vma && userfaultfd_minor(vma)) {
> 		if (!xa_is_value(page)) {
> 			unlock_page(page);
> 			put_page(page);
> 		}
> 		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
> 		return 0;
> 	}
> 
> That's why I think it'll be fine, because it should only be UFFDIO_CONTINUE
> that installs the pte (alongside with allocating the pmd).
> 
> Or did I miss something?

No, I think I misunderstood you before: thanks for re-explaining.
(And Axel's !userfaultfd_minor() check before calling do_fault_around()
plays an important part in making sure that it does reach shmem_fault().)

Hugh

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
@ 2021-09-23  1:22           ` Hugh Dickins
  0 siblings, 0 replies; 21+ messages in thread
From: Hugh Dickins @ 2021-09-23  1:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Axel Rasmussen, LKML, Linux MM, Andrew Morton,
	Andrea Arcangeli, Nadav Amit

On Wed, 22 Sep 2021, Peter Xu wrote:
> On Wed, Sep 22, 2021 at 04:18:09PM -0700, Hugh Dickins wrote:
> > On Wed, 22 Sep 2021, Peter Xu wrote:
> > > 
> > > Not installing pmd means uffd-minor can still trap any further faults just like
> > > before, afaiu.
> > > 
> > > There's a very trivial detail that the pmd missing case will have a very slight
> > > code path change when the next page fault happens: in __handle_mm_fault() we'll
> > > first try to go into create_huge_pmd() once, however since shmem didn't provide
> > > huge_fault(), we'll go the VM_FAULT_FALLBACK path, and things will go like
> > > before when faulting on a small pte.  The next UFFDIO_CONTINUE will allocate
> > > that missing pmd again, however it'll install a 4K page only.
> > 
> > I think you're mistaken there.
> > 
> > I can't tell you much about ->huge_fault(), something introduced for
> > DAX I believe; but shmem has managed pmd mappings without it, since
> > before ->huge_fault() was ever added.
> 
> Right, I wanted to express we didn't go into there, hence no way to allocate
> pmd there.
> 
> > 
> > Look for the call to do_set_pmd() in finish_fault(): I think you'll
> > find that is the way shmem's huge pmds get in.
> > 
> > Earlier in the thread you suggested "shmem_getpage() only returns
> > small pages": but it can very well return PageTransCompound pages,
> > head or tail, which arrive at this do_set_pmd().
> 
> But note that uffd-minor will trap the shmem fault() even if pmd_none:
> 
> 	page = pagecache_get_page(mapping, index,
> 					FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
> 
> 	if (page && vma && userfaultfd_minor(vma)) {
> 		if (!xa_is_value(page)) {
> 			unlock_page(page);
> 			put_page(page);
> 		}
> 		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
> 		return 0;
> 	}
> 
> That's why I think it'll be fine, because it should only be UFFDIO_CONTINUE
> that installs the pte (alongside with allocating the pmd).
> 
> Or did I miss something?

No, I think I misunderstood you before: thanks for re-explaining.
(And Axel's !userfaultfd_minor() check before calling do_fault_around()
plays an important part in making sure that it does reach shmem_fault().)

Hugh


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-23  1:22           ` Hugh Dickins
  (?)
@ 2021-09-23  2:18           ` Peter Xu
  2021-09-23 16:47               ` Axel Rasmussen
  -1 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2021-09-23  2:18 UTC (permalink / raw)
  To: Hugh Dickins, Axel Rasmussen
  Cc: Axel Rasmussen, LKML, Linux MM, Andrew Morton, Andrea Arcangeli,
	Nadav Amit

[-- Attachment #1: Type: text/plain, Size: 1201 bytes --]

On Wed, Sep 22, 2021 at 06:22:45PM -0700, Hugh Dickins wrote:
> No, I think I misunderstood you before: thanks for re-explaining.
> (And Axel's !userfaultfd_minor() check before calling do_fault_around()
> plays an important part in making sure that it does reach shmem_fault().)

Still thanks for confirming this, Hugh.

Said that, Axel, I didn't mean I'm against doing something similar like
uffd-wp; it's just a heads-up that maybe you won't find a reproducer with real
issues with minor mode.

Even if I think minor mode should be fine with current code, we could still
choose to disable khugepaged from removing the pmd for VM_UFFD_MINOR vmas, just
like what we'll do with VM_UFFD_WP.  At least it can still reduce false
positives.

So far in my local branch I queued the patch which I attached, that's required
for uffd-wp shmem afaict.  If you think minor mode would like that too, I can
post it separately with minor mode added in.

Note that it's slightly different from what I pasted in reply to Yang Shi - I
made it slightly more complicated just to make sure there's no race.  I
mentioned the possible race (I think) in the commit log.

Let me know your preference.

Thanks,

-- 
Peter Xu

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 2697 bytes --]

From 989d36914ac144177e17f9aacbf2785bb8f21420 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 22 Sep 2021 16:23:33 -0400
Subject: [PATCH] mm/khugepaged: Don't recycle vma pgtable if uffd-wp
 registered

When we're trying to collapse a 2M huge shmem page, don't retract pgtable pmd
page if it's registered with uffd-wp, because that pgtable could have pte
markers installed.  Recycling of that pgtable means we'll lose the pte markers.
That could cause data loss for an uffd-wp enabled application on shmem.

Instead of disabling khugepaged on these files, simply skip retracting these
special VMAs, then the page cache can still be merged into a huge thp, and
other mm/vma can still map the range of file with a huge thp when proper.

Note that checking VM_UFFD_WP needs to be done with mmap_sem held for write,
that avoids race like:

         khugepaged                             user thread
         ==========                             ===========
     check VM_UFFD_WP, not set
                                       UFFDIO_REGISTER with uffd-wp on shmem
                                       wr-protect some pages (install markers)
     take mmap_sem write lock
     erase pmd and free pmd page
      --> pte markers are dropped unnoticed!

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/khugepaged.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 045cc579f724..23e1d03156b3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1451,6 +1451,10 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
 		return;
 
+	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
+	if (userfaultfd_wp(vma))
+		return;
+
 	hpage = find_lock_page(vma->vm_file->f_mapping,
 			       linear_page_index(vma, haddr));
 	if (!hpage)
@@ -1591,7 +1595,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * reverse order. Trylock is a way to avoid deadlock.
 		 */
 		if (mmap_write_trylock(mm)) {
-			if (!khugepaged_test_exit(mm)) {
+			/*
+			 * When a vma is registered with uffd-wp, we can't
+			 * recycle the pmd pgtable because there can be pte
+			 * markers installed.  Skip it only, so the rest mm/vma
+			 * can still have the same file mapped hugely, however
+			 * it'll always mapped in small page size for uffd-wp
+			 * registered ranges.
+			 */
+			if (!khugepaged_test_exit(mm) && !userfaultfd_wp(vma)) {
 				spinlock_t *ptl = pmd_lock(mm, pmd);
 				/* assume page table is clear */
 				_pmd = pmdp_collapse_flush(vma, addr, pmd);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-23  2:18           ` Peter Xu
@ 2021-09-23 16:47               ` Axel Rasmussen
  0 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-09-23 16:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, LKML, Linux MM, Andrew Morton, Andrea Arcangeli,
	Nadav Amit

On Wed, Sep 22, 2021 at 7:18 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 22, 2021 at 06:22:45PM -0700, Hugh Dickins wrote:
> > No, I think I misunderstood you before: thanks for re-explaining.
> > (And Axel's !userfaultfd_minor() check before calling do_fault_around()
> > plays an important part in making sure that it does reach shmem_fault().)
>
> Still thanks for confirming this, Hugh.
>
> Said that, Axel, I didn't mean I'm against doing something similar like
> uffd-wp; it's just a heads-up that maybe you won't find a reproducer with real
> issues with minor mode.
>
> Even if I think minor mode should be fine with current code, we could still
> choose to disable khugepaged from removing the pmd for VM_UFFD_MINOR vmas, just
> like what we'll do with VM_UFFD_WP.  At least it can still reduce false
> positives.
>
> So far in my local branch I queued the patch which I attached, that's required
> for uffd-wp shmem afaict.  If you think minor mode would like that too, I can
> post it separately with minor mode added in.

No worries, you can leave the minor fault case to me.

My thinking there was a THP collapse bug was really just based on
speculation, not a real reproducer, so it's very possible my
speculation was wrong. It will take some more thinking and reading to
convince myself one way or the other. :) Thanks to you and Hugh for
all the details.

I'd prefer not to add this fix "just in case", if it isn't a real
problem, as it seems like it may confuse future readers of the code.

I'll send out a patch for it if / when I manage to build a real
reproducer. Or, in the meantime, some of my Google colleagues are
testing this code via their live migration implementation, so if there
is a bug here there's a good chance we'll find it that way too.

>
> Note that it's slightly different from what I pasted in reply to Yang Shi - I
> made it slightly more complicated just to make sure there's no race.  I
> mentioned the possible race (I think) in the commit log.
>
> Let me know your preference.
>
> Thanks,
>
> --
> Peter Xu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
@ 2021-09-23 16:47               ` Axel Rasmussen
  0 siblings, 0 replies; 21+ messages in thread
From: Axel Rasmussen @ 2021-09-23 16:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, LKML, Linux MM, Andrew Morton, Andrea Arcangeli,
	Nadav Amit

On Wed, Sep 22, 2021 at 7:18 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 22, 2021 at 06:22:45PM -0700, Hugh Dickins wrote:
> > No, I think I misunderstood you before: thanks for re-explaining.
> > (And Axel's !userfaultfd_minor() check before calling do_fault_around()
> > plays an important part in making sure that it does reach shmem_fault().)
>
> Still thanks for confirming this, Hugh.
>
> Said that, Axel, I didn't mean I'm against doing something similar like
> uffd-wp; it's just a heads-up that maybe you won't find a reproducer with real
> issues with minor mode.
>
> Even if I think minor mode should be fine with current code, we could still
> choose to disable khugepaged from removing the pmd for VM_UFFD_MINOR vmas, just
> like what we'll do with VM_UFFD_WP.  At least it can still reduce false
> positives.
>
> So far in my local branch I queued the patch which I attached, that's required
> for uffd-wp shmem afaict.  If you think minor mode would like that too, I can
> post it separately with minor mode added in.

No worries, you can leave the minor fault case to me.

My thinking there was a THP collapse bug was really just based on
speculation, not a real reproducer, so it's very possible my
speculation was wrong. It will take some more thinking and reading to
convince myself one way or the other. :) Thanks to you and Hugh for
all the details.

I'd prefer not to add this fix "just in case", if it isn't a real
problem, as it seems like it may confuse future readers of the code.

I'll send out a patch for it if / when I manage to build a real
reproducer. Or, in the meantime, some of my Google colleagues are
testing this code via their live migration implementation, so if there
is a bug here there's a good chance we'll find it that way too.

>
> Note that it's slightly different from what I pasted in reply to Yang Shi - I
> made it slightly more complicated just to make sure there's no race.  I
> mentioned the possible race (I think) in the commit log.
>
> Let me know your preference.
>
> Thanks,
>
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-23 16:47               ` Axel Rasmussen
  (?)
@ 2021-09-23 17:53               ` Peter Xu
  -1 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2021-09-23 17:53 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Hugh Dickins, LKML, Linux MM, Andrew Morton, Andrea Arcangeli,
	Nadav Amit

On Thu, Sep 23, 2021 at 09:47:42AM -0700, Axel Rasmussen wrote:
> My thinking there was a THP collapse bug was really just based on
> speculation, not a real reproducer, so it's very possible my
> speculation was wrong. It will take some more thinking and reading to
> convince myself one way or the other. :) Thanks to you and Hugh for
> all the details.
> 
> I'd prefer not to add this fix "just in case", if it isn't a real
> problem, as it seems like it may confuse future readers of the code.

It's not "just in case" to me - IMHO it's theoretically causing more false
positives as I used to mention, at least that's my understanding so far. So if
the theory is correct it'll 100% happen when khugepaged merged some
minor-registered regions.

Uffd-wp could have many false positives like this if we don't support swap - at
last we decided to fully support swap then we removed all the false positives
regarding swapping.  I think it's similar here, but khugepaged should trigger
much less frequently on the false positives upon uffd-minor, than swapping upon
uffd-wp.

But yes, there's definitely no rush on thinking or anything - it'll never hurt
to think more. And more importantly, verify it with some test program would be
great; after all theoretically it'll just work like a charm to me.

> 
> I'll send out a patch for it if / when I manage to build a real
> reproducer. Or, in the meantime, some of my Google colleagues are
> testing this code via their live migration implementation, so if there
> is a bug here there's a good chance we'll find it that way too.

Sounds like a good plan.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently
  2021-09-22 18:58   ` Peter Xu
  2021-09-22 19:29       ` Yang Shi
@ 2021-09-24 10:05     ` David Hildenbrand
  1 sibling, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2021-09-24 10:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Andrew Morton, Andrea Arcangeli,
	Axel Rasmussen, Hugh Dickins, Nadav Amit

On 22.09.21 20:58, Peter Xu wrote:
> On Wed, Sep 22, 2021 at 08:21:40PM +0200, David Hildenbrand wrote:
>> On 22.09.21 19:51, Peter Xu wrote:
>>> We forbid merging thps for uffd-wp enabled regions, by breaking the khugepaged
>>> scanning right after we detected a uffd-wp armed pte (either present, or swap).
>>>
>>> It works, but it's less efficient, because those ptes only exist for VM_UFFD_WP
>>> enabled VMAs.  Checking against the vma flag would be more efficient, and good
>>> enough.  To be explicit, we could still be able to merge some thps for
>>> VM_UFFD_WP regions before this patch as long as they have zero uffd-wp armed
>>> ptes, however that's not a major target for thp collapse anyways.
>>>
>>
>> Hm, are we sure there are no users that could benefit from the current
>> handling?
>>
>> I'm thinking about long-term uffd-wp users that effectively end up wp-ing on
>> only a small fraction of a gigantic vma, or always wp complete blocks in a
>> certain granularity in the range of THP.
> 
> Yes, that's a good question.
> 
>>
>> Databases come to mind ...
> 
> One thing to mention is that this patch didn't forbid thp being used within a
> uffd-wp-ed range.  E.g., we still allow thp exist, we can uffd-wp a thp and
> it'll split only until when the thp is written.
> 
> While what this patch does is it stops khugepaged from proactively merging
> those small pages into thps as long as VM_UFFD_WP|VM_UFFD_MINOR is set.  It may
> still affect some user, but it's not a complete disable on thp.
> 
>>
>> In the past, I played with the idea of using uffd-wp to protect access to
>> logically unplugged memory regions part of virtio-mem devices in QEMU --
>> which would exactly do something as described above. But I'll most probably
>> be using ordinary uffd once any users that might read such logically
>> unplugged memory have been "fixed".
> 
> Yes, even if you'd like to keep using uffd-wp that sounds like a very
> reasonable scenario.
> 
>>
>> The change itself looks sane to me AFAIKT.
> 
> So one major motivation of this patch of mine is to prepare for shmem, because
> the old commit obviously only covered anonymous.
> 
> But after a 2nd thought, I just noticed shmem shouldn't have a problem with
> khugepaged merging at all!
> 
> The thing is, when khugepaged is merging a shmem thp, unlike anonymous, it'll
> not merge the ptes into a pmd, but it'll simply zap the ptes.  It means all
> uffd-wp tracking information won't be lost even if merging happened, those info
> will still be kept inside pgtables using (the upcoming) pte markers.
> 
> When faulted, we'll just do small page mappings while it won't stop the shmem
> thp from being mapped hugely in other mm, afaict.
> 
> With that in mind, indeed I see this patch less necessary to be merged; so for
> sparsely wr-protected vmas like virtio-mem we can still keep some of the ranges
> mergeable, that sounds a good thing to keep it as-is.
> 
> NACK myself for now: let's not lose that good property of both thp+uffd-wp so
> easily, and I'll think more of it.
> 

Thanks for the insights, shmem THP is still mostly unexplored territory 
on my end :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-09-24 10:05 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-22 17:51 [PATCH] mm/khugepaged: Detecting uffd-wp vma more efficiently Peter Xu
2021-09-22 18:21 ` David Hildenbrand
2021-09-22 18:58   ` Peter Xu
2021-09-22 19:29     ` Yang Shi
2021-09-22 19:29       ` Yang Shi
2021-09-22 20:04       ` Peter Xu
2021-09-22 20:23         ` Peter Xu
2021-09-24 10:05     ` David Hildenbrand
2021-09-22 19:33 ` Peter Xu
2021-09-22 20:49 ` Axel Rasmussen
2021-09-22 20:49   ` Axel Rasmussen
2021-09-22 21:20   ` Peter Xu
2021-09-22 23:18     ` Hugh Dickins
2021-09-22 23:18       ` Hugh Dickins
2021-09-22 23:44       ` Peter Xu
2021-09-23  1:22         ` Hugh Dickins
2021-09-23  1:22           ` Hugh Dickins
2021-09-23  2:18           ` Peter Xu
2021-09-23 16:47             ` Axel Rasmussen
2021-09-23 16:47               ` Axel Rasmussen
2021-09-23 17:53               ` Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.