All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/2] mm/migrate: Fix writable pte for read migration entry
@ 2022-11-14  0:04 Peter Xu
  2022-11-14  0:04 ` [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte Peter Xu
  2022-11-14  0:04 ` [PATCH v3 2/2] mm/uffd: Sanity check write bit for uffd-wp protected ptes Peter Xu
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Xu @ 2022-11-14  0:04 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Rapoport, Nadav Amit, Andrew Morton, peterx,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple

v3:
- Go back to use WARN_ON_ONCE because mmdebug.h not included in
  arch/x86/include/asm/pgtable.h
- Added r-b and t-b
- s/symbton/symbtom/ [Nadav]
- Updated commit message of patch 1, removing mprotect example

v2:
- Replace WARN_ON_ONCE with VM_WARN_ON_ONCE in patch 2 [Nadav]

This comes from a report from Ives on using uffd-wp on shmem.  More
information can be found in patch 1 commit message.

Patch 2 added some more sanity check when walking pgtables and when we
convert the ptes into other forms e.g. for migration and swap.  It will
make the error trigger even earlier than the user could notice, meanwhile
nail down the case if it's a wrong pgtable setup.

We probably need patch 1 for stable (5.19+).  Please have a look, thanks.

Peter Xu (2):
  mm/migrate: Fix read-only page got writable when recover pte
  mm/uffd: Sanity check write bit for uffd-wp protected ptes

 arch/x86/include/asm/pgtable.h | 18 +++++++++++++++++-
 mm/migrate.c                   |  8 +++++++-
 2 files changed, 24 insertions(+), 2 deletions(-)

-- 
2.37.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-11-14  0:04 [PATCH v3 0/2] mm/migrate: Fix writable pte for read migration entry Peter Xu
@ 2022-11-14  0:04 ` Peter Xu
  2022-11-15 18:17   ` David Hildenbrand
  2022-11-14  0:04 ` [PATCH v3 2/2] mm/uffd: Sanity check write bit for uffd-wp protected ptes Peter Xu
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Xu @ 2022-11-14  0:04 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Rapoport, Nadav Amit, Andrew Morton, peterx,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

Ives van Hoorne from codesandbox.io reported an issue regarding possible
data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
symptom is some read page got data mismatch from the snapshot child VMs.

Here I can also reproduce with a Rust reproducer that was provided by Ives
that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
80 instances I can trigger the issues in ten minutes.

It turns out that we got some pages write-through even if uffd-wp is
applied to the pte.

The problem is, when removing migration entries, we didn't really worry
about write bit as long as we know it's not a write migration entry.  That
may not be true, for some memory types (e.g. writable shmem) mk_pte can
return a pte with write bit set, then to recover the migration entry to its
original state we need to explicit wr-protect the pte or it'll has the
write bit set if it's a read migration entry.  For uffd it can cause
write-through.

The relevant code on uffd was introduced in the anon support, which is
commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
2020-04-07).  However anon shouldn't suffer from this problem because anon
should already have the write bit cleared always, so that may not be a
proper Fixes target, while I'm adding the Fixes to be uffd shmem support.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable@vger.kernel.org
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Reported-by: Ives van Hoorne <ives@codesandbox.io>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Ives van Hoorne <ives@codesandbox.io>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/migrate.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index dff333593a8a..8b6351c08c78 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
 			pte = pte_mkdirty(pte);
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
-		else if (pte_swp_uffd_wp(*pvmw.pte))
+		else
+			/* NOTE: mk_pte can have write bit set */
+			pte = pte_wrprotect(pte);
+
+		if (pte_swp_uffd_wp(*pvmw.pte)) {
+			WARN_ON_ONCE(pte_write(pte));
 			pte = pte_mkuffd_wp(pte);
+		}
 
 		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
 			rmap_flags |= RMAP_EXCLUSIVE;
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v3 2/2] mm/uffd: Sanity check write bit for uffd-wp protected ptes
  2022-11-14  0:04 [PATCH v3 0/2] mm/migrate: Fix writable pte for read migration entry Peter Xu
  2022-11-14  0:04 ` [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte Peter Xu
@ 2022-11-14  0:04 ` Peter Xu
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Xu @ 2022-11-14  0:04 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Mike Rapoport, Nadav Amit, Andrew Morton, peterx,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple

Let's add one sanity check for CONFIG_DEBUG_VM on the write bit in whatever
chance we have when walking through the pgtables.  It can bring the error
earlier even before the app notices the data was corrupted on the snapshot.
Also it helps us to identify this is a wrong pgtable setup, so hopefully a
great information to have for debugging too.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/include/asm/pgtable.h | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5059799bebe3..63bdbb0f989e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -291,7 +291,23 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_UFFD_WP;
+	bool wp = pte_flags(pte) & _PAGE_UFFD_WP;
+
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * Having write bit for wr-protect-marked present ptes is fatal,
+	 * because it means the uffd-wp bit will be ignored and write will
+	 * just go through.
+	 *
+	 * Use any chance of pgtable walking to verify this (e.g., when
+	 * page swapped out or being migrated for all purposes). It means
+	 * something is already wrong.  Tell the admin even before the
+	 * process crashes. We also nail it with wrong pgtable setup.
+	 */
+	WARN_ON_ONCE(wp && pte_write(pte));
+#endif
+
+	return wp;
 }
 
 static inline pte_t pte_mkuffd_wp(pte_t pte)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-11-14  0:04 ` [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte Peter Xu
@ 2022-11-15 18:17   ` David Hildenbrand
  2022-11-30 22:24     ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-11-15 18:17 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm
  Cc: Mike Rapoport, Nadav Amit, Andrew Morton, Andrea Arcangeli,
	Ives van Hoorne, Axel Rasmussen, Alistair Popple, stable

On 14.11.22 01:04, Peter Xu wrote:
> Ives van Hoorne from codesandbox.io reported an issue regarding possible
> data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
> symptom is some read page got data mismatch from the snapshot child VMs.
> 
> Here I can also reproduce with a Rust reproducer that was provided by Ives
> that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
> 80 instances I can trigger the issues in ten minutes.
> 
> It turns out that we got some pages write-through even if uffd-wp is
> applied to the pte.
> 
> The problem is, when removing migration entries, we didn't really worry
> about write bit as long as we know it's not a write migration entry.  That
> may not be true, for some memory types (e.g. writable shmem) mk_pte can
> return a pte with write bit set, then to recover the migration entry to its
> original state we need to explicit wr-protect the pte or it'll has the
> write bit set if it's a read migration entry.  For uffd it can cause
> write-through.
> 
> The relevant code on uffd was introduced in the anon support, which is
> commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
> 2020-04-07).  However anon shouldn't suffer from this problem because anon
> should already have the write bit cleared always, so that may not be a
> proper Fixes target, while I'm adding the Fixes to be uffd shmem support.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: stable@vger.kernel.org
> Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
> Reported-by: Ives van Hoorne <ives@codesandbox.io>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Tested-by: Ives van Hoorne <ives@codesandbox.io>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   mm/migrate.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index dff333593a8a..8b6351c08c78 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
>   			pte = pte_mkdirty(pte);
>   		if (is_writable_migration_entry(entry))
>   			pte = maybe_mkwrite(pte, vma);
> -		else if (pte_swp_uffd_wp(*pvmw.pte))
> +		else
> +			/* NOTE: mk_pte can have write bit set */
> +			pte = pte_wrprotect(pte);
> +
> +		if (pte_swp_uffd_wp(*pvmw.pte)) {
> +			WARN_ON_ONCE(pte_write(pte));
>   			pte = pte_mkuffd_wp(pte);
> +		}
>   
>   		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
>   			rmap_flags |= RMAP_EXCLUSIVE;

As raised, I don't agree to this generic non-uffd-wp change without 
further, clear justification.

I won't nack it, but I won't ack it either.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-11-15 18:17   ` David Hildenbrand
@ 2022-11-30 22:24     ` Andrew Morton
  2022-12-01 15:28       ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2022-11-30 22:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, linux-kernel, linux-mm, Mike Rapoport, Nadav Amit,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

On Tue, 15 Nov 2022 19:17:43 +0100 David Hildenbrand <david@redhat.com> wrote:

> On 14.11.22 01:04, Peter Xu wrote:
> > Ives van Hoorne from codesandbox.io reported an issue regarding possible
> > data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
> > symptom is some read page got data mismatch from the snapshot child VMs.
> > 
> > Here I can also reproduce with a Rust reproducer that was provided by Ives
> > that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
> > 80 instances I can trigger the issues in ten minutes.
> > 
> > It turns out that we got some pages write-through even if uffd-wp is
> > applied to the pte.
> > 
> > The problem is, when removing migration entries, we didn't really worry
> > about write bit as long as we know it's not a write migration entry.  That
> > may not be true, for some memory types (e.g. writable shmem) mk_pte can
> > return a pte with write bit set, then to recover the migration entry to its
> > original state we need to explicit wr-protect the pte or it'll has the
> > write bit set if it's a read migration entry.  For uffd it can cause
> > write-through.
> > 
> > The relevant code on uffd was introduced in the anon support, which is
> > commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
> > 2020-04-07).  However anon shouldn't suffer from this problem because anon
> > should already have the write bit cleared always, so that may not be a
> > proper Fixes target, while I'm adding the Fixes to be uffd shmem support.
> > 
>
> ...
>
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
> >   			pte = pte_mkdirty(pte);
> >   		if (is_writable_migration_entry(entry))
> >   			pte = maybe_mkwrite(pte, vma);
> > -		else if (pte_swp_uffd_wp(*pvmw.pte))
> > +		else
> > +			/* NOTE: mk_pte can have write bit set */
> > +			pte = pte_wrprotect(pte);
> > +
> > +		if (pte_swp_uffd_wp(*pvmw.pte)) {
> > +			WARN_ON_ONCE(pte_write(pte));

Will this warnnig trigger in the scenario you and Ives have discovered?

> >   			pte = pte_mkuffd_wp(pte);
> > +		}
> >   
> >   		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
> >   			rmap_flags |= RMAP_EXCLUSIVE;
> 
> As raised, I don't agree to this generic non-uffd-wp change without 
> further, clear justification.

Pater, can you please work this further?

> I won't nack it, but I won't ack it either.

I wouldn't mind seeing a little code comment which explains why we're
doing this.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-11-30 22:24     ` Andrew Morton
@ 2022-12-01 15:28       ` Peter Xu
  2022-12-01 15:42         ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2022-12-01 15:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, linux-kernel, linux-mm, Mike Rapoport,
	Nadav Amit, Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

[-- Attachment #1: Type: text/plain, Size: 3759 bytes --]

Hi, Andrew,

On Wed, Nov 30, 2022 at 02:24:25PM -0800, Andrew Morton wrote:
> On Tue, 15 Nov 2022 19:17:43 +0100 David Hildenbrand <david@redhat.com> wrote:
> 
> > On 14.11.22 01:04, Peter Xu wrote:
> > > Ives van Hoorne from codesandbox.io reported an issue regarding possible
> > > data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
> > > symptom is some read page got data mismatch from the snapshot child VMs.
> > > 
> > > Here I can also reproduce with a Rust reproducer that was provided by Ives
> > > that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
> > > 80 instances I can trigger the issues in ten minutes.
> > > 
> > > It turns out that we got some pages write-through even if uffd-wp is
> > > applied to the pte.
> > > 
> > > The problem is, when removing migration entries, we didn't really worry
> > > about write bit as long as we know it's not a write migration entry.  That
> > > may not be true, for some memory types (e.g. writable shmem) mk_pte can
> > > return a pte with write bit set, then to recover the migration entry to its
> > > original state we need to explicit wr-protect the pte or it'll has the
> > > write bit set if it's a read migration entry.  For uffd it can cause
> > > write-through.
> > > 
> > > The relevant code on uffd was introduced in the anon support, which is
> > > commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
> > > 2020-04-07).  However anon shouldn't suffer from this problem because anon
> > > should already have the write bit cleared always, so that may not be a
> > > proper Fixes target, while I'm adding the Fixes to be uffd shmem support.
> > > 
> >
> > ...
> >
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
> > >   			pte = pte_mkdirty(pte);
> > >   		if (is_writable_migration_entry(entry))
> > >   			pte = maybe_mkwrite(pte, vma);
> > > -		else if (pte_swp_uffd_wp(*pvmw.pte))
> > > +		else
> > > +			/* NOTE: mk_pte can have write bit set */
> > > +			pte = pte_wrprotect(pte);
> > > +
> > > +		if (pte_swp_uffd_wp(*pvmw.pte)) {
> > > +			WARN_ON_ONCE(pte_write(pte));
> 
> Will this warnnig trigger in the scenario you and Ives have discovered?

If without the above newly added wr-protect, yes.  This is the case where
we found we got write bit set even if uffd-wp bit is also set, hence allows
the write to go through even if marked protected.

> 
> > >   			pte = pte_mkuffd_wp(pte);
> > > +		}
> > >   
> > >   		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
> > >   			rmap_flags |= RMAP_EXCLUSIVE;
> > 
> > As raised, I don't agree to this generic non-uffd-wp change without 
> > further, clear justification.
> 
> Pater, can you please work this further?

I didn't reply here because I have already replied with the question in
previous version with a few attempts.  Quotting myself:

https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/

        The thing is recovering the pte into its original form is the
        safest approach to me, so I think we need justification on why it's
        always safe to set the write bit.

I've also got another longer email trying to explain why I think it's the
other way round to be justfied, rather than justifying removal of the write
bit for a read migration entry, here:

https://lore.kernel.org/all/Y3O5bCXSbvKJrjRL@x1n/

> 
> > I won't nack it, but I won't ack it either.
> 
> I wouldn't mind seeing a little code comment which explains why we're
> doing this.

I've got one more fixup to the same patch attached, with enriched comments
on why we need wr-protect for read migration entries.

Please have a look to see whether that helps, thanks.

-- 
Peter Xu

[-- Attachment #2: 0001-fixup-mm-migrate-fix-read-only-page-got-writable-whe.patch --]
[-- Type: text/plain, Size: 1101 bytes --]

From d68c98047ce54c62f3454997a55f23ff6fb317cd Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 1 Dec 2022 10:19:22 -0500
Subject: [PATCH] fixup! mm/migrate: fix read-only page got writable when
 recover pte
Content-type: text/plain

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/migrate.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c13c828d34f3..d14f1f3ab073 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -214,7 +214,14 @@ static bool remove_migration_pte(struct folio *folio,
 		if (is_writable_migration_entry(entry))
 			pte = maybe_mkwrite(pte, vma);
 		else
-			/* NOTE: mk_pte can have write bit set */
+			/*
+			 * NOTE: mk_pte() can have write bit set per memory
+			 * type (e.g. shmem), or pte_mkdirty() per archs
+			 * (e.g., sparc64).  If this is a read migration
+			 * entry, we need to make sure when we recover the
+			 * pte from migration entry to present entry the
+			 * write bit is cleared.
+			 */
 			pte = pte_wrprotect(pte);
 
 		if (pte_swp_uffd_wp(*pvmw.pte)) {
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-12-01 15:28       ` Peter Xu
@ 2022-12-01 15:42         ` David Hildenbrand
  2022-12-01 22:30           ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-12-01 15:42 UTC (permalink / raw)
  To: Peter Xu, Andrew Morton
  Cc: linux-kernel, linux-mm, Mike Rapoport, Nadav Amit,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

On 01.12.22 16:28, Peter Xu wrote:
> Hi, Andrew,
> 
> On Wed, Nov 30, 2022 at 02:24:25PM -0800, Andrew Morton wrote:
>> On Tue, 15 Nov 2022 19:17:43 +0100 David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 14.11.22 01:04, Peter Xu wrote:
>>>> Ives van Hoorne from codesandbox.io reported an issue regarding possible
>>>> data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
>>>> symptom is some read page got data mismatch from the snapshot child VMs.
>>>>
>>>> Here I can also reproduce with a Rust reproducer that was provided by Ives
>>>> that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
>>>> 80 instances I can trigger the issues in ten minutes.
>>>>
>>>> It turns out that we got some pages write-through even if uffd-wp is
>>>> applied to the pte.
>>>>
>>>> The problem is, when removing migration entries, we didn't really worry
>>>> about write bit as long as we know it's not a write migration entry.  That
>>>> may not be true, for some memory types (e.g. writable shmem) mk_pte can
>>>> return a pte with write bit set, then to recover the migration entry to its
>>>> original state we need to explicit wr-protect the pte or it'll has the
>>>> write bit set if it's a read migration entry.  For uffd it can cause
>>>> write-through.
>>>>
>>>> The relevant code on uffd was introduced in the anon support, which is
>>>> commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
>>>> 2020-04-07).  However anon shouldn't suffer from this problem because anon
>>>> should already have the write bit cleared always, so that may not be a
>>>> proper Fixes target, while I'm adding the Fixes to be uffd shmem support.
>>>>
>>>
>>> ...
>>>
>>>> --- a/mm/migrate.c
>>>> +++ b/mm/migrate.c
>>>> @@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
>>>>    			pte = pte_mkdirty(pte);
>>>>    		if (is_writable_migration_entry(entry))
>>>>    			pte = maybe_mkwrite(pte, vma);
>>>> -		else if (pte_swp_uffd_wp(*pvmw.pte))
>>>> +		else
>>>> +			/* NOTE: mk_pte can have write bit set */
>>>> +			pte = pte_wrprotect(pte);
>>>> +
>>>> +		if (pte_swp_uffd_wp(*pvmw.pte)) {
>>>> +			WARN_ON_ONCE(pte_write(pte));
>>
>> Will this warnnig trigger in the scenario you and Ives have discovered?
> 
> If without the above newly added wr-protect, yes.  This is the case where
> we found we got write bit set even if uffd-wp bit is also set, hence allows
> the write to go through even if marked protected.
> 
>>
>>>>    			pte = pte_mkuffd_wp(pte);
>>>> +		}
>>>>    
>>>>    		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
>>>>    			rmap_flags |= RMAP_EXCLUSIVE;
>>>
>>> As raised, I don't agree to this generic non-uffd-wp change without
>>> further, clear justification.
>>
>> Pater, can you please work this further?
> 
> I didn't reply here because I have already replied with the question in
> previous version with a few attempts.  Quotting myself:
> 
> https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/
> 
>          The thing is recovering the pte into its original form is the
>          safest approach to me, so I think we need justification on why it's
>          always safe to set the write bit.
> 
> I've also got another longer email trying to explain why I think it's the
> other way round to be justfied, rather than justifying removal of the write
> bit for a read migration entry, here:
> 

And I disagree for this patch that is supposed to fix this hunk:


@@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
                 entry = pte_to_swp_entry(*pvmw.pte);
                 if (is_write_migration_entry(entry))
                         pte = maybe_mkwrite(pte, vma);
+               else if (pte_swp_uffd_wp(*pvmw.pte))
+                       pte = pte_mkuffd_wp(pte);
  
                 if (unlikely(is_zone_device_page(new))) {
                         if (is_device_private_page(new)) {
                                 entry = make_device_private_entry(new, pte_write(pte));
                                 pte = swp_entry_to_pte(entry);
+                               if (pte_swp_uffd_wp(*pvmw.pte))
+                                       pte = pte_mkuffd_wp(pte);
                         }
                 }
  

There is really nothing to justify the other way around here.
If it's broken fix it independently and properly backport it independenty.

But we don't know about any such broken case.

I have no energy to spare to argue further ;)



-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-12-01 15:42         ` David Hildenbrand
@ 2022-12-01 22:30           ` Andrew Morton
  2022-12-02 11:03             ` David Hildenbrand
  2022-12-02 17:18             ` David Hildenbrand
  0 siblings, 2 replies; 13+ messages in thread
From: Andrew Morton @ 2022-12-01 22:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, linux-kernel, linux-mm, Mike Rapoport, Nadav Amit,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

On Thu, 1 Dec 2022 16:42:52 +0100 David Hildenbrand <david@redhat.com> wrote:

> On 01.12.22 16:28, Peter Xu wrote:
> > 
> > I didn't reply here because I have already replied with the question in
> > previous version with a few attempts.  Quotting myself:
> > 
> > https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/
> > 
> >          The thing is recovering the pte into its original form is the
> >          safest approach to me, so I think we need justification on why it's
> >          always safe to set the write bit.
> > 
> > I've also got another longer email trying to explain why I think it's the
> > other way round to be justfied, rather than justifying removal of the write
> > bit for a read migration entry, here:
> > 
> 
> And I disagree for this patch that is supposed to fix this hunk:
> 
> 
> @@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>                  entry = pte_to_swp_entry(*pvmw.pte);
>                  if (is_write_migration_entry(entry))
>                          pte = maybe_mkwrite(pte, vma);
> +               else if (pte_swp_uffd_wp(*pvmw.pte))
> +                       pte = pte_mkuffd_wp(pte);
>   
>                  if (unlikely(is_zone_device_page(new))) {
>                          if (is_device_private_page(new)) {
>                                  entry = make_device_private_entry(new, pte_write(pte));
>                                  pte = swp_entry_to_pte(entry);
> +                               if (pte_swp_uffd_wp(*pvmw.pte))
> +                                       pte = pte_mkuffd_wp(pte);
>                          }
>                  }

David, I'm unclear on what you mean by the above.  Can you please
expand?

> 
> There is really nothing to justify the other way around here.
> If it's broken fix it independently and properly backport it independenty.
> 
> But we don't know about any such broken case.
> 
> I have no energy to spare to argue further ;)

This is a silent data loss bug, which is about as bad as it gets. 
Under obscure conditions, fortunately.  But please let's keep working
it.  Let's aim for something minimal for backporting purposes.  We can
revisit any cleanliness issues later.

David, do you feel that the proposed fix will at least address the bug
without adverse side-effects?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-12-01 22:30           ` Andrew Morton
@ 2022-12-02 11:03             ` David Hildenbrand
  2022-12-02 12:07               ` David Hildenbrand
  2022-12-02 17:18             ` David Hildenbrand
  1 sibling, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-12-02 11:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, linux-kernel, linux-mm, Mike Rapoport, Nadav Amit,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

On 01.12.22 23:30, Andrew Morton wrote:
> On Thu, 1 Dec 2022 16:42:52 +0100 David Hildenbrand <david@redhat.com> wrote:
> 
>> On 01.12.22 16:28, Peter Xu wrote:
>>>
>>> I didn't reply here because I have already replied with the question in
>>> previous version with a few attempts.  Quotting myself:
>>>
>>> https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/
>>>
>>>           The thing is recovering the pte into its original form is the
>>>           safest approach to me, so I think we need justification on why it's
>>>           always safe to set the write bit.
>>>
>>> I've also got another longer email trying to explain why I think it's the
>>> other way round to be justfied, rather than justifying removal of the write
>>> bit for a read migration entry, here:
>>>
>>
>> And I disagree for this patch that is supposed to fix this hunk:
>>
>>
>> @@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>>                   entry = pte_to_swp_entry(*pvmw.pte);
>>                   if (is_write_migration_entry(entry))
>>                           pte = maybe_mkwrite(pte, vma);
>> +               else if (pte_swp_uffd_wp(*pvmw.pte))
>> +                       pte = pte_mkuffd_wp(pte);
>>    
>>                   if (unlikely(is_zone_device_page(new))) {
>>                           if (is_device_private_page(new)) {
>>                                   entry = make_device_private_entry(new, pte_write(pte));
>>                                   pte = swp_entry_to_pte(entry);
>> +                               if (pte_swp_uffd_wp(*pvmw.pte))
>> +                                       pte = pte_mkuffd_wp(pte);
>>                           }
>>                   }
> 
> David, I'm unclear on what you mean by the above.  Can you please
> expand?
> 
>>
>> There is really nothing to justify the other way around here.
>> If it's broken fix it independently and properly backport it independenty.
>>
>> But we don't know about any such broken case.
>>
>> I have no energy to spare to argue further ;)
> 
> This is a silent data loss bug, which is about as bad as it gets.
> Under obscure conditions, fortunately.  But please let's keep working
> it.  Let's aim for something minimal for backporting purposes.  We can
> revisit any cleanliness issues later.

Okay, you activated my energy reserves.

> 
> David, do you feel that the proposed fix will at least address the bug
> without adverse side-effects?

Usually, when I suspect something is dodgy I unconsciously push back 
harder than I usually would.

I just looked into the issue once again and realized that this patch 
here (and also my alternative proposal) most likely tackles the 
more-generic issue from the wrong direction. I found yet another such 
bug (most probably two, just too lazy to write another reproducer). 
Migration code does the right thing here -- IMHO -- and the issue should 
be fixed differently.

I'm testing an alternative patch right now and will share it later 
today, along with a reproducer.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-12-02 11:03             ` David Hildenbrand
@ 2022-12-02 12:07               ` David Hildenbrand
  2022-12-02 15:14                 ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2022-12-02 12:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, linux-kernel, linux-mm, Mike Rapoport, Nadav Amit,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

[-- Attachment #1: Type: text/plain, Size: 3243 bytes --]

On 02.12.22 12:03, David Hildenbrand wrote:
> On 01.12.22 23:30, Andrew Morton wrote:
>> On Thu, 1 Dec 2022 16:42:52 +0100 David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 01.12.22 16:28, Peter Xu wrote:
>>>>
>>>> I didn't reply here because I have already replied with the question in
>>>> previous version with a few attempts.  Quotting myself:
>>>>
>>>> https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/
>>>>
>>>>            The thing is recovering the pte into its original form is the
>>>>            safest approach to me, so I think we need justification on why it's
>>>>            always safe to set the write bit.
>>>>
>>>> I've also got another longer email trying to explain why I think it's the
>>>> other way round to be justfied, rather than justifying removal of the write
>>>> bit for a read migration entry, here:
>>>>
>>>
>>> And I disagree for this patch that is supposed to fix this hunk:
>>>
>>>
>>> @@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>>>                    entry = pte_to_swp_entry(*pvmw.pte);
>>>                    if (is_write_migration_entry(entry))
>>>                            pte = maybe_mkwrite(pte, vma);
>>> +               else if (pte_swp_uffd_wp(*pvmw.pte))
>>> +                       pte = pte_mkuffd_wp(pte);
>>>     
>>>                    if (unlikely(is_zone_device_page(new))) {
>>>                            if (is_device_private_page(new)) {
>>>                                    entry = make_device_private_entry(new, pte_write(pte));
>>>                                    pte = swp_entry_to_pte(entry);
>>> +                               if (pte_swp_uffd_wp(*pvmw.pte))
>>> +                                       pte = pte_mkuffd_wp(pte);
>>>                            }
>>>                    }
>>
>> David, I'm unclear on what you mean by the above.  Can you please
>> expand?
>>
>>>
>>> There is really nothing to justify the other way around here.
>>> If it's broken fix it independently and properly backport it independenty.
>>>
>>> But we don't know about any such broken case.
>>>
>>> I have no energy to spare to argue further ;)
>>
>> This is a silent data loss bug, which is about as bad as it gets.
>> Under obscure conditions, fortunately.  But please let's keep working
>> it.  Let's aim for something minimal for backporting purposes.  We can
>> revisit any cleanliness issues later.
> 
> Okay, you activated my energy reserves.
> 
>>
>> David, do you feel that the proposed fix will at least address the bug
>> without adverse side-effects?
> 
> Usually, when I suspect something is dodgy I unconsciously push back
> harder than I usually would.
> 
> I just looked into the issue once again and realized that this patch
> here (and also my alternative proposal) most likely tackles the
> more-generic issue from the wrong direction. I found yet another such
> bug (most probably two, just too lazy to write another reproducer).
> Migration code does the right thing here -- IMHO -- and the issue should
> be fixed differently.
> 
> I'm testing an alternative patch right now and will share it later
> today, along with a reproducer.
> 

mprotect() reproducer attached.

-- 
Thanks,

David / dhildenb

[-- Attachment #2: uffd-wp-mprotect.c --]
[-- Type: text/x-csrc, Size: 3186 bytes --]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <linux/memfd.h>
#include <linux/userfaultfd.h>

size_t pagesize;
int uffd;

static void *uffd_thread_fn(void *arg)
{
	static struct uffd_msg msg;
	ssize_t nread;

	while (1) {
		struct pollfd pollfd;
		int nready;

		pollfd.fd = uffd;
		pollfd.events = POLLIN;
		nready = poll(&pollfd, 1, -1);
		if (nready == -1) {
			fprintf(stderr, "poll() failed: %d\n", errno);
			exit(1);
		}

		nread = read(uffd, &msg, sizeof(msg));
		if (nread <= 0)
			continue;

		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
			printf("FAIL: wrong uffd-wp event fired\n");
			exit(1);
		}

		printf("PASS: uffd-wp fired\n");
		exit(0);
	}
}

static int setup_uffd(char *map)
{
	struct uffdio_api uffdio_api;
	struct uffdio_register uffdio_register;
	struct uffdio_range uffd_range;
	pthread_t thread;

	uffd = syscall(__NR_userfaultfd,
		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	if (uffd < 0) {
		fprintf(stderr, "syscall() failed: %d\n", errno);
		return -errno;
	}

	uffdio_api.api = UFFD_API;
	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
		return -errno;
	}

	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
		fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
		return -ENOSYS;
	}

	uffdio_register.range.start = (unsigned long) map;
	uffdio_register.range.len = pagesize;
	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
		return -errno;
	}

	pthread_create(&thread, NULL, uffd_thread_fn, NULL);

	return 0;
}

int main(int argc, char **argv)
{
	struct uffdio_writeprotect uffd_writeprotect;
	char *map;
	int fd;

	pagesize = getpagesize();
	fd = memfd_create("test", 0);
	if (fd < 0) {
		fprintf(stderr, "memfd_create() failed\n");
		return -errno;
	}
	if (ftruncate(fd, pagesize)) {
		fprintf(stderr, "ftruncate() failed\n");
		return -errno;
	}

	/* Start out without write protection. */
	map = mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		return -errno;
	}

	if (setup_uffd(map))
		return 1;

	/* Populate a page ... */
	memset(map, 0, pagesize);

	/* ... and write-protect it using uffd-wp. */
	uffd_writeprotect.range.start = (unsigned long) map;
	uffd_writeprotect.range.len = pagesize;
	uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
		return -errno;
	}

	/* Write-protect the whole mapping temporarily. */
	mprotect(map, pagesize, PROT_READ);
	mprotect(map, pagesize, PROT_READ|PROT_WRITE);

	/* Test if uffd-wp fires. */
	memset(map, 1, pagesize);

	printf("FAIL: uffd-wp did not fire\n");
	return 1;
}

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-12-02 12:07               ` David Hildenbrand
@ 2022-12-02 15:14                 ` Peter Xu
  2022-12-02 15:40                   ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2022-12-02 15:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-kernel, linux-mm, Mike Rapoport, Nadav Amit,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

On Fri, Dec 02, 2022 at 01:07:02PM +0100, David Hildenbrand wrote:
> On 02.12.22 12:03, David Hildenbrand wrote:
> > On 01.12.22 23:30, Andrew Morton wrote:
> > > On Thu, 1 Dec 2022 16:42:52 +0100 David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > > On 01.12.22 16:28, Peter Xu wrote:
> > > > > 
> > > > > I didn't reply here because I have already replied with the question in
> > > > > previous version with a few attempts.  Quotting myself:
> > > > > 
> > > > > https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/
> > > > > 
> > > > >            The thing is recovering the pte into its original form is the
> > > > >            safest approach to me, so I think we need justification on why it's
> > > > >            always safe to set the write bit.
> > > > > 
> > > > > I've also got another longer email trying to explain why I think it's the
> > > > > other way round to be justfied, rather than justifying removal of the write
> > > > > bit for a read migration entry, here:
> > > > > 
> > > > 
> > > > And I disagree for this patch that is supposed to fix this hunk:
> > > > 
> > > > 
> > > > @@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
> > > >                    entry = pte_to_swp_entry(*pvmw.pte);
> > > >                    if (is_write_migration_entry(entry))
> > > >                            pte = maybe_mkwrite(pte, vma);
> > > > +               else if (pte_swp_uffd_wp(*pvmw.pte))
> > > > +                       pte = pte_mkuffd_wp(pte);
> > > >                    if (unlikely(is_zone_device_page(new))) {
> > > >                            if (is_device_private_page(new)) {
> > > >                                    entry = make_device_private_entry(new, pte_write(pte));
> > > >                                    pte = swp_entry_to_pte(entry);
> > > > +                               if (pte_swp_uffd_wp(*pvmw.pte))
> > > > +                                       pte = pte_mkuffd_wp(pte);
> > > >                            }
> > > >                    }
> > > 
> > > David, I'm unclear on what you mean by the above.  Can you please
> > > expand?
> > > 
> > > > 
> > > > There is really nothing to justify the other way around here.
> > > > If it's broken fix it independently and properly backport it independenty.
> > > > 
> > > > But we don't know about any such broken case.
> > > > 
> > > > I have no energy to spare to argue further ;)
> > > 
> > > This is a silent data loss bug, which is about as bad as it gets.
> > > Under obscure conditions, fortunately.  But please let's keep working
> > > it.  Let's aim for something minimal for backporting purposes.  We can
> > > revisit any cleanliness issues later.
> > 
> > Okay, you activated my energy reserves.
> > 
> > > 
> > > David, do you feel that the proposed fix will at least address the bug
> > > without adverse side-effects?
> > 
> > Usually, when I suspect something is dodgy I unconsciously push back
> > harder than I usually would.

Please consider using unconsciousness only for self guidance, figuring out
directions, or making decisions on one's own.

For discussions on the list which can get more than one person involved, we
do need consciousness and reasonings.

Thanks for the reproducer, that's definitely good reasonings.  Do you have
other reproducer that can trigger an issue without mprotect()?

As I probably mentioned before in other threads mprotect() is IMHO
conceptually against uffd-wp and I don't yet figured out how to use them
all right.  For example, we can uffd-wr-protect a pte in uffd-wp range,
then if we do "mprotect(RW)" it's hard to tell whether the user wants it
write or not.  E.g., using mprotect(RW) to resolve page faults should be
wrong because it'll not touch the uffd-wp bit at all.  I confess I never
thought more on how we should define the interactions between uffd-wp and
mprotect.

In short, it'll be great if you have other reproducers for any uffd-wp
issues other than mprotect().

I said that also because I just got another message from Ives privately
that there _seems_ to have yet another even harder to reproduce bug here
(Ives, feel free to fill in any more information if you got it).  So if you
can figure out what's missing and already write a reproducer, that'll be
perfect.

Thanks,

> > 
> > I just looked into the issue once again and realized that this patch
> > here (and also my alternative proposal) most likely tackles the
> > more-generic issue from the wrong direction. I found yet another such
> > bug (most probably two, just too lazy to write another reproducer).
> > Migration code does the right thing here -- IMHO -- and the issue should
> > be fixed differently.
> > 
> > I'm testing an alternative patch right now and will share it later
> > today, along with a reproducer.
> > 
> 
> mprotect() reproducer attached.
> 
> -- 
> Thanks,
> 
> David / dhildenb

> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <errno.h>
> #include <poll.h>
> #include <pthread.h>
> #include <sys/mman.h>
> #include <sys/syscall.h>
> #include <sys/ioctl.h>
> #include <linux/memfd.h>
> #include <linux/userfaultfd.h>
> 
> size_t pagesize;
> int uffd;
> 
> static void *uffd_thread_fn(void *arg)
> {
> 	static struct uffd_msg msg;
> 	ssize_t nread;
> 
> 	while (1) {
> 		struct pollfd pollfd;
> 		int nready;
> 
> 		pollfd.fd = uffd;
> 		pollfd.events = POLLIN;
> 		nready = poll(&pollfd, 1, -1);
> 		if (nready == -1) {
> 			fprintf(stderr, "poll() failed: %d\n", errno);
> 			exit(1);
> 		}
> 
> 		nread = read(uffd, &msg, sizeof(msg));
> 		if (nread <= 0)
> 			continue;
> 
> 		if (msg.event != UFFD_EVENT_PAGEFAULT ||
> 		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
> 			printf("FAIL: wrong uffd-wp event fired\n");
> 			exit(1);
> 		}
> 
> 		printf("PASS: uffd-wp fired\n");
> 		exit(0);
> 	}
> }
> 
> static int setup_uffd(char *map)
> {
> 	struct uffdio_api uffdio_api;
> 	struct uffdio_register uffdio_register;
> 	struct uffdio_range uffd_range;
> 	pthread_t thread;
> 
> 	uffd = syscall(__NR_userfaultfd,
> 		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
> 	if (uffd < 0) {
> 		fprintf(stderr, "syscall() failed: %d\n", errno);
> 		return -errno;
> 	}
> 
> 	uffdio_api.api = UFFD_API;
> 	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
> 	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
> 		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
> 		return -errno;
> 	}
> 
> 	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
> 		fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
> 		return -ENOSYS;
> 	}
> 
> 	uffdio_register.range.start = (unsigned long) map;
> 	uffdio_register.range.len = pagesize;
> 	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
> 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
> 		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
> 		return -errno;
> 	}
> 
> 	pthread_create(&thread, NULL, uffd_thread_fn, NULL);
> 
> 	return 0;
> }
> 
> int main(int argc, char **argv)
> {
> 	struct uffdio_writeprotect uffd_writeprotect;
> 	char *map;
> 	int fd;
> 
> 	pagesize = getpagesize();
> 	fd = memfd_create("test", 0);
> 	if (fd < 0) {
> 		fprintf(stderr, "memfd_create() failed\n");
> 		return -errno;
> 	}
> 	if (ftruncate(fd, pagesize)) {
> 		fprintf(stderr, "ftruncate() failed\n");
> 		return -errno;
> 	}
> 
> 	/* Start out without write protection. */
> 	map = mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> 	if (map == MAP_FAILED) {
> 		fprintf(stderr, "mmap() failed\n");
> 		return -errno;
> 	}
> 
> 	if (setup_uffd(map))
> 		return 1;
> 
> 	/* Populate a page ... */
> 	memset(map, 0, pagesize);
> 
> 	/* ... and write-protect it using uffd-wp. */
> 	uffd_writeprotect.range.start = (unsigned long) map;
> 	uffd_writeprotect.range.len = pagesize;
> 	uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
> 	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
> 		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
> 		return -errno;
> 	}
> 
> 	/* Write-protect the whole mapping temporarily. */
> 	mprotect(map, pagesize, PROT_READ);
> 	mprotect(map, pagesize, PROT_READ|PROT_WRITE);
> 
> 	/* Test if uffd-wp fires. */
> 	memset(map, 1, pagesize);
> 
> 	printf("FAIL: uffd-wp did not fire\n");
> 	return 1;
> }


-- 
Peter Xu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-12-02 15:14                 ` Peter Xu
@ 2022-12-02 15:40                   ` David Hildenbrand
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2022-12-02 15:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, linux-kernel, linux-mm, Mike Rapoport, Nadav Amit,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

>>>>
>>>> David, do you feel that the proposed fix will at least address the bug
>>>> without adverse side-effects?
>>>
>>> Usually, when I suspect something is dodgy I unconsciously push back
>>> harder than I usually would.
> 
> Please consider using unconsciousness only for self guidance, figuring out
> directions, or making decisions on one's own.

Yeah, sorry about my communication. I expressed that this approach felt 
wrong to me, I just wasn't able to phrase exactly why I thought 
migration is doing the right thing and didn't have a lot of time to look 
into the details.

Now I dedicated some time and realized that mproctect() is doing the 
exact same thing, it became clearer to me why migration code wasn't 
broken before.

> 
> For discussions on the list which can get more than one person involved, we
> do need consciousness and reasonings.

Yeah, I need vacation.

> 
> Thanks for the reproducer, that's definitely good reasonings.  Do you have
> other reproducer that can trigger an issue without mprotect()?

As noted in the RFC patch I sent, I suspect NUMA hinting page remapping 
might similarly trigger it. I did not try reproducing it, though.

> 
> As I probably mentioned before in other threads mprotect() is IMHO
> conceptually against uffd-wp and I don't yet figured out how to use them
> all right.  For example, we can uffd-wr-protect a pte in uffd-wp range,
> then if we do "mprotect(RW)" it's hard to tell whether the user wants it
> write or not.  E.g., using mprotect(RW) to resolve page faults should be
> wrong because it'll not touch the uffd-wp bit at all.  I confess I never
> thought more on how we should define the interactions between uffd-wp and
> mprotect.
> 
> In short, it'll be great if you have other reproducers for any uffd-wp
> issues other than mprotect().
> 
> I said that also because I just got another message from Ives privately
> that there _seems_ to have yet another even harder to reproduce bug here
> (Ives, feel free to fill in any more information if you got it).  So if you
> can figure out what's missing and already write a reproducer, that'll be
> perfect.

Maybe NUMA hitning on the fallback path, when we didn't migrate or 
migration failed?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte
  2022-12-01 22:30           ` Andrew Morton
  2022-12-02 11:03             ` David Hildenbrand
@ 2022-12-02 17:18             ` David Hildenbrand
  1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2022-12-02 17:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, linux-kernel, linux-mm, Mike Rapoport, Nadav Amit,
	Andrea Arcangeli, Ives van Hoorne, Axel Rasmussen,
	Alistair Popple, stable

On 01.12.22 23:30, Andrew Morton wrote:
> On Thu, 1 Dec 2022 16:42:52 +0100 David Hildenbrand <david@redhat.com> wrote:
> 
>> On 01.12.22 16:28, Peter Xu wrote:
>>>
>>> I didn't reply here because I have already replied with the question in
>>> previous version with a few attempts.  Quotting myself:
>>>
>>> https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/
>>>
>>>           The thing is recovering the pte into its original form is the
>>>           safest approach to me, so I think we need justification on why it's
>>>           always safe to set the write bit.
>>>
>>> I've also got another longer email trying to explain why I think it's the
>>> other way round to be justfied, rather than justifying removal of the write
>>> bit for a read migration entry, here:
>>>
>>
>> And I disagree for this patch that is supposed to fix this hunk:
>>
>>
>> @@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>>                   entry = pte_to_swp_entry(*pvmw.pte);
>>                   if (is_write_migration_entry(entry))
>>                           pte = maybe_mkwrite(pte, vma);
>> +               else if (pte_swp_uffd_wp(*pvmw.pte))
>> +                       pte = pte_mkuffd_wp(pte);
>>    
>>                   if (unlikely(is_zone_device_page(new))) {
>>                           if (is_device_private_page(new)) {
>>                                   entry = make_device_private_entry(new, pte_write(pte));
>>                                   pte = swp_entry_to_pte(entry);
>> +                               if (pte_swp_uffd_wp(*pvmw.pte))
>> +                                       pte = pte_mkuffd_wp(pte);
>>                           }
>>                   }
> 
> David, I'm unclear on what you mean by the above.  Can you please
> expand?
> 
>>
>> There is really nothing to justify the other way around here.
>> If it's broken fix it independently and properly backport it independenty.
>>
>> But we don't know about any such broken case.
>>
>> I have no energy to spare to argue further ;)
> 
> This is a silent data loss bug, which is about as bad as it gets.
> Under obscure conditions, fortunately.  But please let's keep working
> it.  Let's aim for something minimal for backporting purposes.  We can
> revisit any cleanliness issues later.
> 
> David, do you feel that the proposed fix will at least address the bug
> without adverse side-effects?

Just to answer that question clearly: it will fix this bug, but it's 
likely that other similar bugs remain (suspecting NUMA hinting).

Adverse side effect will be that some PTEs that could we writable won't 
be writable. I assume it's not too bad in practice for this particular case.

I proposed an alternative fix and identified other possible broken 
cases. Again, I don't NAK this patch as is, it just logically doesn't 
make sense to me to handle this case differently to the other 
vma->vm_page_prot users. (more details in the other thread)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-12-02 17:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-14  0:04 [PATCH v3 0/2] mm/migrate: Fix writable pte for read migration entry Peter Xu
2022-11-14  0:04 ` [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte Peter Xu
2022-11-15 18:17   ` David Hildenbrand
2022-11-30 22:24     ` Andrew Morton
2022-12-01 15:28       ` Peter Xu
2022-12-01 15:42         ` David Hildenbrand
2022-12-01 22:30           ` Andrew Morton
2022-12-02 11:03             ` David Hildenbrand
2022-12-02 12:07               ` David Hildenbrand
2022-12-02 15:14                 ` Peter Xu
2022-12-02 15:40                   ` David Hildenbrand
2022-12-02 17:18             ` David Hildenbrand
2022-11-14  0:04 ` [PATCH v3 2/2] mm/uffd: Sanity check write bit for uffd-wp protected ptes Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.