linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs
@ 2023-02-16  9:16 Muhammad Usama Anjum
  2023-02-16  9:16 ` [PATCH v4 2/2] mm/userfaultfd: add VM_WARN_ONCE() Muhammad Usama Anjum
  2023-02-16  9:37 ` [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs David Hildenbrand
  0 siblings, 2 replies; 10+ messages in thread
From: Muhammad Usama Anjum @ 2023-02-16  9:16 UTC
  To: peterx, david, Andrew Morton
  Cc: Muhammad Usama Anjum, kernel, Paul Gofman, linux-mm, linux-kernel

mwriteprotect_range() errors out if [start, end) doesn't fall in one
VMA. We are facing a use case where multiple VMAs are present in one
range of interest. For example, the following pseudocode reproduces the
error which we are trying to fix:
- Allocate memory of size 16 pages with PROT_NONE with mmap
- Register userfaultfd
- Change protection of the first half (1 to 8 pages) of memory to
  PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
- Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
  out.

This is a simple use case where the user may or may not know whether the
memory area has been divided into multiple VMAs.

We need an implementation which doesn't disrupt existing users. So, to
keep things simple, stop iterating over the VMAs as soon as one of them
hasn't been registered in WP mode. While at it, remove the unneeded
error check as well.
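
For illustration only, the per-VMA clamping done by the new loop can be
modelled in plain userspace C. This is a sketch under stated assumptions,
not kernel code: struct model_vma and model_mwriteprotect_range() are
hypothetical stand-ins, the uffd_wp flag models userfaultfd_wp(), and the
wp_bytes counter replaces the real call to uffd_wp_range().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: [vm_start, vm_end) plus a WP flag. */
struct model_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	int uffd_wp;		/* models userfaultfd_wp(vma) */
	unsigned long wp_bytes;	/* bytes "protected" so far, for inspection */
};

/*
 * Model of the patched loop: visit every VMA overlapping
 * [start, start + len), clamp the request to that VMA, and stop at the
 * first VMA that is not registered in WP mode.
 * Returns 0 on success and -ENOENT otherwise.
 */
static int model_mwriteprotect_range(struct model_vma *vmas, size_t n,
				     unsigned long start, unsigned long len)
{
	unsigned long end = start + len;
	int err = -ENOENT;	/* stays -ENOENT if no VMA overlaps */
	size_t i;

	assert(end >= start);	/* wrapping ranges are out of scope here */

	for (i = 0; i < n; i++) {
		struct model_vma *vma = &vmas[i];
		unsigned long _start, _end;

		if (vma->vm_end <= start || vma->vm_start >= end)
			continue;	/* outside the range of interest */

		if (!vma->uffd_wp)
			return -ENOENT;

		/* Clamp the request to this VMA, as the patch does. */
		_start = vma->vm_start > start ? vma->vm_start : start;
		_end = vma->vm_end < end ? vma->vm_end : end;
		vma->wp_bytes += _end - _start;
		err = 0;
	}
	return err;
}
```

As in the patch, VMAs handled before the failing one stay write-protected,
and holes between VMAs inside the range are not reported as errors.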

Reported-by: Paul Gofman <pgofman@codeweavers.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes since v3:
- Rebase on top of next-20230216

Changes since v2:
- Correct the return error code and cleanup a bit

Changes since v1:
- Correct the start and ending values passed to uffd_wp_range()
---
 mm/userfaultfd.c | 39 ++++++++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 53c3d916ff66..77c5839e591c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -741,9 +741,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 			unsigned long len, bool enable_wp,
 			atomic_t *mmap_changing)
 {
+	unsigned long end = start + len;
+	unsigned long _start, _end;
 	struct vm_area_struct *dst_vma;
 	unsigned long page_mask;
 	long err;
+	VMA_ITERATOR(vmi, dst_mm, start);
 
 	/*
 	 * Sanitize the command parameters:
@@ -766,28 +769,30 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 		goto out_unlock;
 
 	err = -ENOENT;
-	dst_vma = find_dst_vma(dst_mm, start, len);
+	for_each_vma_range(vmi, dst_vma, end) {
 
-	if (!dst_vma)
-		goto out_unlock;
-	if (!userfaultfd_wp(dst_vma))
-		goto out_unlock;
-	if (!vma_can_userfault(dst_vma, dst_vma->vm_flags))
-		goto out_unlock;
+		if (!userfaultfd_wp(dst_vma)) {
+			err = -ENOENT;
+			break;
+		}
 
-	if (is_vm_hugetlb_page(dst_vma)) {
-		err = -EINVAL;
-		page_mask = vma_kernel_pagesize(dst_vma) - 1;
-		if ((start & page_mask) || (len & page_mask))
-			goto out_unlock;
-	}
+		if (is_vm_hugetlb_page(dst_vma)) {
+			err = -EINVAL;
+			page_mask = vma_kernel_pagesize(dst_vma) - 1;
+			if ((start & page_mask) || (len & page_mask))
+				break;
+		}
 
-	err = uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp);
+		_start = max(dst_vma->vm_start, start);
+		_end = min(dst_vma->vm_end, end);
 
-	/* Return 0 on success, <0 on failures */
-	if (err > 0)
-		err = 0;
+		err = uffd_wp_range(dst_mm, dst_vma, _start, _end - _start, enable_wp);
 
+		/* Return 0 on success, <0 on failures */
+		if (err < 0)
+			break;
+		err = 0;
+	}
 out_unlock:
 	mmap_read_unlock(dst_mm);
 	return err;
-- 
2.39.1



* [PATCH v4 2/2] mm/userfaultfd: add VM_WARN_ONCE()
  2023-02-16  9:16 [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs Muhammad Usama Anjum
@ 2023-02-16  9:16 ` Muhammad Usama Anjum
  2023-02-16  9:24   ` David Hildenbrand
  2023-02-16  9:37 ` [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs David Hildenbrand
  1 sibling, 1 reply; 10+ messages in thread
From: Muhammad Usama Anjum @ 2023-02-16  9:16 UTC
  To: peterx, david, Andrew Morton
  Cc: Muhammad Usama Anjum, kernel, linux-mm, linux-kernel

Add VM_WARN_ONCE() to uffd_wp_range() to detect range (start, len) abuse.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
 mm/userfaultfd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 77c5839e591c..d89ed44d2668 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -717,6 +717,8 @@ long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
 	struct mmu_gather tlb;
 	long ret;
 
+	VM_WARN_ONCE(start < dst_vma->vm_start || start + len > dst_vma->vm_end,
+		     "The address range exceeds VMA boundary.\n");
 	if (enable_wp)
 		mm_cp_flags = MM_CP_UFFD_WP;
 	else
-- 
2.39.1



* Re: [PATCH v4 2/2] mm/userfaultfd: add VM_WARN_ONCE()
  2023-02-16  9:16 ` [PATCH v4 2/2] mm/userfaultfd: add VM_WARN_ONCE() Muhammad Usama Anjum
@ 2023-02-16  9:24   ` David Hildenbrand
  2023-02-16  9:48     ` Muhammad Usama Anjum
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2023-02-16  9:24 UTC
  To: Muhammad Usama Anjum, peterx, Andrew Morton
  Cc: kernel, linux-mm, linux-kernel

On 16.02.23 10:16, Muhammad Usama Anjum wrote:
> Add VM_WARN_ONCE() to uffd_wp_range() to detect range (start, len) abuse.
> 
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
>   mm/userfaultfd.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 77c5839e591c..d89ed44d2668 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -717,6 +717,8 @@ long uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
>   	struct mmu_gather tlb;
>   	long ret;
>   
> +	VM_WARN_ONCE(start < dst_vma->vm_start || start + len > dst_vma->vm_end,
> +		     "The address range exceeds VMA boundary.\n");

VM_WARN_ON_ONCE is sufficient (sorry for spelling out the wrong variant 
earlier).

These kinds of bugs are expected to be found early during testing. Still,
it might make sense to implement a backup path:

if (WARN_ON_ONCE(...))
	return -EINVAL;

But then we couldn't use VM_WARN_ON_ONCE, so the check could no longer be
compiled out ... so I guess a simple VM_WARN_ON_ONCE() is sufficient.
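
To make the trade-off concrete, here is a minimal userspace sketch of the
backup-path variant. It is an illustration under assumptions, not the
kernel implementation: struct vma_bounds, range_exceeds_vma() and
model_uffd_wp_range() are hypothetical names, and only the boundary
predicate is taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two VMA fields the check reads. */
struct vma_bounds {
	unsigned long vm_start;
	unsigned long vm_end;
};

/* Same condition as in the patch: true if the range leaves the VMA. */
static bool range_exceeds_vma(const struct vma_bounds *vma,
			      unsigned long start, unsigned long len)
{
	return start < vma->vm_start || start + len > vma->vm_end;
}

/* Backup-path variant: fail with -EINVAL instead of only warning. */
static long model_uffd_wp_range(const struct vma_bounds *vma,
				unsigned long start, unsigned long len)
{
	assert(vma != NULL);

	if (range_exceeds_vma(vma, start, len))
		return -EINVAL;

	return 0;	/* the real function would change protections here */
}
```

With the plain VM_WARN_ON_ONCE() form there is no return value to branch
on, which is why it can vanish entirely in builds without CONFIG_DEBUG_VM.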

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs
  2023-02-16  9:16 [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs Muhammad Usama Anjum
  2023-02-16  9:16 ` [PATCH v4 2/2] mm/userfaultfd: add VM_WARN_ONCE() Muhammad Usama Anjum
@ 2023-02-16  9:37 ` David Hildenbrand
  2023-02-16 20:25   ` Peter Xu
  1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2023-02-16  9:37 UTC
  To: Muhammad Usama Anjum, peterx, Andrew Morton
  Cc: kernel, Paul Gofman, linux-mm, linux-kernel

On 16.02.23 10:16, Muhammad Usama Anjum wrote:
> mwriteprotect_range() errors out if [start, end) doesn't fall in one
> VMA. We are facing a use case where multiple VMAs are present in one
> range of interest. For example, the following pseudocode reproduces the
> error which we are trying to fix:
> - Allocate memory of size 16 pages with PROT_NONE with mmap
> - Register userfaultfd
> - Change protection of the first half (1 to 8 pages) of memory to
>    PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
>    out.

I think, in QEMU, with partial madvise()/mmap(MAP_FIXED) while handling 
memory remapping during reboot to discard pages with memory errors, it 
would be possible that we get multiple VMAs and could not enable uffd-wp 
for background snapshots anymore. So this change makes sense to me.

Especially because userfaultfd_register() already seems to handle
multi-VMA ranges correctly. It traverses the VMA list twice ...
but also holds the mmap lock in write mode.

> 
> This is a simple use case where the user may or may not know whether the
> memory area has been divided into multiple VMAs.
> 
> We need an implementation which doesn't disrupt existing users. So, to
> keep things simple, stop iterating over the VMAs as soon as one of them
> hasn't been registered in WP mode. While at it, remove the unneeded
> error check as well.
> 
> Reported-by: Paul Gofman <pgofman@codeweavers.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---


Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 2/2] mm/userfaultfd: add VM_WARN_ONCE()
  2023-02-16  9:24   ` David Hildenbrand
@ 2023-02-16  9:48     ` Muhammad Usama Anjum
  2023-02-16 20:26       ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Muhammad Usama Anjum @ 2023-02-16  9:48 UTC
  To: David Hildenbrand, peterx, Andrew Morton
  Cc: Muhammad Usama Anjum, kernel, linux-mm, linux-kernel

On 2/16/23 2:24 PM, David Hildenbrand wrote:
> On 16.02.23 10:16, Muhammad Usama Anjum wrote:
>> Add VM_WARN_ONCE() to uffd_wp_range() to detect range (start, len) abuse.
>>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>> ---
>>   mm/userfaultfd.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> index 77c5839e591c..d89ed44d2668 100644
>> --- a/mm/userfaultfd.c
>> +++ b/mm/userfaultfd.c
>> @@ -717,6 +717,8 @@ long uffd_wp_range(struct mm_struct *dst_mm, struct
>> vm_area_struct *dst_vma,
>>       struct mmu_gather tlb;
>>       long ret;
>>   +    VM_WARN_ONCE(start < dst_vma->vm_start || start + len >
>> dst_vma->vm_end,
>> +             "The address range exceeds VMA boundary.\n");
> 
> VM_WARN_ON_ONCE is sufficient (sorry for spelling out the wrong variant
> earlier).
Will do in the next version. Thanks.

> 
> These kinds of bugs are expected to be found early during testing, still it
> might make sense to implement a backup path
> 
> if (WARN_ON_ONCE(...))
>     return -EINVAL;
> 
> But we can't use VM_WARN_ON_ONCE here, so we can't compile it out anymore
> ... so I guess a simple VM_WARN_ON_ONCE() is sufficient.
> 

-- 
BR,
Muhammad Usama Anjum


* Re: [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs
  2023-02-16  9:37 ` [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs David Hildenbrand
@ 2023-02-16 20:25   ` Peter Xu
  2023-02-17  8:53     ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2023-02-16 20:25 UTC
  To: David Hildenbrand
  Cc: Muhammad Usama Anjum, Andrew Morton, kernel, Paul Gofman,
	linux-mm, linux-kernel

On Thu, Feb 16, 2023 at 10:37:36AM +0100, David Hildenbrand wrote:
> On 16.02.23 10:16, Muhammad Usama Anjum wrote:
> > mwriteprotect_range() errors out if [start, end) doesn't fall in one
> > VMA. We are facing a use case where multiple VMAs are present in one
> > range of interest. For example, the following pseudocode reproduces the
> > error which we are trying to fix:
> > - Allocate memory of size 16 pages with PROT_NONE with mmap
> > - Register userfaultfd
> > - Change protection of the first half (1 to 8 pages) of memory to
> >    PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
> > - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
> >    out.
> 
> I think, in QEMU, with partial madvise()/mmap(MAP_FIXED) while handling
> memory remapping during reboot to discard pages with memory errors, it would
> be possible that we get multiple VMAs and could not enable uffd-wp for
> background snapshots anymore. So this change makes sense to me.

Any pointer for this one?

> 
> Especially because userfaultfd_register() already seems to handle
> multi-VMA ranges correctly. It traverses the VMA list twice ... but also
> holds the mmap lock in write mode.
> 
> > 
> > This is a simple use case where the user may or may not know whether the
> > memory area has been divided into multiple VMAs.
> > 
> > We need an implementation which doesn't disrupt existing users. So, to
> > keep things simple, stop iterating over the VMAs as soon as one of them
> > hasn't been registered in WP mode. While at it, remove the unneeded
> > error check as well.
> > 
> > Reported-by: Paul Gofman <pgofman@codeweavers.com>
> > Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> > ---
> 
> 
> Acked-by: David Hildenbrand <david@redhat.com>

Acked-by: Peter Xu <peterx@redhat.com>

Thanks,

-- 
Peter Xu



* Re: [PATCH v4 2/2] mm/userfaultfd: add VM_WARN_ONCE()
  2023-02-16  9:48     ` Muhammad Usama Anjum
@ 2023-02-16 20:26       ` Peter Xu
  2023-02-17 10:40         ` Muhammad Usama Anjum
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2023-02-16 20:26 UTC
  To: Muhammad Usama Anjum
  Cc: David Hildenbrand, Andrew Morton, kernel, linux-mm, linux-kernel

On Thu, Feb 16, 2023 at 02:48:51PM +0500, Muhammad Usama Anjum wrote:
> On 2/16/23 2:24 PM, David Hildenbrand wrote:
> > On 16.02.23 10:16, Muhammad Usama Anjum wrote:
> >> Add VM_WARN_ONCE() to uffd_wp_range() to detect range (start, len) abuse.
> >>
> >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> >> ---
> >>   mm/userfaultfd.c | 2 ++
> >>   1 file changed, 2 insertions(+)
> >>
> >> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> >> index 77c5839e591c..d89ed44d2668 100644
> >> --- a/mm/userfaultfd.c
> >> +++ b/mm/userfaultfd.c
> >> @@ -717,6 +717,8 @@ long uffd_wp_range(struct mm_struct *dst_mm, struct
> >> vm_area_struct *dst_vma,
> >>       struct mmu_gather tlb;
> >>       long ret;
> >>   +    VM_WARN_ONCE(start < dst_vma->vm_start || start + len >
> >> dst_vma->vm_end,
> >> +             "The address range exceeds VMA boundary.\n");
> > 
> > VM_WARN_ON_ONCE is sufficient (sorry for spelling out the wrong variant
> > earlier).
> Will do in the next version. Thanks.

Shall we just squash the two patches?

-- 
Peter Xu



* Re: [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs
  2023-02-16 20:25   ` Peter Xu
@ 2023-02-17  8:53     ` David Hildenbrand
  2023-02-17 17:35       ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2023-02-17  8:53 UTC
  To: Peter Xu
  Cc: Muhammad Usama Anjum, Andrew Morton, kernel, Paul Gofman,
	linux-mm, linux-kernel

On 16.02.23 21:25, Peter Xu wrote:
> On Thu, Feb 16, 2023 at 10:37:36AM +0100, David Hildenbrand wrote:
>> On 16.02.23 10:16, Muhammad Usama Anjum wrote:
>>> mwriteprotect_range() errors out if [start, end) doesn't fall in one
>>> VMA. We are facing a use case where multiple VMAs are present in one
>>> range of interest. For example, the following pseudocode reproduces the
>>> error which we are trying to fix:
>>> - Allocate memory of size 16 pages with PROT_NONE with mmap
>>> - Register userfaultfd
>>> - Change protection of the first half (1 to 8 pages) of memory to
>>>     PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
>>> - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
>>>     out.
>>
>> I think, in QEMU, with partial madvise()/mmap(MAP_FIXED) while handling
>> memory remapping during reboot to discard pages with memory errors, it would
>> be possible that we get multiple VMAs and could not enable uffd-wp for
>> background snapshots anymore. So this change makes sense to me.
> 
> Any pointer for this one?

In qemu, softmmu/physmem.c:qemu_ram_remap() is instructed on reboot to 
remap VMAs due to MCE pages. We apply QEMU_MADV_MERGEABLE (if configured 
for the machine) and QEMU_MADV_DONTDUMP (if configured for the machine), 
so the kernel could merge the VMAs again.

(a) From experiments (~2 years ago), I recall that some VMAs won't ever
get merged again. I faintly remember that this was the case for
hugetlb. It might have changed in the meantime; I haven't tried it again.
But looking at is_mergeable_vma(), we refuse to merge VMAs that have
vma->vm_ops->close set. I think that might be set for hugetlb
(hugetlb_vm_op_close).

(b) We don't consider memory-backend overrides, like a backend toggling
QEMU_MADV_MERGEABLE or QEMU_MADV_DONTDUMP from backends/hostmem.c,
resulting in multiple unmergeable VMAs.

(c) We don't consider memory-backend mbind(): we don't re-apply the
mbind() policy, resulting in unmergeable VMAs.


The correct way to handle (b) and (c) would be to notify the memory 
backend, to let it reapply the correct flags, and to reapply the mbind() 
policy (I once had patches for that, have to look them up again).

So in these rare setups with MCEs, we would be getting more VMAs and 
while the uffd-wp registration would succeed, uffd-wp protection would fail.

Note that this is purely theoretical; people don't heavily use background
snapshots yet, so I am not aware of any reports. Further, I expect it
to happen only very rarely (MCE+reboot+a/b/c).

So it's more of a "the app doesn't necessarily keep track of the exact 
VMAs".

[I am not sure how helpful remapping !anon memory really is; we
should be getting the same messed-up MCE pages from the fd again, but
that's a different discussion I guess]

-- 
Thanks,

David / dhildenb



* Re: [PATCH v4 2/2] mm/userfaultfd: add VM_WARN_ONCE()
  2023-02-16 20:26       ` Peter Xu
@ 2023-02-17 10:40         ` Muhammad Usama Anjum
  0 siblings, 0 replies; 10+ messages in thread
From: Muhammad Usama Anjum @ 2023-02-17 10:40 UTC
  To: Peter Xu
  Cc: Muhammad Usama Anjum, David Hildenbrand, Andrew Morton, kernel,
	linux-mm, linux-kernel

On 2/17/23 1:26 AM, Peter Xu wrote:
> On Thu, Feb 16, 2023 at 02:48:51PM +0500, Muhammad Usama Anjum wrote:
>> On 2/16/23 2:24 PM, David Hildenbrand wrote:
>>> On 16.02.23 10:16, Muhammad Usama Anjum wrote:
>>>> Add VM_WARN_ONCE() to uffd_wp_range() to detect range (start, len) abuse.
>>>>
>>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>> ---
>>>>   mm/userfaultfd.c | 2 ++
>>>>   1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>>>> index 77c5839e591c..d89ed44d2668 100644
>>>> --- a/mm/userfaultfd.c
>>>> +++ b/mm/userfaultfd.c
>>>> @@ -717,6 +717,8 @@ long uffd_wp_range(struct mm_struct *dst_mm, struct
>>>> vm_area_struct *dst_vma,
>>>>       struct mmu_gather tlb;
>>>>       long ret;
>>>>   +    VM_WARN_ONCE(start < dst_vma->vm_start || start + len >
>>>> dst_vma->vm_end,
>>>> +             "The address range exceeds VMA boundary.\n");
>>>
>>> VM_WARN_ON_ONCE is sufficient (sorry for spelling out the wrong variant
>>> earlier).
>> Will do in the next version. Thanks.
> 
> Shall we just squash the two patches?
Will squash in next version.

> 

-- 
BR,
Muhammad Usama Anjum


* Re: [PATCH v4 1/2] mm/userfaultfd: Support WP on multiple VMAs
  2023-02-17  8:53     ` David Hildenbrand
@ 2023-02-17 17:35       ` Peter Xu
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2023-02-17 17:35 UTC
  To: David Hildenbrand
  Cc: Muhammad Usama Anjum, Andrew Morton, kernel, Paul Gofman,
	linux-mm, linux-kernel

On Fri, Feb 17, 2023 at 09:53:47AM +0100, David Hildenbrand wrote:
> On 16.02.23 21:25, Peter Xu wrote:
> > On Thu, Feb 16, 2023 at 10:37:36AM +0100, David Hildenbrand wrote:
> > > On 16.02.23 10:16, Muhammad Usama Anjum wrote:
> > > > mwriteprotect_range() errors out if [start, end) doesn't fall in one
> > > > VMA. We are facing a use case where multiple VMAs are present in one
> > > > range of interest. For example, the following pseudocode reproduces the
> > > > error which we are trying to fix:
> > > > - Allocate memory of size 16 pages with PROT_NONE with mmap
> > > > - Register userfaultfd
> > > > - Change protection of the first half (1 to 8 pages) of memory to
> > > >     PROT_READ | PROT_WRITE. This breaks the memory area in two VMAs.
> > > > - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors
> > > >     out.
> > > 
> > > I think, in QEMU, with partial madvise()/mmap(MAP_FIXED) while handling
> > > memory remapping during reboot to discard pages with memory errors, it would
> > > be possible that we get multiple VMAs and could not enable uffd-wp for
> > > background snapshots anymore. So this change makes sense to me.
> > 
> > Any pointer for this one?
> 
> In qemu, softmmu/physmem.c:qemu_ram_remap() is instructed on reboot to remap
> VMAs due to MCE pages. We apply QEMU_MADV_MERGEABLE (if configured for the
> machine) and QEMU_MADV_DONTDUMP (if configured for the machine), so the
> kernel could merge the VMAs again.
> 
> (a) From experiments (~2 years ago), I recall that some VMAs won't ever
> get merged again. I faintly remember that this was the case for hugetlb. It
> might have changed in the meantime; I haven't tried it again. But looking at
> is_mergeable_vma(), we refuse to merge VMAs that have vma->vm_ops->close
> set. I think that might be set for hugetlb (hugetlb_vm_op_close).
> 
> (b) We don't consider memory-backend overrides, like a backend toggling
> QEMU_MADV_MERGEABLE or QEMU_MADV_DONTDUMP from backends/hostmem.c, resulting
> in multiple unmergeable VMAs.
> 
> (c) We don't consider memory-backend mbind(): we don't re-apply the mbind()
> policy, resulting in unmergeable VMAs.
> 
> 
> The correct way to handle (b) and (c) would be to notify the memory backend,
> to let it reapply the correct flags, and to reapply the mbind() policy (I
> once had patches for that, have to look them up again).

Makes sense.  There should be a single entry point for reloading RAM with
the specified properties, rather than applying them ad hoc whenever we
notice.

> 
> So in these rare setups with MCEs, we would be getting more VMAs and while
> the uffd-wp registration would succeed, uffd-wp protection would fail.
> 
> Note that this is purely theoretical; people don't heavily use background
> snapshots yet, so I am not aware of any reports. Further, I expect it to
> happen only very rarely (MCE+reboot+a/b/c).
> 
> So it's more of a "the app doesn't necessarily keep track of the exact
> VMAs".

Agree.

> 
> [I am not sure how helpful remapping !anon memory really is; we should
> be getting the same messed-up MCE pages from the fd again, but that's a
> different discussion I guess]

Yes, it sounds like a bug to me.  I'm afraid what it really wants here is
not a remap but, strictly speaking, a truncation.  I think the hwpoison
code in QEMU is just slightly buggy all around - e.g. I found that
qemu_ram_remap() probably wants to use the host psize, not the guest's.

But let's not pollute the mailing lists anymore; thanks for the context!

-- 
Peter Xu


