* [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
@ 2015-11-11 15:35 Jason J. Herne
2015-11-11 17:30 ` Andrea Arcangeli
0 siblings, 1 reply; 18+ messages in thread
From: Jason J. Herne @ 2015-11-11 15:35 UTC (permalink / raw)
To: linux-s390; +Cc: linux-mm, aarcange, borntraeger, Jason J. Herne
MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
hugepage but hugepage_madvise() takes the error path when we ask to turn
on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
new postcopy migration feature to fail on s390 because its first action is
to madvise the guest address space as NOHUGEPAGE. This patch modifies the
code so that the operation succeeds without error now.
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
---
mm/huge_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c29ddeb..a8b5347 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
/*
* Be somewhat over-protective like KSM for now!
*/
- if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
+ if (*vm_flags & VM_NO_THP)
return -EINVAL;
*vm_flags &= ~VM_HUGEPAGE;
*vm_flags |= VM_NOHUGEPAGE;
--
1.9.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-11 15:35 [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390 Jason J. Herne
@ 2015-11-11 17:30 ` Andrea Arcangeli
2015-11-11 19:47 ` Christian Borntraeger
2015-11-11 20:01 ` Christian Borntraeger
0 siblings, 2 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-11-11 17:30 UTC (permalink / raw)
To: Jason J. Herne; +Cc: linux-s390, linux-mm, borntraeger
Hi Jason,
On Wed, Nov 11, 2015 at 10:35:16AM -0500, Jason J. Herne wrote:
> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> hugepage but hugepage_madvise() takes the error path when we ask to turn
> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
I wonder why KVM disables transparent hugepages on s390. It sounds
weird to disable transparent hugepages with KVM. In fact on x86 we
call MADV_HUGEPAGE to be sure transparent hugepages are enabled on the
guest physical memory, even if the transparent_hugepage/enabled ==
madvise.
> new postcopy migration feature to fail on s390 because its first action is
> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> code so that the operation succeeds without error now.
The other way is to change qemu to keep track it already called
MADV_NOHUGEPAGE and not to call it again. I don't have a strong
opinion on this, I think it's ok to return 0 but it's a visible change
to userland, I can't imagine it to break anything though. It sounds
very unlikely that an app could error out if it notices the kernel
doesn't error out on the second call of MADV_NOHUGEPAGE.
Glad to hear KVM postcopy live migration is already running on s390 too.
Thanks,
Andrea
>
> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
> ---
> mm/huge_memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c29ddeb..a8b5347 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> /*
> * Be somewhat over-protective like KSM for now!
> */
> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
> + if (*vm_flags & VM_NO_THP)
> return -EINVAL;
> *vm_flags &= ~VM_HUGEPAGE;
> *vm_flags |= VM_NOHUGEPAGE;
> --
> 1.9.1
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-11 17:30 ` Andrea Arcangeli
@ 2015-11-11 19:47 ` Christian Borntraeger
2015-11-11 20:01 ` Christian Borntraeger
1 sibling, 0 replies; 18+ messages in thread
From: Christian Borntraeger @ 2015-11-11 19:47 UTC (permalink / raw)
To: Andrea Arcangeli, Jason J. Herne; +Cc: linux-s390, linux-mm, KVM list
Am 11.11.2015 um 18:30 schrieb Andrea Arcangeli:
> Hi Jason,
>
> On Wed, Nov 11, 2015 at 10:35:16AM -0500, Jason J. Herne wrote:
>> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
>> hugepage but hugepage_madvise() takes the error path when we ask to turn
>> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
>
> I wonder why KVM disables transparent hugepages on s390. It sounds
> weird to disable transparent hugepages with KVM. In fact on x86 we
> call MADV_HUGEPAGE to be sure transparent hugepages are enabled on the
> guest physical memory, even if the transparent_hugepage/enabled ==
> madvise.
KVM on s390 does not support huge pages as of today, as it interfers with storage
keys and other extensions (like cooperative paging). The architectural place to
store these extensions is in the page table extension, which do not exist with
lage pages. We have recently implemented keyless guests and working on additional
changes to allow large pages for guest backing - but not today.
>> new postcopy migration feature to fail on s390 because its first action is
>> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
>> code so that the operation succeeds without error now.
>
> The other way is to change qemu to keep track it already called
> MADV_NOHUGEPAGE and not to call it again. I don't have a strong
> opinion on this, I think it's ok to return 0 but it's a visible change
> to userland, I can't imagine it to break anything though. It sounds
> very unlikely that an app could error out if it notices the kernel
> doesn't error out on the second call of MADV_NOHUGEPAGE.
the sequence of
madvise (something); == ok
madvise (something); == EINVAL;
seems strange. So I think changing the kernel is the better approach.
>
> Glad to hear KVM postcopy live migration is already running on s390 too.
>
> Thanks,
> Andrea
>
>>
>> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Who is going to take this patch? If I should take the patch, I need an
ACK from the memory mgmt folks.
Christian
>> ---
>> mm/huge_memory.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c29ddeb..a8b5347 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
>> /*
>> * Be somewhat over-protective like KSM for now!
>> */
>> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
>> + if (*vm_flags & VM_NO_THP)
>> return -EINVAL;
>> *vm_flags &= ~VM_HUGEPAGE;
>> *vm_flags |= VM_NOHUGEPAGE;
>> --
>> 1.9.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
@ 2015-11-11 19:47 ` Christian Borntraeger
0 siblings, 0 replies; 18+ messages in thread
From: Christian Borntraeger @ 2015-11-11 19:47 UTC (permalink / raw)
To: Andrea Arcangeli, Jason J. Herne; +Cc: linux-s390, linux-mm, KVM list
Am 11.11.2015 um 18:30 schrieb Andrea Arcangeli:
> Hi Jason,
>
> On Wed, Nov 11, 2015 at 10:35:16AM -0500, Jason J. Herne wrote:
>> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
>> hugepage but hugepage_madvise() takes the error path when we ask to turn
>> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
>
> I wonder why KVM disables transparent hugepages on s390. It sounds
> weird to disable transparent hugepages with KVM. In fact on x86 we
> call MADV_HUGEPAGE to be sure transparent hugepages are enabled on the
> guest physical memory, even if the transparent_hugepage/enabled ==
> madvise.
KVM on s390 does not support huge pages as of today, as it interfers with storage
keys and other extensions (like cooperative paging). The architectural place to
store these extensions is in the page table extension, which do not exist with
lage pages. We have recently implemented keyless guests and working on additional
changes to allow large pages for guest backing - but not today.
>> new postcopy migration feature to fail on s390 because its first action is
>> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
>> code so that the operation succeeds without error now.
>
> The other way is to change qemu to keep track it already called
> MADV_NOHUGEPAGE and not to call it again. I don't have a strong
> opinion on this, I think it's ok to return 0 but it's a visible change
> to userland, I can't imagine it to break anything though. It sounds
> very unlikely that an app could error out if it notices the kernel
> doesn't error out on the second call of MADV_NOHUGEPAGE.
the sequence of
madvise (something); == ok
madvise (something); == EINVAL;
seems strange. So I think changing the kernel is the better approach.
>
> Glad to hear KVM postcopy live migration is already running on s390 too.
>
> Thanks,
> Andrea
>
>>
>> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Who is going to take this patch? If I should take the patch, I need an
ACK from the memory mgmt folks.
Christian
>> ---
>> mm/huge_memory.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c29ddeb..a8b5347 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
>> /*
>> * Be somewhat over-protective like KSM for now!
>> */
>> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
>> + if (*vm_flags & VM_NO_THP)
>> return -EINVAL;
>> *vm_flags &= ~VM_HUGEPAGE;
>> *vm_flags |= VM_NOHUGEPAGE;
>> --
>> 1.9.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-11 17:30 ` Andrea Arcangeli
2015-11-11 19:47 ` Christian Borntraeger
@ 2015-11-11 20:01 ` Christian Borntraeger
2015-11-11 20:37 ` Andrea Arcangeli
1 sibling, 1 reply; 18+ messages in thread
From: Christian Borntraeger @ 2015-11-11 20:01 UTC (permalink / raw)
To: Andrea Arcangeli, Jason J. Herne; +Cc: linux-s390, linux-mm, KVM list
Am 11.11.2015 um 18:30 schrieb Andrea Arcangeli:
> Hi Jason,
>
> On Wed, Nov 11, 2015 at 10:35:16AM -0500, Jason J. Herne wrote:
>> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
>> hugepage but hugepage_madvise() takes the error path when we ask to turn
>> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
>
> I wonder why KVM disables transparent hugepages on s390. It sounds
> weird to disable transparent hugepages with KVM. In fact on x86 we
> call MADV_HUGEPAGE to be sure transparent hugepages are enabled on the
> guest physical memory, even if the transparent_hugepage/enabled ==
> madvise.
>
>> new postcopy migration feature to fail on s390 because its first action is
>> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
>> code so that the operation succeeds without error now.
>
> The other way is to change qemu to keep track it already called
> MADV_NOHUGEPAGE and not to call it again. I don't have a strong
> opinion on this, I think it's ok to return 0 but it's a visible change
> to userland, I can't imagine it to break anything though. It sounds
> very unlikely that an app could error out if it notices the kernel
> doesn't error out on the second call of MADV_NOHUGEPAGE.
>
> Glad to hear KVM postcopy live migration is already running on s390 too.
Sometimes....we have some issues with userfaultd, which we currently address.
One place is interesting: the kvm code might have to call fixup_user_fault
for a guest address (to map the page writable). Right now we do not pass
FAULT_FLAG_ALLOW_RETRY, which can trigger a warning like
[ 119.414573] FAULT_FLAG_ALLOW_RETRY missing 1
[ 119.414577] CPU: 42 PID: 12853 Comm: qemu-system-s39 Not tainted 4.3.0+ #315
[ 119.414579] 000000011c4579b8 000000011c457a48 0000000000000002 0000000000000000
000000011c457ae8 000000011c457a60 000000011c457a60 0000000000113e26
00000000000002cf 00000000009feef8 0000000000a1e054 000000000000000b
000000011c457aa8 000000011c457a48 0000000000000000 0000000000000000
0000000000000000 0000000000113e26 000000011c457a48 000000011c457aa8
[ 119.414590] Call Trace:
[ 119.414596] ([<0000000000113d16>] show_trace+0xf6/0x148)
[ 119.414598] [<0000000000113dda>] show_stack+0x72/0xf0
[ 119.414600] [<0000000000551b9e>] dump_stack+0x6e/0x90
[ 119.414605] [<000000000032d168>] handle_userfault+0xe0/0x448
[ 119.414609] [<000000000029a2d4>] handle_mm_fault+0x16e4/0x1798
[ 119.414611] [<00000000002930be>] fixup_user_fault+0x86/0x118
[ 119.414614] [<0000000000126bb8>] gmap_ipte_notify+0xa0/0x170
[ 119.414617] [<000000000013ae90>] kvm_arch_vcpu_ioctl_run+0x448/0xc58
[ 119.414619] [<000000000012e4dc>] kvm_vcpu_ioctl+0x37c/0x668
[ 119.414622] [<00000000002eba68>] do_vfs_ioctl+0x3a8/0x508
[ 119.414624] [<00000000002ebc6c>] SyS_ioctl+0xa4/0xb8
[ 119.414627] [<0000000000815c56>] system_call+0xd6/0x264
[ 119.414629] [<000003ff9628721a>] 0x3ff9628721a
I think we can rework this to use something that sets FAULT_FLAG_ALLOW_RETRY,
but this begs the question if a futex operation on userfault backed memory
would also be broken. The futex code also does fixup_user_fault without
FAULT_FLAG_ALLOW_RETRY as far as I can tell.
Christian
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-11 20:01 ` Christian Borntraeger
@ 2015-11-11 20:37 ` Andrea Arcangeli
0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-11-11 20:37 UTC (permalink / raw)
To: Christian Borntraeger; +Cc: Jason J. Herne, linux-s390, linux-mm, KVM list
On Wed, Nov 11, 2015 at 09:01:44PM +0100, Christian Borntraeger wrote:
> Am 11.11.2015 um 18:30 schrieb Andrea Arcangeli:
> > Hi Jason,
> >
> > On Wed, Nov 11, 2015 at 10:35:16AM -0500, Jason J. Herne wrote:
> >> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> >> hugepage but hugepage_madvise() takes the error path when we ask to turn
> >> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
> >
> > I wonder why KVM disables transparent hugepages on s390. It sounds
> > weird to disable transparent hugepages with KVM. In fact on x86 we
> > call MADV_HUGEPAGE to be sure transparent hugepages are enabled on the
> > guest physical memory, even if the transparent_hugepage/enabled ==
> > madvise.
> >
> >> new postcopy migration feature to fail on s390 because its first action is
> >> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> >> code so that the operation succeeds without error now.
> >
> > The other way is to change qemu to keep track it already called
> > MADV_NOHUGEPAGE and not to call it again. I don't have a strong
> > opinion on this, I think it's ok to return 0 but it's a visible change
> > to userland, I can't imagine it to break anything though. It sounds
> > very unlikely that an app could error out if it notices the kernel
> > doesn't error out on the second call of MADV_NOHUGEPAGE.
> >
> > Glad to hear KVM postcopy live migration is already running on s390 too.
>
> Sometimes....we have some issues with userfaultd, which we currently address.
> One place is interesting: the kvm code might have to call fixup_user_fault
> for a guest address (to map the page writable). Right now we do not pass
> FAULT_FLAG_ALLOW_RETRY, which can trigger a warning like
>
> [ 119.414573] FAULT_FLAG_ALLOW_RETRY missing 1
> [ 119.414577] CPU: 42 PID: 12853 Comm: qemu-system-s39 Not tainted 4.3.0+ #315
> [ 119.414579] 000000011c4579b8 000000011c457a48 0000000000000002 0000000000000000
> 000000011c457ae8 000000011c457a60 000000011c457a60 0000000000113e26
> 00000000000002cf 00000000009feef8 0000000000a1e054 000000000000000b
> 000000011c457aa8 000000011c457a48 0000000000000000 0000000000000000
> 0000000000000000 0000000000113e26 000000011c457a48 000000011c457aa8
> [ 119.414590] Call Trace:
> [ 119.414596] ([<0000000000113d16>] show_trace+0xf6/0x148)
> [ 119.414598] [<0000000000113dda>] show_stack+0x72/0xf0
> [ 119.414600] [<0000000000551b9e>] dump_stack+0x6e/0x90
> [ 119.414605] [<000000000032d168>] handle_userfault+0xe0/0x448
> [ 119.414609] [<000000000029a2d4>] handle_mm_fault+0x16e4/0x1798
> [ 119.414611] [<00000000002930be>] fixup_user_fault+0x86/0x118
> [ 119.414614] [<0000000000126bb8>] gmap_ipte_notify+0xa0/0x170
> [ 119.414617] [<000000000013ae90>] kvm_arch_vcpu_ioctl_run+0x448/0xc58
> [ 119.414619] [<000000000012e4dc>] kvm_vcpu_ioctl+0x37c/0x668
> [ 119.414622] [<00000000002eba68>] do_vfs_ioctl+0x3a8/0x508
> [ 119.414624] [<00000000002ebc6c>] SyS_ioctl+0xa4/0xb8
> [ 119.414627] [<0000000000815c56>] system_call+0xd6/0x264
> [ 119.414629] [<000003ff9628721a>] 0x3ff9628721a
>
> I think we can rework this to use something that sets FAULT_FLAG_ALLOW_RETRY,
> but this begs the question if a futex operation on userfault backed memory
> would also be broken. The futex code also does fixup_user_fault without
> FAULT_FLAG_ALLOW_RETRY as far as I can tell.
That's a good point, but qemu never does futex on the guest physical
memory so can't be a problem for postcopy at least, it's also not
destabilizing in any way (and the stack dump also happens only if you
have DEBUG_VM selected).
The userfaultfd stress test actually could end up using futex on the
userfault memory, but it never triggered anything, it doesn't get to
that fixup_user_fault at runtime.
Still it should be fixed for s390 and futex.
It's probably better to add a fixup_user_fault_unlocked that will work
like get_user_pages_unlocked. I.e. leaves the details of the mmap_sem
locking internally to the function, and will handle VM_FAULT_RETRY
automatically by re-taking the mmap_sem and repeating the
fixup_user_fault after updating the FAULT_FLAG_ALLOW_RETRY to
FAULT_FLAG_TRIED.
Note that the FAULT_FLAG_TRIED logic will need an overhaul soon, as we
must be able to release the mmap_sem in handle_userfault() even if
FAULT_FLAG_TRIED is passed instead of FAULT_FLAG_ALLOW_RETRY for to
reasons:
1) the theoretical scheduler race, with the schedule() not blocking
after setting the state to TASK_KILLABLE and then calling
schedule() will be not an issue anymore (we'll return
VM_FAULT_RETRY and handle_userfault() will be repeated by the
caller immediately later and hopefully eventually schedule() will
block...). We never experienced this race a single time and
thousands of postcopy live migration happened so it seems mostly a
theoretical issue. Even in the worst case it triggers it cannot
corrupt memory but you'll get a sigbus and a stuck dump identical
to what you already experienced above on s390 (not because of the
theoretical scheduler race for s390 case, but the failure point
ends the same).
2) we'll be able to wrprotect pagetables (only with mmap_sem for
reading, without touching vmas) as the program runs, and while
faults are already in progress. If the wrprotection happens after
the kernel already returned VM_FAULT_RETRY of its own because a
page fault was already in flight and it was processed as a
lightweight-locking one, we may enter the write protect tracking
handle_userfault() with FAULT_FLAG_TRIED and we must still be
allowed to block and in turn to drop the mmap_sem and in turn to
return VM_FAULT_RETRY or a VM_FAULT_USERFAULT or some other
flag. In fact it's also fine if the handle_userfault() won't cause
a transition from the caller fault_flags from
FAULT_FLAG_ALLOW_RETRY to FAULT_FLAG_TRIED. I think a new
VM_FAULT_? flag is the way to go and then the caller will know
nothing has really happened and the caller must retry also if it
was already at the FAULT_FLAG_TRIED stage when it entered
handle_mm_fault. Or the caller will avoid the transition to
FAULT_FLAG_TRIED if it was still at the FAULT_FLAG_ALLOW_RETRY
stage.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
@ 2015-11-11 20:37 ` Andrea Arcangeli
0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-11-11 20:37 UTC (permalink / raw)
To: Christian Borntraeger; +Cc: Jason J. Herne, linux-s390, linux-mm, KVM list
On Wed, Nov 11, 2015 at 09:01:44PM +0100, Christian Borntraeger wrote:
> Am 11.11.2015 um 18:30 schrieb Andrea Arcangeli:
> > Hi Jason,
> >
> > On Wed, Nov 11, 2015 at 10:35:16AM -0500, Jason J. Herne wrote:
> >> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> >> hugepage but hugepage_madvise() takes the error path when we ask to turn
> >> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
> >
> > I wonder why KVM disables transparent hugepages on s390. It sounds
> > weird to disable transparent hugepages with KVM. In fact on x86 we
> > call MADV_HUGEPAGE to be sure transparent hugepages are enabled on the
> > guest physical memory, even if the transparent_hugepage/enabled ==
> > madvise.
> >
> >> new postcopy migration feature to fail on s390 because its first action is
> >> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> >> code so that the operation succeeds without error now.
> >
> > The other way is to change qemu to keep track it already called
> > MADV_NOHUGEPAGE and not to call it again. I don't have a strong
> > opinion on this, I think it's ok to return 0 but it's a visible change
> > to userland, I can't imagine it to break anything though. It sounds
> > very unlikely that an app could error out if it notices the kernel
> > doesn't error out on the second call of MADV_NOHUGEPAGE.
> >
> > Glad to hear KVM postcopy live migration is already running on s390 too.
>
> Sometimes....we have some issues with userfaultd, which we currently address.
> One place is interesting: the kvm code might have to call fixup_user_fault
> for a guest address (to map the page writable). Right now we do not pass
> FAULT_FLAG_ALLOW_RETRY, which can trigger a warning like
>
> [ 119.414573] FAULT_FLAG_ALLOW_RETRY missing 1
> [ 119.414577] CPU: 42 PID: 12853 Comm: qemu-system-s39 Not tainted 4.3.0+ #315
> [ 119.414579] 000000011c4579b8 000000011c457a48 0000000000000002 0000000000000000
> 000000011c457ae8 000000011c457a60 000000011c457a60 0000000000113e26
> 00000000000002cf 00000000009feef8 0000000000a1e054 000000000000000b
> 000000011c457aa8 000000011c457a48 0000000000000000 0000000000000000
> 0000000000000000 0000000000113e26 000000011c457a48 000000011c457aa8
> [ 119.414590] Call Trace:
> [ 119.414596] ([<0000000000113d16>] show_trace+0xf6/0x148)
> [ 119.414598] [<0000000000113dda>] show_stack+0x72/0xf0
> [ 119.414600] [<0000000000551b9e>] dump_stack+0x6e/0x90
> [ 119.414605] [<000000000032d168>] handle_userfault+0xe0/0x448
> [ 119.414609] [<000000000029a2d4>] handle_mm_fault+0x16e4/0x1798
> [ 119.414611] [<00000000002930be>] fixup_user_fault+0x86/0x118
> [ 119.414614] [<0000000000126bb8>] gmap_ipte_notify+0xa0/0x170
> [ 119.414617] [<000000000013ae90>] kvm_arch_vcpu_ioctl_run+0x448/0xc58
> [ 119.414619] [<000000000012e4dc>] kvm_vcpu_ioctl+0x37c/0x668
> [ 119.414622] [<00000000002eba68>] do_vfs_ioctl+0x3a8/0x508
> [ 119.414624] [<00000000002ebc6c>] SyS_ioctl+0xa4/0xb8
> [ 119.414627] [<0000000000815c56>] system_call+0xd6/0x264
> [ 119.414629] [<000003ff9628721a>] 0x3ff9628721a
>
> I think we can rework this to use something that sets FAULT_FLAG_ALLOW_RETRY,
> but this begs the question if a futex operation on userfault backed memory
> would also be broken. The futex code also does fixup_user_fault without
> FAULT_FLAG_ALLOW_RETRY as far as I can tell.
That's a good point, but qemu never does futex on the guest physical
memory so can't be a problem for postcopy at least, it's also not
destabilizing in any way (and the stack dump also happens only if you
have DEBUG_VM selected).
The userfaultfd stress test actually could end up using futex on the
userfault memory, but it never triggered anything, it doesn't get to
that fixup_user_fault at runtime.
Still it should be fixed for s390 and futex.
It's probably better to add a fixup_user_fault_unlocked that will work
like get_user_pages_unlocked. I.e. leaves the details of the mmap_sem
locking internally to the function, and will handle VM_FAULT_RETRY
automatically by re-taking the mmap_sem and repeating the
fixup_user_fault after updating the FAULT_FLAG_ALLOW_RETRY to
FAULT_FLAG_TRIED.
Note that the FAULT_FLAG_TRIED logic will need an overhaul soon, as we
must be able to release the mmap_sem in handle_userfault() even if
FAULT_FLAG_TRIED is passed instead of FAULT_FLAG_ALLOW_RETRY for to
reasons:
1) the theoretical scheduler race, with the schedule() not blocking
after setting the state to TASK_KILLABLE and then calling
schedule() will be not an issue anymore (we'll return
VM_FAULT_RETRY and handle_userfault() will be repeated by the
caller immediately later and hopefully eventually schedule() will
block...). We never experienced this race a single time and
thousands of postcopy live migration happened so it seems mostly a
theoretical issue. Even in the worst case it triggers it cannot
corrupt memory but you'll get a sigbus and a stuck dump identical
to what you already experienced above on s390 (not because of the
theoretical scheduler race for s390 case, but the failure point
ends the same).
2) we'll be able to wrprotect pagetables (only with mmap_sem for
reading, without touching vmas) as the program runs, and while
faults are already in progress. If the wrprotection happens after
the kernel already returned VM_FAULT_RETRY of its own because a
page fault was already in flight and it was processed as a
lightweight-locking one, we may enter the write protect tracking
handle_userfault() with FAULT_FLAG_TRIED and we must still be
allowed to block and in turn to drop the mmap_sem and in turn to
return VM_FAULT_RETRY or a VM_FAULT_USERFAULT or some other
flag. In fact it's also fine if the handle_userfault() won't cause
a transition from the caller fault_flags from
FAULT_FLAG_ALLOW_RETRY to FAULT_FLAG_TRIED. I think a new
VM_FAULT_? flag is the way to go and then the caller will know
nothing has really happened and the caller must retry also if it
was already at the FAULT_FLAG_TRIED stage when it entered
handle_mm_fault. Or the caller will avoid the transition to
FAULT_FLAG_TRIED if it was still at the FAULT_FLAG_ALLOW_RETRY
stage.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-11 19:47 ` Christian Borntraeger
@ 2015-11-11 20:42 ` Andrea Arcangeli
-1 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-11-11 20:42 UTC (permalink / raw)
To: Christian Borntraeger; +Cc: Jason J. Herne, linux-s390, linux-mm, KVM list
On Wed, Nov 11, 2015 at 08:47:34PM +0100, Christian Borntraeger wrote:
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Who is going to take this patch? If I should take the patch, I need an
> ACK from the memory mgmt folks.
I would suggest to resend in CC to Andrew to merge in -mm after taking
care of the below, as it's a mm common code part.
>
> Christian
>
>
> >> ---
> >> mm/huge_memory.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index c29ddeb..a8b5347 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >> /*
> >> * Be somewhat over-protective like KSM for now!
> >> */
> >> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
> >> + if (*vm_flags & VM_NO_THP)
> >> return -EINVAL;
> >> *vm_flags &= ~VM_HUGEPAGE;
> >> *vm_flags |= VM_NOHUGEPAGE;
If we make this change the MADV_HUGEPAGE must be taken care of too or
it doesn't make sense to threat them differently.
After taking care of the MADV_HUGEPAGE you can add my reviewed-by when
you resubmit to Andrew.
Thanks,
Andrea
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
@ 2015-11-11 20:42 ` Andrea Arcangeli
0 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2015-11-11 20:42 UTC (permalink / raw)
To: Christian Borntraeger; +Cc: Jason J. Herne, linux-s390, linux-mm, KVM list
On Wed, Nov 11, 2015 at 08:47:34PM +0100, Christian Borntraeger wrote:
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Who is going to take this patch? If I should take the patch, I need an
> ACK from the memory mgmt folks.
I would suggest to resend in CC to Andrew to merge in -mm after taking
care of the below, as it's a mm common code part.
>
> Christian
>
>
> >> ---
> >> mm/huge_memory.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index c29ddeb..a8b5347 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >> /*
> >> * Be somewhat over-protective like KSM for now!
> >> */
> >> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
> >> + if (*vm_flags & VM_NO_THP)
> >> return -EINVAL;
> >> *vm_flags &= ~VM_HUGEPAGE;
> >> *vm_flags |= VM_NOHUGEPAGE;
If we make this change the MADV_HUGEPAGE must be taken care of too or
it doesn't make sense to threat them differently.
After taking care of the MADV_HUGEPAGE you can add my reviewed-by when
you resubmit to Andrew.
Thanks,
Andrea
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-19 8:22 ` Christian Borntraeger
[not found] ` <564D86AE.1010305-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
@ 2015-11-19 9:43 ` Dr. David Alan Gilbert
1 sibling, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2015-11-19 9:43 UTC (permalink / raw)
To: Christian Borntraeger
Cc: Vlastimil Babka, Jason J. Herne, linux-s390, linux-mm, akpm,
aarcange, linux-api, linux-man, qemu-devel, Juan Quintela
* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 11/18/2015 02:31 PM, Vlastimil Babka wrote:
> > [CC += linux-api@vger.kernel.org]
> >
> > Since this is a kernel-user-space API change, please CC linux-api@. The kernel
> > source file Documentation/SubmitChecklist notes that all Linux kernel patches
> > that change userspace interfaces should be CCed to linux-api@vger.kernel.org, so
> > that the various parties who are interested in API changes are informed. For
> > further information, see https://www.kernel.org/doc/man-pages/linux-api-ml.html
> >
> > On 11/12/2015 04:18 PM, Jason J. Herne wrote:
> >> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> >> hugepage but hugepage_madvise() takes the error path when we ask to turn
> >> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
> >> new postcopy migration feature to fail on s390 because its first action is
> >> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> >> code so that the operation succeeds without error now.
> >>
> >> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
> >> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> >
> > Looks like the manpage should be fine, as it wasn't very specific wrt these
> > madvise flags. The only thing that potentially applies is:
> >
> > "EINVAL advice is not a valid."
> >
> > which itself looks like it needs fixing. Valid what, value? As in completely
> > unknown flags, or flags not valid for the given vma?
> >
> > Anyway, I agree that it doesn't make sense to fail madvise when the given flag
> > is already set. On the other hand, I don't think the userspace app should fail
> > just because of madvise failing? It should in general be an advice that the
> > kernel is also strictly speaking free to ignore as it shouldn't affect
> > correctnes, just performance. Yeah, there are exceptions today like
> > MADV_DONTNEED, but that shouldn't apply to hugepages?
> > So I think Qemu needs fixing too.
>
> yes, I agree. David, Juan. I think The postcopy code should not fail if the madvise.
> Can you fix that?
Yes, I can change that.
> Also what happens if the kernel is build
> > without CONFIG_TRANSPARENT_HUGEPAGE? Then madvise also returns EINVAL,
>
> Does it? To me it looks more like we would trigger a kernel bug.
Yes, it does return EINVAL; it's a shame - there's no way to distinguish between
a kernel that doesn't have hugepage and a screwup in the addresses passed;
everything gets squashed into EINVAL.
Dave
> mm/madvise.c:
> case MADV_HUGEPAGE:
> case MADV_NOHUGEPAGE:
> error = hugepage_madvise(vma, &new_flags, behavior); <-----
> if (error)
> goto out;
> break;
> }
>
>
> include/linux/huge_mm.h:
> static inline int hugepage_madvise(struct vm_area_struct *vma,
> unsigned long *vm_flags, int advice)
> {
> BUG();
> return 0;
> }
>
> If we just remove the BUG() statement the code would actually be correct
> in simply ignoring an MADVISE it cannot handle. If you agree, I can
> spin a patch.
>
>
>
>
> > how does Qemu handle that?
>
> The normal qemu startup ignores the return value of the madvise. Only the
> recent post migration changes want to disable huge pages for userfaultd.
> And this code checks the return value. And yes, we should change that
> in QEMU.
>
>
>
>
> >
> >> ---
> >> mm/huge_memory.c | 4 ++--
> >> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index c29ddeb..62fe06b 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2009,7 +2009,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >> /*
> >> * Be somewhat over-protective like KSM for now!
> >> */
> >> - if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> >> + if (*vm_flags & VM_NO_THP)
> >> return -EINVAL;
> >> *vm_flags &= ~VM_NOHUGEPAGE;
> >> *vm_flags |= VM_HUGEPAGE;
> >> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> >> /*
> >> * Be somewhat over-protective like KSM for now!
> >> */
> >> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
> >> + if (*vm_flags & VM_NO_THP)
> >> return -EINVAL;
> >> *vm_flags &= ~VM_HUGEPAGE;
> >> *vm_flags |= VM_NOHUGEPAGE;
> >>
> >
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-19 8:22 ` Christian Borntraeger
@ 2015-11-19 9:31 ` Vlastimil Babka
2015-11-19 9:43 ` Dr. David Alan Gilbert
1 sibling, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2015-11-19 9:31 UTC (permalink / raw)
To: Christian Borntraeger, Jason J. Herne, linux-s390-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
aarcange-H+wXaHxf7aLQT0dZR+AlfA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-man-u79uwXL29TY76Z2rM5mHXA, qemu-devel,
Dr. David Alan Gilbert, Juan Quintela
On 11/19/2015 09:22 AM, Christian Borntraeger wrote:
> On 11/18/2015 02:31 PM, Vlastimil Babka wrote:
>> [CC += linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org]
>> Anyway, I agree that it doesn't make sense to fail madvise when the given flag
>> is already set. On the other hand, I don't think the userspace app should fail
>> just because of madvise failing? It should in general be an advice that the
>> kernel is also strictly speaking free to ignore as it shouldn't affect
>> correctnes, just performance. Yeah, there are exceptions today like
>> MADV_DONTNEED, but that shouldn't apply to hugepages?
>> So I think Qemu needs fixing too.
>
> yes, I agree. David, Juan. I think The postcopy code should not fail if the madvise.
> Can you fix that?
>
> Also what happens if the kernel is build
>> without CONFIG_TRANSPARENT_HUGEPAGE? Then madvise also returns EINVAL,
>
> Does it? To me it looks more like we would trigger a kernel bug.
>
> mm/madvise.c:
> case MADV_HUGEPAGE:
> case MADV_NOHUGEPAGE:
> error = hugepage_madvise(vma, &new_flags, behavior); <-----
> if (error)
> goto out;
> break;
> }
>
>
> include/linux/huge_mm.h:
> static inline int hugepage_madvise(struct vm_area_struct *vma,
> unsigned long *vm_flags, int advice)
> {
> BUG();
> return 0;
> }
>
> If we just remove the BUG() statement the code would actually be correct
> in simply ignoring an MADVISE it cannot handle. If you agree, I can
> spin a patch.
Yeah this looks suspicious at first, but the code is not reachable
thanks to madvise_behavior_valid() returning false:
...
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
#endif
case MADV_DONTDUMP:
case MADV_DODUMP:
return true;
default:
return false;
I think the BUG() is pointless (KSM doesn't use it) but not wrong. I
wouldn't object to removal.
>
>
>> how does Qemu handle that?
>
> The normal qemu startup ignores the return value of the madvise. Only the
> recent post migration changes want to disable huge pages for userfaultd.
> And this code checks the return value. And yes, we should change that
> in QEMU.
Great, thanks :)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
@ 2015-11-19 9:31 ` Vlastimil Babka
0 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2015-11-19 9:31 UTC (permalink / raw)
To: Christian Borntraeger, Jason J. Herne, linux-s390
Cc: linux-mm, akpm, aarcange, linux-api, linux-man, qemu-devel,
Dr. David Alan Gilbert, Juan Quintela
On 11/19/2015 09:22 AM, Christian Borntraeger wrote:
> On 11/18/2015 02:31 PM, Vlastimil Babka wrote:
>> [CC += linux-api@vger.kernel.org]
>> Anyway, I agree that it doesn't make sense to fail madvise when the given flag
>> is already set. On the other hand, I don't think the userspace app should fail
>> just because of madvise failing? It should in general be an advice that the
>> kernel is also strictly speaking free to ignore as it shouldn't affect
>> correctnes, just performance. Yeah, there are exceptions today like
>> MADV_DONTNEED, but that shouldn't apply to hugepages?
>> So I think Qemu needs fixing too.
>
> yes, I agree. David, Juan. I think The postcopy code should not fail if the madvise.
> Can you fix that?
>
> Also what happens if the kernel is build
>> without CONFIG_TRANSPARENT_HUGEPAGE? Then madvise also returns EINVAL,
>
> Does it? To me it looks more like we would trigger a kernel bug.
>
> mm/madvise.c:
> case MADV_HUGEPAGE:
> case MADV_NOHUGEPAGE:
> error = hugepage_madvise(vma, &new_flags, behavior); <-----
> if (error)
> goto out;
> break;
> }
>
>
> include/linux/huge_mm.h:
> static inline int hugepage_madvise(struct vm_area_struct *vma,
> unsigned long *vm_flags, int advice)
> {
> BUG();
> return 0;
> }
>
> If we just remove the BUG() statement the code would actually be correct
> in simply ignoring an MADVISE it cannot handle. If you agree, I can
> spin a patch.
Yeah this looks suspicious at first, but the code is not reachable
thanks to madvise_behavior_valid() returning false:
...
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
#endif
case MADV_DONTDUMP:
case MADV_DODUMP:
return true;
default:
return false;
I think the BUG() is pointless (KSM doesn't use it) but not wrong. I
wouldn't object to removal.
>
>
>> how does Qemu handle that?
>
> The normal qemu startup ignores the return value of the madvise. Only the
> recent post migration changes want to disable huge pages for userfaultd.
> And this code checks the return value. And yes, we should change that
> in QEMU.
Great, thanks :)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-18 13:31 ` Vlastimil Babka
(?)
@ 2015-11-19 8:22 ` Christian Borntraeger
[not found] ` <564D86AE.1010305-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2015-11-19 9:43 ` Dr. David Alan Gilbert
-1 siblings, 2 replies; 18+ messages in thread
From: Christian Borntraeger @ 2015-11-19 8:22 UTC (permalink / raw)
To: Vlastimil Babka, Jason J. Herne, linux-s390
Cc: linux-mm, akpm, aarcange, linux-api, linux-man, qemu-devel,
Dr. David Alan Gilbert, Juan Quintela
On 11/18/2015 02:31 PM, Vlastimil Babka wrote:
> [CC += linux-api@vger.kernel.org]
>
> Since this is a kernel-user-space API change, please CC linux-api@. The kernel
> source file Documentation/SubmitChecklist notes that all Linux kernel patches
> that change userspace interfaces should be CCed to linux-api@vger.kernel.org, so
> that the various parties who are interested in API changes are informed. For
> further information, see https://www.kernel.org/doc/man-pages/linux-api-ml.html
>
> On 11/12/2015 04:18 PM, Jason J. Herne wrote:
>> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
>> hugepage but hugepage_madvise() takes the error path when we ask to turn
>> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
>> new postcopy migration feature to fail on s390 because its first action is
>> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
>> code so that the operation succeeds without error now.
>>
>> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
>> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
>
> Looks like the manpage should be fine, as it wasn't very specific wrt these
> madvise flags. The only thing that potentially applies is:
>
> "EINVAL advice is not a valid."
>
> which itself looks like it needs fixing. Valid what, value? As in completely
> unknown flags, or flags not valid for the given vma?
>
> Anyway, I agree that it doesn't make sense to fail madvise when the given flag
> is already set. On the other hand, I don't think the userspace app should fail
> just because of madvise failing? It should in general be an advice that the
> kernel is also strictly speaking free to ignore as it shouldn't affect
> correctnes, just performance. Yeah, there are exceptions today like
> MADV_DONTNEED, but that shouldn't apply to hugepages?
> So I think Qemu needs fixing too.
yes, I agree. David, Juan. I think The postcopy code should not fail if the madvise.
Can you fix that?
Also what happens if the kernel is build
> without CONFIG_TRANSPARENT_HUGEPAGE? Then madvise also returns EINVAL,
Does it? To me it looks more like we would trigger a kernel bug.
mm/madvise.c:
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
error = hugepage_madvise(vma, &new_flags, behavior); <-----
if (error)
goto out;
break;
}
include/linux/huge_mm.h:
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
BUG();
return 0;
}
If we just remove the BUG() statement the code would actually be correct
in simply ignoring an MADVISE it cannot handle. If you agree, I can
spin a patch.
> how does Qemu handle that?
The normal qemu startup ignores the return value of the madvise. Only the
recent post migration changes want to disable huge pages for userfaultd.
And this code checks the return value. And yes, we should change that
in QEMU.
>
>> ---
>> mm/huge_memory.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c29ddeb..62fe06b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2009,7 +2009,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
>> /*
>> * Be somewhat over-protective like KSM for now!
>> */
>> - if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
>> + if (*vm_flags & VM_NO_THP)
>> return -EINVAL;
>> *vm_flags &= ~VM_NOHUGEPAGE;
>> *vm_flags |= VM_HUGEPAGE;
>> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
>> /*
>> * Be somewhat over-protective like KSM for now!
>> */
>> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
>> + if (*vm_flags & VM_NO_THP)
>> return -EINVAL;
>> *vm_flags &= ~VM_HUGEPAGE;
>> *vm_flags |= VM_NOHUGEPAGE;
>>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-12 15:18 Jason J. Herne
@ 2015-11-18 13:31 ` Vlastimil Babka
2015-11-13 22:58 ` David Rientjes
[not found] ` <1447341516-18076-1-git-send-email-jjherne-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2015-11-18 13:31 UTC (permalink / raw)
To: Jason J. Herne, linux-s390-u79uwXL29TY76Z2rM5mHXA
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
aarcange-H+wXaHxf7aLQT0dZR+AlfA,
borntraeger-tA70FqPdS9bQT0dZR+AlfA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-man-u79uwXL29TY76Z2rM5mHXA
[CC += linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org]
Since this is a kernel-user-space API change, please CC linux-api@. The kernel
source file Documentation/SubmitChecklist notes that all Linux kernel patches
that change userspace interfaces should be CCed to linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, so
that the various parties who are interested in API changes are informed. For
further information, see https://www.kernel.org/doc/man-pages/linux-api-ml.html
On 11/12/2015 04:18 PM, Jason J. Herne wrote:
> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> hugepage but hugepage_madvise() takes the error path when we ask to turn
> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
> new postcopy migration feature to fail on s390 because its first action is
> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> code so that the operation succeeds without error now.
>
> Signed-off-by: Jason J. Herne <jjherne-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> Reviewed-by: Andrea Arcangeli <aarcange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Looks like the manpage should be fine, as it wasn't very specific wrt these
madvise flags. The only thing that potentially applies is:
"EINVAL advice is not a valid."
which itself looks like it needs fixing. Valid what, value? As in completely
unknown flags, or flags not valid for the given vma?
Anyway, I agree that it doesn't make sense to fail madvise when the given flag
is already set. On the other hand, I don't think the userspace app should fail
just because of madvise failing? It should in general be an advice that the
kernel is also strictly speaking free to ignore as it shouldn't affect
correctnes, just performance. Yeah, there are exceptions today like
MADV_DONTNEED, but that shouldn't apply to hugepages?
So I think Qemu needs fixing too. Also what happens if the kernel is build
without CONFIG_TRANSPARENT_HUGEPAGE? Then madvise also returns EINVAL, how does
Qemu handle that?
> ---
> mm/huge_memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c29ddeb..62fe06b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2009,7 +2009,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> /*
> * Be somewhat over-protective like KSM for now!
> */
> - if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> + if (*vm_flags & VM_NO_THP)
> return -EINVAL;
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> /*
> * Be somewhat over-protective like KSM for now!
> */
> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
> + if (*vm_flags & VM_NO_THP)
> return -EINVAL;
> *vm_flags &= ~VM_HUGEPAGE;
> *vm_flags |= VM_NOHUGEPAGE;
>
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
@ 2015-11-18 13:31 ` Vlastimil Babka
0 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2015-11-18 13:31 UTC (permalink / raw)
To: Jason J. Herne, linux-s390
Cc: linux-mm, akpm, aarcange, borntraeger, linux-api, linux-man
[CC += linux-api@vger.kernel.org]
Since this is a kernel-user-space API change, please CC linux-api@. The kernel
source file Documentation/SubmitChecklist notes that all Linux kernel patches
that change userspace interfaces should be CCed to linux-api@vger.kernel.org, so
that the various parties who are interested in API changes are informed. For
further information, see https://www.kernel.org/doc/man-pages/linux-api-ml.html
On 11/12/2015 04:18 PM, Jason J. Herne wrote:
> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> hugepage but hugepage_madvise() takes the error path when we ask to turn
> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
> new postcopy migration feature to fail on s390 because its first action is
> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> code so that the operation succeeds without error now.
>
> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Looks like the manpage should be fine, as it wasn't very specific wrt these
madvise flags. The only thing that potentially applies is:
"EINVAL advice is not a valid."
which itself looks like it needs fixing. Valid what, value? As in completely
unknown flags, or flags not valid for the given vma?
Anyway, I agree that it doesn't make sense to fail madvise when the given flag
is already set. On the other hand, I don't think the userspace app should fail
just because of madvise failing? It should in general be an advice that the
kernel is also strictly speaking free to ignore as it shouldn't affect
correctnes, just performance. Yeah, there are exceptions today like
MADV_DONTNEED, but that shouldn't apply to hugepages?
So I think Qemu needs fixing too. Also what happens if the kernel is build
without CONFIG_TRANSPARENT_HUGEPAGE? Then madvise also returns EINVAL, how does
Qemu handle that?
> ---
> mm/huge_memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c29ddeb..62fe06b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2009,7 +2009,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> /*
> * Be somewhat over-protective like KSM for now!
> */
> - if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> + if (*vm_flags & VM_NO_THP)
> return -EINVAL;
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> /*
> * Be somewhat over-protective like KSM for now!
> */
> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
> + if (*vm_flags & VM_NO_THP)
> return -EINVAL;
> *vm_flags &= ~VM_HUGEPAGE;
> *vm_flags |= VM_NOHUGEPAGE;
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-12 15:18 Jason J. Herne
2015-11-12 16:45 ` Christian Borntraeger
@ 2015-11-13 22:58 ` David Rientjes
[not found] ` <1447341516-18076-1-git-send-email-jjherne-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2 siblings, 0 replies; 18+ messages in thread
From: David Rientjes @ 2015-11-13 22:58 UTC (permalink / raw)
To: Jason J. Herne; +Cc: linux-s390, linux-mm, akpm, aarcange, borntraeger
On Thu, 12 Nov 2015, Jason J. Herne wrote:
> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> hugepage but hugepage_madvise() takes the error path when we ask to turn
> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
> new postcopy migration feature to fail on s390 because its first action is
> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> code so that the operation succeeds without error now.
>
> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
2015-11-12 15:18 Jason J. Herne
@ 2015-11-12 16:45 ` Christian Borntraeger
2015-11-13 22:58 ` David Rientjes
[not found] ` <1447341516-18076-1-git-send-email-jjherne-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2 siblings, 0 replies; 18+ messages in thread
From: Christian Borntraeger @ 2015-11-12 16:45 UTC (permalink / raw)
To: Jason J. Herne, linux-s390; +Cc: linux-mm, akpm, aarcange
On 11/12/2015 04:18 PM, Jason J. Herne wrote:
> MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
> hugepage but hugepage_madvise() takes the error path when we ask to turn
> on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
> new postcopy migration feature to fail on s390 because its first action is
> to madvise the guest address space as NOHUGEPAGE. This patch modifies the
> code so that the operation succeeds without error now.
maybe add
"For consistency reasons do the same for MADV_HUGEPAGE."
>
> Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Andrew, can you queue this patch?
> ---
> mm/huge_memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c29ddeb..62fe06b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2009,7 +2009,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> /*
> * Be somewhat over-protective like KSM for now!
> */
> - if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
> + if (*vm_flags & VM_NO_THP)
> return -EINVAL;
> *vm_flags &= ~VM_NOHUGEPAGE;
> *vm_flags |= VM_HUGEPAGE;
> @@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
> /*
> * Be somewhat over-protective like KSM for now!
> */
> - if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
> + if (*vm_flags & VM_NO_THP)
> return -EINVAL;
> *vm_flags &= ~VM_HUGEPAGE;
> *vm_flags |= VM_NOHUGEPAGE;
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
@ 2015-11-12 15:18 Jason J. Herne
2015-11-12 16:45 ` Christian Borntraeger
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Jason J. Herne @ 2015-11-12 15:18 UTC (permalink / raw)
To: linux-s390; +Cc: linux-mm, akpm, aarcange, borntraeger, Jason J. Herne
MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
hugepage but hugepage_madvise() takes the error path when we ask to turn
on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
new postcopy migration feature to fail on s390 because its first action is
to madvise the guest address space as NOHUGEPAGE. This patch modifies the
code so that the operation succeeds without error now.
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/huge_memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c29ddeb..62fe06b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2009,7 +2009,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
/*
* Be somewhat over-protective like KSM for now!
*/
- if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP))
+ if (*vm_flags & VM_NO_THP)
return -EINVAL;
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
@@ -2025,7 +2025,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
/*
* Be somewhat over-protective like KSM for now!
*/
- if (*vm_flags & (VM_NOHUGEPAGE | VM_NO_THP))
+ if (*vm_flags & VM_NO_THP)
return -EINVAL;
*vm_flags &= ~VM_HUGEPAGE;
*vm_flags |= VM_NOHUGEPAGE;
--
1.9.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 18+ messages in thread
end of thread, other threads:[~2015-11-19 9:43 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-11 15:35 [PATCH] mm: Loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390 Jason J. Herne
2015-11-11 17:30 ` Andrea Arcangeli
2015-11-11 19:47 ` Christian Borntraeger
2015-11-11 19:47 ` Christian Borntraeger
2015-11-11 20:42 ` Andrea Arcangeli
2015-11-11 20:42 ` Andrea Arcangeli
2015-11-11 20:01 ` Christian Borntraeger
2015-11-11 20:37 ` Andrea Arcangeli
2015-11-11 20:37 ` Andrea Arcangeli
2015-11-12 15:18 Jason J. Herne
2015-11-12 16:45 ` Christian Borntraeger
2015-11-13 22:58 ` David Rientjes
[not found] ` <1447341516-18076-1-git-send-email-jjherne-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2015-11-18 13:31 ` Vlastimil Babka
2015-11-18 13:31 ` Vlastimil Babka
2015-11-19 8:22 ` Christian Borntraeger
[not found] ` <564D86AE.1010305-tA70FqPdS9bQT0dZR+AlfA@public.gmane.org>
2015-11-19 9:31 ` Vlastimil Babka
2015-11-19 9:31 ` Vlastimil Babka
2015-11-19 9:43 ` Dr. David Alan Gilbert
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.