* [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-03-15 16:07 ` James Morse
From: James Morse @ 2017-03-15 16:07 UTC
  To: gengdongjiu
  Cc: punit.agrawal, Marc Zyngier, Tyler Baicar, linux-arm-kernel, kvmarm

Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
broken memory can call memory_failure() in mm/memory-failure.c to deliver
SIGBUS to any user space process using the page, and notify all the
in-kernel users.

If the page corresponded with guest memory, KVM will unmap this page
from its stage2 page tables. The user space process that allocated
this memory may have never touched this page, in which case it may not
be mapped, meaning SIGBUS won't be delivered.

When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
comes to process the stage2 fault.

Do as x86 does, and deliver the SIGBUS when we discover
KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
as this matches the user space mapping size.

Signed-off-by: James Morse <james.morse@arm.com>
CC: gengdongjiu <gengdj.1984@gmail.com>
---
 Without this patch both kvmtool and Qemu exit as the KVM_RUN ioctl() returns
 EFAULT.
 QEMU: error: kvm run failed Bad address
 LVKM: KVM_RUN failed: Bad address

 With this patch both kvmtool and Qemu receive SIGBUS ... and then exit.
 In the future Qemu can use this signal to notify the guest; for more details
 see hwpoison[1].

 [0] https://www.spinics.net/lists/arm-kernel/msg560009.html
 [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/hwpoison.txt
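
 For illustration, here is a minimal sketch of how a user space VMM could
 consume this signal (not part of the patch; the handler and helper names are
 hypothetical, only the BUS_MCEERR_AR / si_addr / si_addr_lsb semantics come
 from sigaction(2) and hwpoison[1]):

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Hypothetical handler, installed with SA_SIGINFO. memory_failure() may
	 * also send BUS_MCEERR_AO ("action optional") to early-kill users. */
	static void hwpoison_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
			/* si_addr_lsb is the log2 of the corrupted mapping size. */
			uintptr_t len   = (uintptr_t)1 << si->si_addr_lsb;
			uintptr_t start = (uintptr_t)si->si_addr & ~(len - 1);

			/* A VMM would translate [start, start + len) back to guest
			 * physical addresses and inject an error into the guest.
			 * fprintf() is not async-signal-safe; sketch only. */
			fprintf(stderr, "hwpoison: %zu bytes lost at %p\n",
				(size_t)len, (void *)start);
		}
	}

	static void install_hwpoison_handler(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= hwpoison_handler,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGBUS, &sa, NULL);
	}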


 arch/arm/kvm/mmu.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 962616fd4ddd..9d1aa294e88f 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -20,8 +20,10 @@
 #include <linux/kvm_host.h>
 #include <linux/io.h>
 #include <linux/hugetlb.h>
+#include <linux/sched/signal.h>
 #include <trace/events/kvm.h>
 #include <asm/pgalloc.h>
+#include <asm/siginfo.h>
 #include <asm/cacheflush.h>
 #include <asm/kvm_arm.h>
 #include <asm/kvm_mmu.h>
@@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
 	__coherent_cache_guest_page(vcpu, pfn, size);
 }
 
+static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
+{
+	siginfo_t info;
+
+	info.si_signo   = SIGBUS;
+	info.si_errno   = 0;
+	info.si_code    = BUS_MCEERR_AR;
+	info.si_addr    = (void __user *)address;
+
+	if (hugetlb)
+		info.si_addr_lsb = PMD_SHIFT;
+	else
+		info.si_addr_lsb = PAGE_SHIFT;
+
+	send_sig_info(SIGBUS, &info, current);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
@@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	smp_rmb();
 
 	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
+	if (pfn == KVM_PFN_ERR_HWPOISON) {
+		kvm_send_hwpoison_signal(hva, hugetlb);
+		return 0;
+	}
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;
 
-- 
2.10.1

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-15 16:07 ` James Morse
@ 2017-03-17 15:06   ` Punit Agrawal
From: Punit Agrawal @ 2017-03-17 15:06 UTC
  To: James Morse
  Cc: Marc Zyngier, Tyler Baicar, kvmarm, linux-arm-kernel, gengdongjiu

Hi James,

One comment at the end.

James Morse <james.morse@arm.com> writes:

> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> SIGBUS to any user space process using the page, and notify all the
> in-kernel users.
>
> If the page corresponded with guest memory, KVM will unmap this page
> from its stage2 page tables. The user space process that allocated
> this memory may have never touched this page in which case it may not
> be mapped meaning SIGBUS won't be delivered.
>
> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
> comes to process the stage2 fault.
>
> Do as x86 does, and deliver the SIGBUS when we discover
> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> as this matches the user space mapping size.
>
> Signed-off-by: James Morse <james.morse@arm.com>
> CC: gengdongjiu <gengdj.1984@gmail.com>
> ---
>  Without this patch both kvmtool and Qemu exit as the KVM_RUN ioctl() returns
>  EFAULT.
>  QEMU: error: kvm run failed Bad address
>  LVKM: KVM_RUN failed: Bad address
>
>  With this patch both kvmtool and Qemu receive SIGBUS ... and then exit.
>  In the future Qemu can use this signal to notify the guest, for more details
>  see hwpoison[1].
>
>  [0] https://www.spinics.net/lists/arm-kernel/msg560009.html
>  [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/hwpoison.txt
>
>
>  arch/arm/kvm/mmu.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 962616fd4ddd..9d1aa294e88f 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -20,8 +20,10 @@
>  #include <linux/kvm_host.h>
>  #include <linux/io.h>
>  #include <linux/hugetlb.h>
> +#include <linux/sched/signal.h>
>  #include <trace/events/kvm.h>
>  #include <asm/pgalloc.h>
> +#include <asm/siginfo.h>
>  #include <asm/cacheflush.h>
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_mmu.h>
> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>  	__coherent_cache_guest_page(vcpu, pfn, size);
>  }
>  
> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
> +{
> +	siginfo_t info;
> +
> +	info.si_signo   = SIGBUS;
> +	info.si_errno   = 0;
> +	info.si_code    = BUS_MCEERR_AR;
> +	info.si_addr    = (void __user *)address;
> +
> +	if (hugetlb)
> +		info.si_addr_lsb = PMD_SHIFT;
> +	else
> +		info.si_addr_lsb = PAGE_SHIFT;
> +
> +	send_sig_info(SIGBUS, &info, current);
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	smp_rmb();
>  
>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> +		kvm_send_hwpoison_signal(hva, hugetlb);
> +		return 0;
> +	}
>  	if (is_error_noslot_pfn(pfn))
>  		return -EFAULT;

The changes look good to me. Though, as mentioned in the commit log, we
are in essence not doing anything different from x86 here. Is it worth
moving kvm_send_hwpoison_signal to an architecture-agnostic location and
using it from there?

In any case, FWIW,

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>

Thanks.

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-17 15:06   ` Punit Agrawal
@ 2017-03-17 15:48     ` James Morse
From: James Morse @ 2017-03-17 15:48 UTC
  To: Punit Agrawal
  Cc: Marc Zyngier, Tyler Baicar, kvmarm, linux-arm-kernel, gengdongjiu

Hi Punit,

On 17/03/17 15:06, Punit Agrawal wrote:
> James Morse <james.morse@arm.com> writes:
>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>> SIGBUS to any user space process using the page, and notify all the
>> in-kernel users.
>>
>> If the page corresponded with guest memory, KVM will unmap this page
>> from its stage2 page tables. The user space process that allocated
>> this memory may have never touched this page in which case it may not
>> be mapped meaning SIGBUS won't be delivered.
>>
>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>> comes to process the stage2 fault.
>>
>> Do as x86 does, and deliver the SIGBUS when we discover
>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>> as this matches the user space mapping size.

>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 962616fd4ddd..9d1aa294e88f 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>>  }
>>  
>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>> +{
>> +	siginfo_t info;
>> +
>> +	info.si_signo   = SIGBUS;
>> +	info.si_errno   = 0;
>> +	info.si_code    = BUS_MCEERR_AR;
>> +	info.si_addr    = (void __user *)address;
>> +
>> +	if (hugetlb)
>> +		info.si_addr_lsb = PMD_SHIFT;
>> +	else
>> +		info.si_addr_lsb = PAGE_SHIFT;
>> +
>> +	send_sig_info(SIGBUS, &info, current);
>> +}

> The changes look good to me. Though, as mentioned in the commit log, we
> are in essence not doing anything different from x86 here. Is it worth
> moving kvm_send_hwpoison_signal to an architecture-agnostic location and
> using it from there?

I had an earlier version that did exactly that, but the x86 version always
reports PAGE_SHIFT as the si_addr_lsb value. I don't know enough about their
version of stage2 to know if that's a bug or an implementation detail, so I
chose not to copy it.
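
For reference, the x86 helper (arch/x86/kvm/mmu.c at the time) was roughly the
sketch below (reconstructed rather than quoted, so treat the exact signature as
an approximation); the point is the unconditional PAGE_SHIFT:

	static void kvm_send_hwpoison_signal(unsigned long address,
					     struct task_struct *tsk)
	{
		siginfo_t info;

		info.si_signo    = SIGBUS;
		info.si_errno    = 0;
		info.si_code     = BUS_MCEERR_AR;
		info.si_addr     = (void __user *)address;
		/* Always page size, whatever the host mapping size was. */
		info.si_addr_lsb = PAGE_SHIFT;

		send_sig_info(SIGBUS, &info, tsk);
	}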


> In any case, FWIW,
> 
> Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>

Thanks!

James

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-15 16:07 ` James Morse
@ 2017-03-24 18:30   ` Christoffer Dall
From: Christoffer Dall @ 2017-03-24 18:30 UTC
  To: James Morse
  Cc: punit.agrawal, Marc Zyngier, Tyler Baicar, linux-arm-kernel,
	kvmarm, gengdongjiu

On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> SIGBUS to any user space process using the page, and notify all the
> in-kernel users.
> 
> If the page corresponded with guest memory, KVM will unmap this page
> from its stage2 page tables. The user space process that allocated
> this memory may have never touched this page in which case it may not
> be mapped meaning SIGBUS won't be delivered.
> 
> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
> comes to process the stage2 fault.
> 
> Do as x86 does, and deliver the SIGBUS when we discover
> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> as this matches the user space mapping size.
> 
> Signed-off-by: James Morse <james.morse@arm.com>
> CC: gengdongjiu <gengdj.1984@gmail.com>
> ---
>  Without this patch both kvmtool and Qemu exit as the KVM_RUN ioctl() returns
>  EFAULT.
>  QEMU: error: kvm run failed Bad address
>  LVKM: KVM_RUN failed: Bad address
> 
>  With this patch both kvmtool and Qemu receive SIGBUS ... and then exit.
>  In the future Qemu can use this signal to notify the guest, for more details
>  see hwpoison[1].
> 
>  [0] https://www.spinics.net/lists/arm-kernel/msg560009.html
>  [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/hwpoison.txt
> 
> 
>  arch/arm/kvm/mmu.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 962616fd4ddd..9d1aa294e88f 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -20,8 +20,10 @@
>  #include <linux/kvm_host.h>
>  #include <linux/io.h>
>  #include <linux/hugetlb.h>
> +#include <linux/sched/signal.h>
>  #include <trace/events/kvm.h>
>  #include <asm/pgalloc.h>
> +#include <asm/siginfo.h>
>  #include <asm/cacheflush.h>
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_mmu.h>
> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>  	__coherent_cache_guest_page(vcpu, pfn, size);
>  }
>  
> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
> +{
> +	siginfo_t info;
> +
> +	info.si_signo   = SIGBUS;
> +	info.si_errno   = 0;
> +	info.si_code    = BUS_MCEERR_AR;
> +	info.si_addr    = (void __user *)address;
> +
> +	if (hugetlb)
> +		info.si_addr_lsb = PMD_SHIFT;
> +	else
> +		info.si_addr_lsb = PAGE_SHIFT;
> +
> +	send_sig_info(SIGBUS, &info, current);
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	smp_rmb();
>  
>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> +		kvm_send_hwpoison_signal(hva, hugetlb);

The way this is called means that we'll only notify userspace of a huge
mapping if userspace is mapping hugetlbfs, and not because the stage2
mapping may or may not have used transparent huge pages when the error
was discovered.  Is this the desired semantics?

Also notice that the hva is not necessarily aligned to the beginning of
the huge page, so can we be giving userspace wrong information by
pointing in the middle of a huge page and telling it there was an
address error in the size of the PMD ?

> +		return 0;
> +	}
>  	if (is_error_noslot_pfn(pfn))
>  		return -EFAULT;
>  
> -- 
> 2.10.1
> 

Thanks,
-Christoffer

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-24 18:30   ` Christoffer Dall
@ 2017-03-27 11:20     ` Punit Agrawal
From: Punit Agrawal @ 2017-03-27 11:20 UTC
  To: Christoffer Dall
  Cc: Marc Zyngier, Tyler Baicar, linux-arm-kernel, kvmarm, gengdongjiu

Christoffer Dall <cdall@linaro.org> writes:

> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>> SIGBUS to any user space process using the page, and notify all the
>> in-kernel users.
>> 
>> If the page corresponded with guest memory, KVM will unmap this page
>> from its stage2 page tables. The user space process that allocated
>> this memory may have never touched this page in which case it may not
>> be mapped meaning SIGBUS won't be delivered.
>> 
>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>> comes to process the stage2 fault.
>> 
>> Do as x86 does, and deliver the SIGBUS when we discover
>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>> as this matches the user space mapping size.
>> 
>> Signed-off-by: James Morse <james.morse@arm.com>
>> CC: gengdongjiu <gengdj.1984@gmail.com>
>> ---
>>  Without this patch both kvmtool and Qemu exit as the KVM_RUN ioctl() returns
>>  EFAULT.
>>  QEMU: error: kvm run failed Bad address
>>  LVKM: KVM_RUN failed: Bad address
>> 
>>  With this patch both kvmtool and Qemu receive SIGBUS ... and then exit.
>>  In the future Qemu can use this signal to notify the guest, for more details
>>  see hwpoison[1].
>> 
>>  [0] https://www.spinics.net/lists/arm-kernel/msg560009.html
>>  [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/hwpoison.txt
>> 
>> 
>>  arch/arm/kvm/mmu.c | 23 +++++++++++++++++++++++
>>  1 file changed, 23 insertions(+)
>> 
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 962616fd4ddd..9d1aa294e88f 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -20,8 +20,10 @@
>>  #include <linux/kvm_host.h>
>>  #include <linux/io.h>
>>  #include <linux/hugetlb.h>
>> +#include <linux/sched/signal.h>
>>  #include <trace/events/kvm.h>
>>  #include <asm/pgalloc.h>
>> +#include <asm/siginfo.h>
>>  #include <asm/cacheflush.h>
>>  #include <asm/kvm_arm.h>
>>  #include <asm/kvm_mmu.h>
>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>>  }
>>  
>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>> +{
>> +	siginfo_t info;
>> +
>> +	info.si_signo   = SIGBUS;
>> +	info.si_errno   = 0;
>> +	info.si_code    = BUS_MCEERR_AR;
>> +	info.si_addr    = (void __user *)address;
>> +
>> +	if (hugetlb)
>> +		info.si_addr_lsb = PMD_SHIFT;
>> +	else
>> +		info.si_addr_lsb = PAGE_SHIFT;
>> +
>> +	send_sig_info(SIGBUS, &info, current);
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>>  			  unsigned long fault_status)
>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	smp_rmb();
>>  
>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> +		kvm_send_hwpoison_signal(hva, hugetlb);
>
> The way this is called means that we'll only notify userspace of a huge
> mapping if userspace is mapping hugetlbfs, and not because the stage2
> mapping may or may not have used transparent huge pages when the error
> was discovered.  Is this the desired semantics?

I think so.

AFAIUI, transparent hugepages are split before being poisoned while all
the underlying pages of a hugepage are poisoned together, i.e., no
splitting.

>
> Also notice that the hva is not necessarily aligned to the beginning of
> the huge page, so can we be giving userspace wrong information by
> pointing in the middle of a huge page and telling it there was an
> address error in the size of the PMD ?
>

I could be reading it wrong but I think we are fine here - the address
(hva) is the location that faulted. And the lsb indicates the least
significant bit of the faulting address (See man sigaction(2)). The
receiver of the signal is expected to use the address and lsb to work out
the extent of corruption.

Though I missed a subtlety while reviewing the patch before. The
reported lsb should be for the userspace hugepage mapping (i.e., hva)
and not for the stage 2.

So in the case of hugepages the value of lsb should be -

huge_page_shift(hstate_vma(vma))

as the kernel supports more than just PMD size hugepages.

Does that make sense?
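
A minimal sketch of what that could look like (illustration only; passing the
lsb in is an assumed interface change, with the value computed where
user_mem_abort() already has the vma under mmap_sem):

	/* lsb computed from the userspace vma, e.g.:
	 *	if (is_vm_hugetlb_page(vma))
	 *		lsb = huge_page_shift(hstate_vma(vma));
	 *	else
	 *		lsb = PAGE_SHIFT;
	 */
	static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
	{
		siginfo_t info;

		info.si_signo    = SIGBUS;
		info.si_errno    = 0;
		info.si_code     = BUS_MCEERR_AR;
		info.si_addr     = (void __user *)address;
		info.si_addr_lsb = lsb;

		send_sig_info(SIGBUS, &info, current);
	}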

In light of this, I'd like to retract my Reviewed-by tag for this
version of the patch as I believe we'll need to change the lsb
reporting.

Thanks,
Punit

>> +		return 0;
>> +	}
>>  	if (is_error_noslot_pfn(pfn))
>>  		return -EFAULT;
>>  
>> -- 
>> 2.10.1
>> 
>
> Thanks,
> -Christoffer

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-27 11:20     ` Punit Agrawal
@ 2017-03-27 12:00       ` James Morse
From: James Morse @ 2017-03-27 12:00 UTC
  To: Punit Agrawal, Christoffer Dall
  Cc: Marc Zyngier, Tyler Baicar, linux-arm-kernel, kvmarm, gengdongjiu

Hi guys,

On 27/03/17 12:20, Punit Agrawal wrote:
> Christoffer Dall <cdall@linaro.org> writes:
>> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
>>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
>>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>>> SIGBUS to any user space process using the page, and notify all the
>>> in-kernel users.
>>>
>>> If the page corresponded with guest memory, KVM will unmap this page
>>> from its stage2 page tables. The user space process that allocated
>>> this memory may have never touched this page in which case it may not
>>> be mapped meaning SIGBUS won't be delivered.
>>>
>>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>>> comes to process the stage2 fault.
>>>
>>> Do as x86 does, and deliver the SIGBUS when we discover
>>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>>> as this matches the user space mapping size.

>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index 962616fd4ddd..9d1aa294e88f 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> @@ -20,8 +20,10 @@
>>>  #include <linux/kvm_host.h>
>>>  #include <linux/io.h>
>>>  #include <linux/hugetlb.h>
>>> +#include <linux/sched/signal.h>
>>>  #include <trace/events/kvm.h>
>>>  #include <asm/pgalloc.h>
>>> +#include <asm/siginfo.h>
>>>  #include <asm/cacheflush.h>
>>>  #include <asm/kvm_arm.h>
>>>  #include <asm/kvm_mmu.h>
>>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>>>  }
>>>  
>>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>>> +{
>>> +	siginfo_t info;
>>> +
>>> +	info.si_signo   = SIGBUS;
>>> +	info.si_errno   = 0;
>>> +	info.si_code    = BUS_MCEERR_AR;
>>> +	info.si_addr    = (void __user *)address;
>>> +
>>> +	if (hugetlb)
>>> +		info.si_addr_lsb = PMD_SHIFT;
>>> +	else
>>> +		info.si_addr_lsb = PAGE_SHIFT;
>>> +
>>> +	send_sig_info(SIGBUS, &info, current);
>>> +}
>>> +
>>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>>>  			  unsigned long fault_status)
>>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>  	smp_rmb();
>>>  
>>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>>> +		kvm_send_hwpoison_signal(hva, hugetlb);
>>
>> The way this is called means that we'll only notify userspace of a huge
>> mapping if userspace is mapping hugetlbfs, and not because the stage2
>> mapping may or may not have used transparent huge pages when the error
>> was discovered.  Is this the desired semantics?

No,


> I think so.
>
> AFAIUI, transparent hugepages are split before being poisoned while all
> the underlying pages of a hugepage are poisoned together, i.e., no
> splitting.

In which case I need to look into this some more!

My thinking was we should report the size that was knocked out of the stage2 to
avoid the guest repeatedly faulting until it has touched every guest-page-size
in the stage2 hole.

Reading the code in kvm/mmu.c, it looked like the mapping sizes would always
be the same as those used by userspace.

If the page was split before KVM could have taken this fault, I assumed it would
fault on the page-size mapping and hugetlb would be false. (Which is already
wrong for another reason: it looks like I grabbed the variable before
transparent_hugepage_adjust() has had a go at it.)


>> Also notice that the hva is not necessarily aligned to the beginning of
>> the huge page, so can we be giving userspace wrong information by
>> pointing in the middle of a huge page and telling it there was an
>> address error in the size of the PMD ?
>>
> 
> I could be reading it wrong but I think we are fine here - the address
> (hva) is the location that faulted. And the lsb indicates the least
> significant bit of the faulting address (See man sigaction(2)). The
> receiver of the signal is expected to use the address and lsb to work out
> the extent of corruption.

kill_proc() in mm/memory-failure.c does this too, but the address is set by
page_address_in_vma() in add_to_kill() of the same file. (I'll chat with Punit
off list.)


> Though I missed a subtlety while reviewing the patch before. The
> reported lsb should be for the userspace hugepage mapping (i.e., hva)
> and not for the stage 2.

I thought these were always supposed to be the same, and using hugetlb was a bug
because I didn't look closely enough at what is_vm_hugetlb_page() does.


> In light of this, I'd like to retract my Reviewed-by tag for this
> version of the patch as I believe we'll need to change the lsb
> reporting.

Sure, let's work out what this should be doing. I'm beginning to suspect x86's
'always page size' was correct to begin with!


Thanks,

James

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-27 12:00       ` James Morse
@ 2017-03-27 12:44         ` Christoffer Dall
From: Christoffer Dall @ 2017-03-27 12:44 UTC
  To: James Morse
  Cc: Tyler Baicar, Marc Zyngier, Punit Agrawal, linux-arm-kernel,
	kvmarm, gengdongjiu

On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
> Hi guys,
> 
> On 27/03/17 12:20, Punit Agrawal wrote:
> > Christoffer Dall <cdall@linaro.org> writes:
> >> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
> >>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
> >>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> >>> SIGBUS to any user space process using the page, and notify all the
> >>> in-kernel users.
> >>>
> >>> If the page corresponded with guest memory, KVM will unmap this page
> >>> from its stage2 page tables. The user space process that allocated
> >>> this memory may have never touched this page in which case it may not
> >>> be mapped meaning SIGBUS won't be delivered.
> >>>
> >>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
> >>> comes to process the stage2 fault.
> >>>
> >>> Do as x86 does, and deliver the SIGBUS when we discover
> >>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> >>> as this matches the user space mapping size.
> 
> >>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >>> index 962616fd4ddd..9d1aa294e88f 100644
> >>> --- a/arch/arm/kvm/mmu.c
> >>> +++ b/arch/arm/kvm/mmu.c
> >>> @@ -20,8 +20,10 @@
> >>>  #include <linux/kvm_host.h>
> >>>  #include <linux/io.h>
> >>>  #include <linux/hugetlb.h>
> >>> +#include <linux/sched/signal.h>
> >>>  #include <trace/events/kvm.h>
> >>>  #include <asm/pgalloc.h>
> >>> +#include <asm/siginfo.h>
> >>>  #include <asm/cacheflush.h>
> >>>  #include <asm/kvm_arm.h>
> >>>  #include <asm/kvm_mmu.h>
> >>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> >>>  	__coherent_cache_guest_page(vcpu, pfn, size);
> >>>  }
> >>>  
> >>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
> >>> +{
> >>> +	siginfo_t info;
> >>> +
> >>> +	info.si_signo   = SIGBUS;
> >>> +	info.si_errno   = 0;
> >>> +	info.si_code    = BUS_MCEERR_AR;
> >>> +	info.si_addr    = (void __user *)address;
> >>> +
> >>> +	if (hugetlb)
> >>> +		info.si_addr_lsb = PMD_SHIFT;
> >>> +	else
> >>> +		info.si_addr_lsb = PAGE_SHIFT;
> >>> +
> >>> +	send_sig_info(SIGBUS, &info, current);
> >>> +}
> >>> +
> >>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> >>>  			  unsigned long fault_status)
> >>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>>  	smp_rmb();
> >>>  
> >>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> >>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> >>> +		kvm_send_hwpoison_signal(hva, hugetlb);
> >>
> >> The way this is called means that we'll only notify userspace of a huge
> >> mapping if userspace is mapping hugetlbfs, and not because the stage2
> >> mapping may or may not have used transparent huge pages when the error
> >> was discovered.  Is this the desired semantics?
> 
> No,
> 
> 
> > I think so.
> >
> > AFAIUI, transparent hugepages are split before being poisoned while all
> > the underlying pages of a hugepage are poisoned together, i.e., no
> > splitting.
> 
> In which case I need to look into this some more!
> 
> My thinking was we should report the size that was knocked out of the stage2 to
> avoid the guest repeatedly faulting until it has touched every guest-page-size
> in the stage2 hole.

By signaling something at the fault path, I think it's going to be very
hard to back-track what the stage 2 page tables looked like when faults
started happening, because I think these are completely decoupled events
(the mmu notifier and the later fault).

> 
> Reading the code in that kvm/mmu.c it looked like the mapping sizes would always
> be the same as those used by userspace.

I think the mapping sizes should be the same between userspace and KVM,
but the mapping size of a particular page (and associated pages) may
vary over time.

> 
> If the page was split before KVM could have taken this fault I assumed it would
> fault on the page-size mapping and hugetlb would be false.

I think you could have a huge page, which gets unmapped as a result of
it getting split (perhaps because there was a failure on one page), and
later, as you fault, you can discover a range which can be backed by
hugetlbfs or transparent huge pages.

The question that I don't know the answer to is how Linux behaves if a
page is marked with hwpoison in that case: if Linux never supports THP
and always marks an entire hugetlbfs huge page with the poison, then I
think we're mostly good here.  If not, we should make sure we align with
whatever the rest of the kernel does.

> (which is already
> wrong for another reason: it looks like I grabbed the variable before
> transparent_hugepage_adjust() has had a go at it.)
> 

yes, which is why I asked if you only care about hugetlbfs.

> 
> >> Also notice that the hva is not necessarily aligned to the beginning of
> >> the huge page, so can we be giving userspace wrong information by
> >> pointing in the middle of a huge page and telling it there was an
> >> address error in the size of the PMD ?
> >>
> > 
> > I could be reading it wrong but I think we are fine here - the address
> > (hva) is the location that faulted. And the lsb indicates the least
> > significant bit of the faulting address (See man sigaction(2)). The
> > receiver of the signal is expected to use the address and lsb to workout
> > the extent of corruption.
> 
> kill_proc() in mm/memory-failure.c does this too, but the address is set by
> page_address_in_vma() in add_to_kill() of the same file. (I'll chat with Punit
> off list.)
> 
> 
> > Though I missed a subtlety while reviewing the patch before. The
> > reported lsb should be for the userspace hugepage mapping (i.e., hva)
> > and not for the stage 2.
> 
> I thought these were always supposed to be the same, and using hugetlb was a bug
> because I didn't look closely enough at what is_vm_hugetlb_page() does.
> 
> 
> > In light of this, I'd like to retract my Reviewed-by tag for this
> > version of the patch as I believe we'll need to change the lsb
> > reporting.
> 
> Sure, let's work out what this should be doing. I'm beginning to suspect x86's
> 'always page size' was correct to begin with!
> 

I had a sense of that too, but it would be good to understand how you
mark an individual page within a hugetlbfs huge page with hwpoison...

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-27 12:44         ` Christoffer Dall
@ 2017-03-27 13:31           ` Punit Agrawal
  -1 siblings, 0 replies; 34+ messages in thread
From: Punit Agrawal @ 2017-03-27 13:31 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Marc Zyngier, Tyler Baicar, linux-arm-kernel, kvmarm, gengdongjiu

Christoffer Dall <cdall@linaro.org> writes:

> On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
>> Hi guys,
>> 
>> On 27/03/17 12:20, Punit Agrawal wrote:
>> > Christoffer Dall <cdall@linaro.org> writes:
>> >> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
>> >>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
>> >>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>> >>> SIGBUS to any user space process using the page, and notify all the
>> >>> in-kernel users.
>> >>>
>> >>> If the page corresponded with guest memory, KVM will unmap this page
>> >>> from its stage2 page tables. The user space process that allocated
>> >>> this memory may have never touched this page in which case it may not
>> >>> be mapped meaning SIGBUS won't be delivered.
>> >>>
>> >>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>> >>> comes to process the stage2 fault.
>> >>>
>> >>> Do as x86 does, and deliver the SIGBUS when we discover
>> >>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>> >>> as this matches the user space mapping size.
>> 
>> >>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> >>> index 962616fd4ddd..9d1aa294e88f 100644
>> >>> --- a/arch/arm/kvm/mmu.c
>> >>> +++ b/arch/arm/kvm/mmu.c
>> >>> @@ -20,8 +20,10 @@
>> >>>  #include <linux/kvm_host.h>
>> >>>  #include <linux/io.h>
>> >>>  #include <linux/hugetlb.h>
>> >>> +#include <linux/sched/signal.h>
>> >>>  #include <trace/events/kvm.h>
>> >>>  #include <asm/pgalloc.h>
>> >>> +#include <asm/siginfo.h>
>> >>>  #include <asm/cacheflush.h>
>> >>>  #include <asm/kvm_arm.h>
>> >>>  #include <asm/kvm_mmu.h>
>> >>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>> >>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>> >>>  }
>> >>>  
>> >>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>> >>> +{
>> >>> +	siginfo_t info;
>> >>> +
>> >>> +	info.si_signo   = SIGBUS;
>> >>> +	info.si_errno   = 0;
>> >>> +	info.si_code    = BUS_MCEERR_AR;
>> >>> +	info.si_addr    = (void __user *)address;
>> >>> +
>> >>> +	if (hugetlb)
>> >>> +		info.si_addr_lsb = PMD_SHIFT;
>> >>> +	else
>> >>> +		info.si_addr_lsb = PAGE_SHIFT;
>> >>> +
>> >>> +	send_sig_info(SIGBUS, &info, current);
>> >>> +}
>> >>> +
>> >>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> >>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>> >>>  			  unsigned long fault_status)
>> >>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> >>>  	smp_rmb();
>> >>>  
>> >>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> >>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> >>> +		kvm_send_hwpoison_signal(hva, hugetlb);
>> >>
>> >> The way this is called means that we'll only notify userspace of a huge
>> >> mapping if userspace is mapping hugetlbfs, and not because the stage2
>> >> mapping may or may not have used transparent huge pages when the error
>> >> was discovered.  Is this the desired semantics?
>> 
>> No,
>> 
>> 
>> > I think so.
>> >
>> > AFAIUI, transparent hugepages are split before being poisoned while all
>> > the underlying pages of a hugepage are poisoned together, i.e., no
>> > splitting.
>> 
>> In which case I need to look into this some more!
>> 
>> My thinking was we should report the size that was knocked out of the stage2 to
>> avoid the guest repeatedly faulting until it has touched every guest-page-size
>> in the stage2 hole.
>
> By signaling something at the fault path, I think it's going to be very
> hard to backtrack how the stage 2 page tables looked like when faults
> started happening, because I think these are completely decoupled events
> (the mmu notifier and the later fault).
>
>> 
>> Reading the code in that kvm/mmu.c it looked like the mapping sizes would always
>> be the same as those used by userspace.
>
> I think the mapping sizes should be the same between userspace and KVM,
> but the mapping size of a particular page (and associated pages) may
> vary over time.

Stage 1 and Stage 2 support different hugepage sizes. A larger size
stage 1 page maps to multiple stage 2 page table entries. For stage 1,
we support PUD_SIZE, CONT_PMD_SIZE, PMD_SIZE and CONT_PTE_SIZE while
only PMD_SIZE is supported for Stage 2.

>
>> 
>> If the page was split before KVM could have taken this fault I assumed it would
>> fault on the page-size mapping and hugetlb would be false.
>
> I think you could have a huge page, which gets unmapped as a result on
> it getting split (perhaps because there was a failure on one page) and
> later as you fault, you can discover a range which can be a hugetlbfs or
> transparent huge pages.
>
> The question that I don't know is how Linux behaves if a page is marked
> with hwpoison, in that case, if Linux never supports THP and always
> marks an entire huge page in a hugetlbfs with the poison, then I think
> we're mostly good here.  If not, we should make sure we align with
> whatever the rest of the kernel does.

AFAICT, a hugetlbfs page is poisoned as a whole while thp is split
before poisoning. Quoting the comment near the top of memory_failure() in
mm/memory-failure.c.

    /*
     * Currently errors on hugetlbfs pages are measured in hugepage units,
     * so nr_pages should be 1 << compound_order.  OTOH when errors are on
     * transparent hugepages, they are supposed to be split and error
     * measurement is done in normal page units.  So nr_pages should be one
     * in this case.
     */

>
>> (which is already
>> wrong for another reason, looks like I grabbed the variable before
>> transparent_hugepage_adjust() has had a go at it.).
>> 
>
> yes, which is why I asked if you only care about hugetlbfs.
>

Based on the comment above, we should never get a poisoned page that is
part of a transparent hugepage.

>> 
>> >> Also notice that the hva is not necessarily aligned to the beginning of
>> >> the huge page, so can we be giving userspace wrong information by
>> >> pointing in the middle of a huge page and telling it there was an
>> >> address error in the size of the PMD ?
>> >>
>> > 
>> > I could be reading it wrong but I think we are fine here - the address
>> > (hva) is the location that faulted. And the lsb indicates the least
>> > significant bit of the faulting address (See man sigaction(2)). The
>> > receiver of the signal is expected to use the address and lsb to work out
>> > the extent of corruption.
>> 
>> kill_proc() in mm/memory-failure.c does this too, but the address is set by
>> page_address_in_vma() in add_to_kill() of the same file. (I'll chat with Punit
>> off list.)
>> 
>> 
>> > Though I missed a subtlety while reviewing the patch before. The
>> > reported lsb should be for the userspace hugepage mapping (i.e., hva)
>> > and not for the stage 2.
>> 
>> I thought these were always supposed to be the same, and using hugetlb was a bug
>> because I didn't look closely enough at what is_vm_hugetlb_page() does.

See above.

>> 
>> 
>> > In light of this, I'd like to retract my Reviewed-by tag for this
>> > version of the patch as I believe we'll need to change the lsb
>> > reporting.
>> 
>> Sure, let's work out what this should be doing. I'm beginning to suspect x86's
>> 'always page size' was correct to begin with!
>> 
>
> I had a sense of that too, but it would be good to understand how you
> mark an individual page within a hugetlbfs huge page with hwpoison...

I don't think it is possible to mark an individual page in a hugetlbfs
page - it's all or nothing.

AFAICT, the SIGBUS report is for user mappings and doesn't have to care
whether it's Stage 2 hugetlb page or thp. And the lsb determination should
take the Stage 1 hugepage size into account - something along the lines
of the snippet from previous email.
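
For illustration only (this is not the snippet referenced above; the
helper name is made up and it assumes the caller takes mmap_sem for
read around the vma lookup, using the hugetlb.h helpers mmu.c already
includes), the idea is roughly:

    static short hwpoison_addr_lsb(unsigned long hva)
    {
            struct vm_area_struct *vma = find_vma(current->mm, hva);

            /* Report the size of the userspace mapping backing hva */
            if (vma && is_vm_hugetlb_page(vma))
                    return huge_page_shift(hstate_vma(vma));

            return PAGE_SHIFT;
    }

kvm_send_hwpoison_signal() could then use the returned shift for
si_addr_lsb instead of choosing between PMD_SHIFT and PAGE_SHIFT.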

Hope that makes sense.

Punit

>
> Thanks,
> -Christoffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-27 13:31           ` Punit Agrawal
@ 2017-03-27 13:38             ` Marc Zyngier
  -1 siblings, 0 replies; 34+ messages in thread
From: Marc Zyngier @ 2017-03-27 13:38 UTC (permalink / raw)
  To: Punit Agrawal, Christoffer Dall
  Cc: Tyler Baicar, linux-arm-kernel, kvmarm, gengdongjiu

On 27/03/17 14:31, Punit Agrawal wrote:
> Christoffer Dall <cdall@linaro.org> writes:
> 
>> On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
>>> Hi guys,
>>>
>>> On 27/03/17 12:20, Punit Agrawal wrote:
>>>> Christoffer Dall <cdall@linaro.org> writes:
>>>>> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
>>>>>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
>>>>>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>>>>>> SIGBUS to any user space process using the page, and notify all the
>>>>>> in-kernel users.
>>>>>>
>>>>>> If the page corresponded with guest memory, KVM will unmap this page
>>>>>> from its stage2 page tables. The user space process that allocated
>>>>>> this memory may have never touched this page in which case it may not
>>>>>> be mapped meaning SIGBUS won't be delivered.
>>>>>>
>>>>>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>>>>>> comes to process the stage2 fault.
>>>>>>
>>>>>> Do as x86 does, and deliver the SIGBUS when we discover
>>>>>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>>>>>> as this matches the user space mapping size.
>>>
>>>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>>>> index 962616fd4ddd..9d1aa294e88f 100644
>>>>>> --- a/arch/arm/kvm/mmu.c
>>>>>> +++ b/arch/arm/kvm/mmu.c
>>>>>> @@ -20,8 +20,10 @@
>>>>>>  #include <linux/kvm_host.h>
>>>>>>  #include <linux/io.h>
>>>>>>  #include <linux/hugetlb.h>
>>>>>> +#include <linux/sched/signal.h>
>>>>>>  #include <trace/events/kvm.h>
>>>>>>  #include <asm/pgalloc.h>
>>>>>> +#include <asm/siginfo.h>
>>>>>>  #include <asm/cacheflush.h>
>>>>>>  #include <asm/kvm_arm.h>
>>>>>>  #include <asm/kvm_mmu.h>
>>>>>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>>>>>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>>>>>>  }
>>>>>>  
>>>>>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>>>>>> +{
>>>>>> +	siginfo_t info;
>>>>>> +
>>>>>> +	info.si_signo   = SIGBUS;
>>>>>> +	info.si_errno   = 0;
>>>>>> +	info.si_code    = BUS_MCEERR_AR;
>>>>>> +	info.si_addr    = (void __user *)address;
>>>>>> +
>>>>>> +	if (hugetlb)
>>>>>> +		info.si_addr_lsb = PMD_SHIFT;
>>>>>> +	else
>>>>>> +		info.si_addr_lsb = PAGE_SHIFT;
>>>>>> +
>>>>>> +	send_sig_info(SIGBUS, &info, current);
>>>>>> +}
>>>>>> +
>>>>>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>>>>>>  			  unsigned long fault_status)
>>>>>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>>  	smp_rmb();
>>>>>>  
>>>>>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>>>>>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>>>>>> +		kvm_send_hwpoison_signal(hva, hugetlb);
>>>>>
>>>>> The way this is called means that we'll only notify userspace of a huge
>>>>> mapping if userspace is mapping hugetlbfs, and not because the stage2
>>>>> mapping may or may not have used transparent huge pages when the error
>>>>> was discovered.  Is this the desired semantics?
>>>
>>> No,
>>>
>>>
>>>> I think so.
>>>>
>>>> AFAIUI, transparent hugepages are split before being poisoned while all
>>>> the underlying pages of a hugepage are poisoned together, i.e., no
>>>> splitting.
>>>
>>> In which case I need to look into this some more!
>>>
>>> My thinking was we should report the size that was knocked out of the stage2 to
>>> avoid the guest repeatedly faulting until it has touched every guest-page-size
>>> in the stage2 hole.
>>
>> By signaling something at the fault path, I think it's going to be very
>> hard to backtrack how the stage 2 page tables looked like when faults
>> started happening, because I think these are completely decoupled events
>> (the mmu notifier and the later fault).
>>
>>>
>>> Reading the code in that kvm/mmu.c it looked like the mapping sizes would always
>>> be the same as those used by userspace.
>>
>> I think the mapping sizes should be the same between userspace and KVM,
>> but the mapping size of a particular page (and associated pages) may
>> vary over time.
> 
> Stage 1 and Stage 2 support different hugepage sizes. A larger size
> stage 1 page maps to multiple stage 2 page table entries. For stage 1,
> we support PUD_SIZE, CONT_PMD_SIZE, PMD_SIZE and CONT_PTE_SIZE while
> only PMD_SIZE is supported for Stage 2.

What is stage-1 doing here? We have no idea about what stage-1 is doing
(not under KVM's control). Or do you mean userspace instead?

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-27 13:38             ` Marc Zyngier
@ 2017-03-27 14:04               ` Punit Agrawal
  -1 siblings, 0 replies; 34+ messages in thread
From: Punit Agrawal @ 2017-03-27 14:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Christoffer Dall, Tyler Baicar, linux-arm-kernel, kvmarm, gengdongjiu

Marc Zyngier <marc.zyngier@arm.com> writes:

> On 27/03/17 14:31, Punit Agrawal wrote:
>> Christoffer Dall <cdall@linaro.org> writes:
>> 
>>> On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
>>>> Hi guys,
>>>>
>>>> On 27/03/17 12:20, Punit Agrawal wrote:
>>>>> Christoffer Dall <cdall@linaro.org> writes:
>>>>>> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
>>>>>>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
>>>>>>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>>>>>>> SIGBUS to any user space process using the page, and notify all the
>>>>>>> in-kernel users.
>>>>>>>
>>>>>>> If the page corresponded with guest memory, KVM will unmap this page
>>>>>>> from its stage2 page tables. The user space process that allocated
>>>>>>> this memory may have never touched this page in which case it may not
>>>>>>> be mapped meaning SIGBUS won't be delivered.
>>>>>>>
>>>>>>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>>>>>>> comes to process the stage2 fault.
>>>>>>>
>>>>>>> Do as x86 does, and deliver the SIGBUS when we discover
>>>>>>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>>>>>>> as this matches the user space mapping size.
>>>>
>>>>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>>>>> index 962616fd4ddd..9d1aa294e88f 100644
>>>>>>> --- a/arch/arm/kvm/mmu.c
>>>>>>> +++ b/arch/arm/kvm/mmu.c
>>>>>>> @@ -20,8 +20,10 @@
>>>>>>>  #include <linux/kvm_host.h>
>>>>>>>  #include <linux/io.h>
>>>>>>>  #include <linux/hugetlb.h>
>>>>>>> +#include <linux/sched/signal.h>
>>>>>>>  #include <trace/events/kvm.h>
>>>>>>>  #include <asm/pgalloc.h>
>>>>>>> +#include <asm/siginfo.h>
>>>>>>>  #include <asm/cacheflush.h>
>>>>>>>  #include <asm/kvm_arm.h>
>>>>>>>  #include <asm/kvm_mmu.h>
>>>>>>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>>>>>>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>>>>>>>  }
>>>>>>>  
>>>>>>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>>>>>>> +{
>>>>>>> +	siginfo_t info;
>>>>>>> +
>>>>>>> +	info.si_signo   = SIGBUS;
>>>>>>> +	info.si_errno   = 0;
>>>>>>> +	info.si_code    = BUS_MCEERR_AR;
>>>>>>> +	info.si_addr    = (void __user *)address;
>>>>>>> +
>>>>>>> +	if (hugetlb)
>>>>>>> +		info.si_addr_lsb = PMD_SHIFT;
>>>>>>> +	else
>>>>>>> +		info.si_addr_lsb = PAGE_SHIFT;
>>>>>>> +
>>>>>>> +	send_sig_info(SIGBUS, &info, current);
>>>>>>> +}
>>>>>>> +
>>>>>>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>>>>>>>  			  unsigned long fault_status)
>>>>>>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>>>>>  	smp_rmb();
>>>>>>>  
>>>>>>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>>>>>>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>>>>>>> +		kvm_send_hwpoison_signal(hva, hugetlb);
>>>>>>
>>>>>> The way this is called means that we'll only notify userspace of a huge
>>>>>> mapping if userspace is mapping hugetlbfs, and not because the stage2
>>>>>> mapping may or may not have used transparent huge pages when the error
>>>>>> was discovered.  Is this the desired semantics?
>>>>
>>>> No,
>>>>
>>>>
>>>>> I think so.
>>>>>
>>>>> AFAIUI, transparent hugepages are split before being poisoned while all
>>>>> the underlying pages of a hugepage are poisoned together, i.e., no
>>>>> splitting.
>>>>
>>>> In which case I need to look into this some more!
>>>>
>>>> My thinking was we should report the size that was knocked out of the stage2 to
>>>> avoid the guest repeatedly faulting until it has touched every guest-page-size
>>>> in the stage2 hole.
>>>
>>> By signaling something at the fault path, I think it's going to be very
>>> hard to backtrack how the stage 2 page tables looked like when faults
>>> started happening, because I think these are completely decoupled events
>>> (the mmu notifier and the later fault).
>>>
>>>>
>>>> Reading the code in that kvm/mmu.c it looked like the mapping sizes would always
>>>> be the same as those used by userspace.
>>>
>>> I think the mapping sizes should be the same between userspace and KVM,
>>> but the mapping size of a particular page (and associated pages) may
>>> vary over time.
>> 
>> Stage 1 and Stage 2 support different hugepage sizes. A larger size
>> stage 1 page maps to multiple stage 2 page table entries. For stage 1,
>> we support PUD_SIZE, CONT_PMD_SIZE, PMD_SIZE and CONT_PTE_SIZE while
>> only PMD_SIZE is supported for Stage 2.
>
> What is stage-1 doing here? We have no idea about what stage-1 is doing
> (not under KVM's control). Or do you mean userspace instead?

I mean userspace here. Sorry for the confusion.

>
> Thanks,
>
> 	M.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-27 13:31           ` Punit Agrawal
@ 2017-03-27 14:47             ` Christoffer Dall
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoffer Dall @ 2017-03-27 14:47 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Marc Zyngier, Tyler Baicar, linux-arm-kernel, kvmarm, gengdongjiu

On Mon, Mar 27, 2017 at 02:31:44PM +0100, Punit Agrawal wrote:
> Christoffer Dall <cdall@linaro.org> writes:
> 
> > On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
> >> Hi guys,
> >> 
> >> On 27/03/17 12:20, Punit Agrawal wrote:
> >> > Christoffer Dall <cdall@linaro.org> writes:
> >> >> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
> >> >>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
> >> >>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> >> >>> SIGBUS to any user space process using the page, and notify all the
> >> >>> in-kernel users.
> >> >>>
> >> >>> If the page corresponded with guest memory, KVM will unmap this page
> >> >>> from its stage2 page tables. The user space process that allocated
> >> >>> this memory may have never touched this page in which case it may not
> >> >>> be mapped meaning SIGBUS won't be delivered.
> >> >>>
> >> >>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
> >> >>> comes to process the stage2 fault.
> >> >>>
> >> >>> Do as x86 does, and deliver the SIGBUS when we discover
> >> >>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> >> >>> as this matches the user space mapping size.
> >> 
> >> >>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >> >>> index 962616fd4ddd..9d1aa294e88f 100644
> >> >>> --- a/arch/arm/kvm/mmu.c
> >> >>> +++ b/arch/arm/kvm/mmu.c
> >> >>> @@ -20,8 +20,10 @@
> >> >>>  #include <linux/kvm_host.h>
> >> >>>  #include <linux/io.h>
> >> >>>  #include <linux/hugetlb.h>
> >> >>> +#include <linux/sched/signal.h>
> >> >>>  #include <trace/events/kvm.h>
> >> >>>  #include <asm/pgalloc.h>
> >> >>> +#include <asm/siginfo.h>
> >> >>>  #include <asm/cacheflush.h>
> >> >>>  #include <asm/kvm_arm.h>
> >> >>>  #include <asm/kvm_mmu.h>
> >> >>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> >> >>>  	__coherent_cache_guest_page(vcpu, pfn, size);
> >> >>>  }
> >> >>>  
> >> >>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
> >> >>> +{
> >> >>> +	siginfo_t info;
> >> >>> +
> >> >>> +	info.si_signo   = SIGBUS;
> >> >>> +	info.si_errno   = 0;
> >> >>> +	info.si_code    = BUS_MCEERR_AR;
> >> >>> +	info.si_addr    = (void __user *)address;
> >> >>> +
> >> >>> +	if (hugetlb)
> >> >>> +		info.si_addr_lsb = PMD_SHIFT;
> >> >>> +	else
> >> >>> +		info.si_addr_lsb = PAGE_SHIFT;
> >> >>> +
> >> >>> +	send_sig_info(SIGBUS, &info, current);
> >> >>> +}
> >> >>> +
> >> >>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> >>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> >> >>>  			  unsigned long fault_status)
> >> >>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> >>>  	smp_rmb();
> >> >>>  
> >> >>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> >> >>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> >> >>> +		kvm_send_hwpoison_signal(hva, hugetlb);
> >> >>
> >> >> The way this is called means that we'll only notify userspace of a huge
> >> >> mapping if userspace is mapping hugetlbfs, and not because the stage2
> >> >> mapping may or may not have used transparent huge pages when the error
> >> >> was discovered.  Is this the desired semantics?
> >> 
> >> No,
> >> 
> >> 
> >> > I think so.
> >> >
> >> > AFAIUI, transparent hugepages are split before being poisoned while all
> >> > the underlying pages of a hugepage are poisoned together, i.e., no
> >> > splitting.
> >> 
> >> In which case I need to look into this some more!
> >> 
> >> My thinking was we should report the size that was knocked out of the stage2 to
> >> avoid the guest repeatedly faulting until it has touched every guest-page-size
> >> in the stage2 hole.
> >
> > By signaling something at the fault path, I think it's going to be very
> > hard to backtrack how the stage 2 page tables looked like when faults
> > started happening, because I think these are completely decoupled events
> > (the mmu notifier and the later fault).
> >
> >> 
> >> Reading the code in that kvm/mmu.c it looked like the mapping sizes would always
> >> be the same as those used by userspace.
> >
> > I think the mapping sizes should be the same between userspace and KVM,
> > but the mapping size of a particular page (and associated pages) may
> > vary over time.
> 
> Stage 1 and Stage 2 support different hugepage sizes. A larger size
> stage 1 page maps to multiple stage 2 page table entries. For stage 1,
> we support PUD_SIZE, CONT_PMD_SIZE, PMD_SIZE and CONT_PTE_SIZE while
> only PMD_SIZE is supported for Stage 2.
> 
> >
> >> 
> >> If the page was split before KVM could have taken this fault I assumed it would
> >> fault on the page-size mapping and hugetlb would be false.
> >
> > I think you could have a huge page, which gets unmapped as a result on
> > it getting split (perhaps because there was a failure on one page) and
> > later as you fault, you can discover a range which can be a hugetlbfs or
> > transparent huge pages.
> >
> > The question that I don't know is how Linux behaves if a page is marked
> > with hwpoison, in that case, if Linux never supports THP and always
> > marks an entire huge page in a hugetlbfs with the poison, then I think
> > we're mostly good here.  If not, we should make sure we align with
> > whatever the rest of the kernel does.
> 
> AFAICT, a hugetlbfs page is poisoned as a whole while thp is split
> before poisoning. Quoting the comment near the top of memory_failure() in
> mm/memory-failure.c.
> 
>     /*
>      * Currently errors on hugetlbfs pages are measured in hugepage units,
>      * so nr_pages should be 1 << compound_order.  OTOH when errors are on
>      * transparent hugepages, they are supposed to be split and error
>      * measurement is done in normal page units.  So nr_pages should be one
>      * in this case.
>      */
> 
> >
> >> (which is already
> >> wrong for another reason, looks like I grabbed the variable before
> >> transparent_hugepage_adjust() has had a go at it.).
> >> 
> >
> > yes, which is why I asked if you only care about hugetlbfs.
> >
> 
> Based on the comment above, we should never get a poisoned page that is
> part of a transparent hugepage.
> 
> >> 
> >> >> Also notice that the hva is not necessarily aligned to the beginning of
> >> >> the huge page, so can we be giving userspace wrong information by
> >> >> pointing in the middle of a huge page and telling it there was an
> >> >> address error in the size of the PMD ?
> >> >>
> >> > 
> >> > I could be reading it wrong but I think we are fine here - the address
> >> > (hva) is the location that faulted. And the lsb indicates the least
> >> > significant bit of the faulting address (See man sigaction(2)). The
> >> > receiver of the signal is expected to use the address and lsb to work out
> >> > the extent of corruption.
> >> 
> >> kill_proc() in mm/memory-failure.c does this too, but the address is set by
> >> page_address_in_vma() in add_to_kill() of the same file. (I'll chat with Punit
> >> off list.)
> >> 
> >> 
> >> > Though I missed a subtlety while reviewing the patch before. The
> >> > reported lsb should be for the userspace hugepage mapping (i.e., hva)
> >> > and not for the stage 2.
> >> 
> >> I thought these were always supposed to be the same, and using hugetlb was a bug
> >> because I didn't look closely enough at what is_vm_hugetlb_page() does.
> 
> See above.
> 
> >> 
> >> 
> >> > In light of this, I'd like to retract my Reviewed-by tag for this
> >> > version of the patch as I believe we'll need to change the lsb
> >> > reporting.
> >> 
> >> Sure, let's work out what this should be doing. I'm beginning to suspect x86's
> >> 'always page size' was correct to begin with!
> >> 
> >
> > I had a sense of that too, but it would be good to understand how you
> > mark an individual page within a hugetlbfs huge page with hwpoison...
> 
> I don't think it is possible to mark an individual page in a hugetlbfs
> page - it's all or nothing.
> 
> AFAICT, the SIGBUS report is for user mappings and doesn't have to care
> whether it's Stage 2 hugetlb page or thp. And the lsb determination should
> take the Stage 1 hugepage size into account - something along the lines
> of the snippet from previous email.
> 

I think the lsb should indicate the size of the memory region known to
be broken by the kernel - however that whole mechanism works.
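
For reference, the receiver side is expected to do something like the
sketch below. This is not from the thread, and it assumes a libc that
exposes si_addr_lsb and the BUS_MCEERR_* codes:

    #include <signal.h>
    #include <stdint.h>

    /* Hypothetical SIGBUS handler installed with sigaction() and SA_SIGINFO */
    static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
    {
            if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
                    /* si_addr_lsb is the least significant bit of si_addr */
                    uintptr_t len = (uintptr_t)1 << si->si_addr_lsb;
                    uintptr_t start = (uintptr_t)si->si_addr & ~(len - 1);

                    /*
                     * [start, start + len) is the region reported broken;
                     * a VMM would record or forward this range.
                     */
                    (void)start;
            }
    }

so whatever lsb we report is interpreted directly as the extent of the
damage.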

-Christoffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-27 14:47             ` Christoffer Dall
@ 2017-03-28 14:50               ` Punit Agrawal
  -1 siblings, 0 replies; 34+ messages in thread
From: Punit Agrawal @ 2017-03-28 14:50 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Marc Zyngier, Tyler Baicar, kvmarm, linux-arm-kernel, gengdongjiu

Christoffer Dall <cdall@linaro.org> writes:

> On Mon, Mar 27, 2017 at 02:31:44PM +0100, Punit Agrawal wrote:
>> Christoffer Dall <cdall@linaro.org> writes:
>> 
>> > On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
>> >> Hi guys,
>> >> 
>> >> On 27/03/17 12:20, Punit Agrawal wrote:
>> >> > Christoffer Dall <cdall@linaro.org> writes:
>> >> >> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
>> >> >>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
>> >> >>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>> >> >>> SIGBUS to any user space process using the page, and notify all the
>> >> >>> in-kernel users.
>> >> >>>
>> >> >>> If the page corresponded with guest memory, KVM will unmap this page
>> >> >>> from its stage2 page tables. The user space process that allocated
>> >> >>> this memory may have never touched this page in which case it may not
>> >> >>> be mapped meaning SIGBUS won't be delivered.
>> >> >>>
>> >> >>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
>> >> >>> comes to process the stage2 fault.
>> >> >>>
>> >> >>> Do as x86 does, and deliver the SIGBUS when we discover
>> >> >>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
>> >> >>> as this matches the user space mapping size.
>> >> 
>> >> >>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> >> >>> index 962616fd4ddd..9d1aa294e88f 100644
>> >> >>> --- a/arch/arm/kvm/mmu.c
>> >> >>> +++ b/arch/arm/kvm/mmu.c
>> >> >>> @@ -20,8 +20,10 @@
>> >> >>>  #include <linux/kvm_host.h>
>> >> >>>  #include <linux/io.h>
>> >> >>>  #include <linux/hugetlb.h>
>> >> >>> +#include <linux/sched/signal.h>
>> >> >>>  #include <trace/events/kvm.h>
>> >> >>>  #include <asm/pgalloc.h>
>> >> >>> +#include <asm/siginfo.h>
>> >> >>>  #include <asm/cacheflush.h>
>> >> >>>  #include <asm/kvm_arm.h>
>> >> >>>  #include <asm/kvm_mmu.h>
>> >> >>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>> >> >>>  	__coherent_cache_guest_page(vcpu, pfn, size);
>> >> >>>  }
>> >> >>>  
>> >> >>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
>> >> >>> +{
>> >> >>> +	siginfo_t info;
>> >> >>> +
>> >> >>> +	info.si_signo   = SIGBUS;
>> >> >>> +	info.si_errno   = 0;
>> >> >>> +	info.si_code    = BUS_MCEERR_AR;
>> >> >>> +	info.si_addr    = (void __user *)address;
>> >> >>> +
>> >> >>> +	if (hugetlb)
>> >> >>> +		info.si_addr_lsb = PMD_SHIFT;
>> >> >>> +	else
>> >> >>> +		info.si_addr_lsb = PAGE_SHIFT;
>> >> >>> +
>> >> >>> +	send_sig_info(SIGBUS, &info, current);
>> >> >>> +}
>> >> >>> +
>> >> >>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> >> >>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>> >> >>>  			  unsigned long fault_status)
>> >> >>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> >> >>>  	smp_rmb();
>> >> >>>  
>> >> >>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> >> >>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> >> >>> +		kvm_send_hwpoison_signal(hva, hugetlb);
>> >> >>
>> >> >> The way this is called means that we'll only notify userspace of a huge
>> >> >> mapping if userspace is mapping hugetlbfs, and not because the stage2
>> >> >> mapping may or may not have used transparent huge pages when the error
>> >> >> was discovered.  Is this the desired semantics?
>> >> 
>> >> No,
>> >> 
>> >> 
>> >> > I think so.
>> >> >
>> >> > AFAIUI, transparent hugepages are split before being poisoned while all
>> >> > the underlying pages of a hugepage are poisoned together, i.e., no
>> >> > splitting.
>> >> 
>> >> In which case I need to look into this some more!
>> >> 
>> >> My thinking was we should report the size that was knocked out of the stage2 to
>> >> avoid the guest repeatedly faulting until it has touched every guest-page-size
>> >> in the stage2 hole.
>> >
>> > By signaling something at the fault path, I think it's going to be very
>> > hard to backtrack how the stage 2 page tables looked like when faults
>> > started happening, because I think these are completely decoupled events
>> > (the mmu notifier and the later fault).
>> >
>> >> 
>> >> Reading the code in that kvm/mmu.c it looked like the mapping sizes would always
>> >> be the same as those used by userspace.
>> >
>> > I think the mapping sizes should be the same between userspace and KVM,
>> > but the mapping size of a particular page (and associated pages) may
>> > vary over time.
>> 
>> Stage 1 and Stage 2 support different hugepage sizes. A larger size
>> stage 1 page maps to multiple stage 2 page table entries. For stage 1,
>> we support PUD_SIZE, CONT_PMD_SIZE, PMD_SIZE and CONT_PTE_SIZE while
>> only PMD_SIZE is supported for Stage 2.
>> 
>> >
>> >> 
>> >> If the page was split before KVM could have taken this fault I assumed it would
>> >> fault on the page-size mapping and hugetlb would be false.
>> >
>> > I think you could have a huge page, which gets unmapped as a result on
>> > it getting split (perhaps because there was a failure on one page) and
>> > later as you fault, you can discover a range which can be a hugetlbfs or
>> > transparent huge pages.
>> >
>> > The question that I don't know is how Linux behaves if a page is marked
>> > with hwpoison, in that case, if Linux never supports THP and always
>> > marks an entire huge page in a hugetlbfs with the poison, then I think
>> > we're mostly good here.  If not, we should make sure we align with
>> > whatever the rest of the kernel does.
>> 
>> AFAICT, a hugetlbfs page is poisoned as a whole while thp is split
>> before poisoning. Quoting comment near the top of memory_failure() in
>> mm/memory_failure.c.
>> 
>>     /*
>>      * Currently errors on hugetlbfs pages are measured in hugepage units,
>>      * so nr_pages should be 1 << compound_order.  OTOH when errors are on
>>      * transparent hugepages, they are supposed to be split and error
>>      * measurement is done in normal page units.  So nr_pages should be one
>>      * in this case.
>>      */
>> 
>> >
>> >> (which is already
>> >> wrong for another reason, looks like I grabbed the variable before
>> >> transparent_hugepage_adjust() has had a go a it.).
>> >> 
>> >
>> > yes, which is why I asked if you only care about hugetlbfs.
>> >
>> 
>> Based on the comment above, we should never get a poisoned page that is
>> part of a transparent hugepage.
>> 
>> >> 
>> >> >> Also notice that the hva is not necessarily aligned to the beginning of
>> >> >> the huge page, so can we be giving userspace wrong information by
>> >> >> pointing in the middle of a huge page and telling it there was an
>> >> >> address error in the size of the PMD ?
>> >> >>
>> >> > 
>> >> > I could be reading it wrong but I think we are fine here - the address
>> >> > (hva) is the location that faulted. And the lsb indicates the least
>> >> > significant bit of the faulting address (See man sigaction(2)). The
>> >> > receiver of the signal is expected to use the address and lsb to workout
>> >> > the extent of corruption.
>> >> 
>> >> kill_proc() in mm/memory-failure.c does this too, but the address is set by
>> >> page_address_in_vma() in add_to_kill() of the same file. (I'll chat with Punit
>> >> off list.)
>> >> 
>> >> 
>> >> > Though I missed a subtlety while reviewing the patch before. The
>> >> > reported lsb should be for the userspace hugepage mapping (i.e., hva)
>> >> > and not for the stage 2.
>> >> 
>> >> I thought these were always supposed to be the same, and using hugetlb was a bug
>> >> because I didn't look closely enough at what is_vm_hugetlb_page() does.
>> 
>> See above.
>> 
>> >> 
>> >> 
>> >> > In light of this, I'd like to retract my Reviewed-by tag for this
>> >> > version of the patch as I believe we'll need to change the lsb
>> >> > reporting.
>> >> 
>> >> Sure, lets work out what this should be doing. I'm beginning to suspect x86's
>> >> 'always page size' was correct to begin with!
>> >> 
>> >
>> > I had a sense of that too, but it would be good to understand how you
>> > mark and individual page within a hugetlbfs huge page with hwpoison...
>> 
>> I don't think it is possible to mark an individual page in a hugetlbfs
>> page - it's all or nothing.
>> 
>> AFAICT, the SIGBUS report is for user mappings and doesn't have to care
>> whether it's Stage 2 hugetlb page or thp. And the lsb determination should
>> take the Stage 1 hugepage size into account - something along the lines
>> of the snippet from previous email.
>> 
>
> I think the lsb should indicate the size of the memory region known to
> be broken by the kernel - however that whole mechanism works.

Agreed.

To reiterate and confirm that we are on the same page (this time using
correct terminology) -

* the kernel poisons memory in PAGE_SIZE units or hugepage-sized units,
  depending on where the poisoned address maps to. If the affected
  location maps to a transparent hugepage, the thp is split and the
  PAGE_SIZE unit corresponding to the location is poisoned.

* When sending a SIGBUS on encountering a poisoned pfn, the lsb should
  be -
  
  - PAGE_SHIFT, if the hva does not map to a hugepage
  - huge_page_shift(hstate_vma(vma)), if the hva belongs to a hugepage

Hopefully that makes sense.
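
For reference, a minimal sketch of what that could look like (untested,
and assuming the caller can supply the vma for the hva; in
user_mem_abort() that lookup would need to be redone under mmap_sem):

static void kvm_send_hwpoison_signal(unsigned long address,
				     struct vm_area_struct *vma)
{
	siginfo_t info;

	info.si_signo   = SIGBUS;
	info.si_errno   = 0;
	info.si_code    = BUS_MCEERR_AR;
	info.si_addr    = (void __user *)address;

	if (is_vm_hugetlb_page(vma))
		/* hugetlbfs pages are poisoned as a whole */
		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
	else
		/* thp is split before poisoning, so page-size units */
		info.si_addr_lsb = PAGE_SHIFT;

	send_sig_info(SIGBUS, &info, current);
}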

Punit

>
> -Christoffer
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-28 14:50               ` Punit Agrawal
@ 2017-03-28 15:12                 ` Christoffer Dall
  -1 siblings, 0 replies; 34+ messages in thread
From: Christoffer Dall @ 2017-03-28 15:12 UTC (permalink / raw)
  To: Punit Agrawal
  Cc: Marc Zyngier, Tyler Baicar, kvmarm, linux-arm-kernel, gengdongjiu

On Tue, Mar 28, 2017 at 03:50:51PM +0100, Punit Agrawal wrote:
> Christoffer Dall <cdall@linaro.org> writes:
> 
> > On Mon, Mar 27, 2017 at 02:31:44PM +0100, Punit Agrawal wrote:
> >> Christoffer Dall <cdall@linaro.org> writes:
> >> 
> >> > On Mon, Mar 27, 2017 at 01:00:56PM +0100, James Morse wrote:
> >> >> Hi guys,
> >> >> 
> >> >> On 27/03/17 12:20, Punit Agrawal wrote:
> >> >> > Christoffer Dall <cdall@linaro.org> writes:
> >> >> >> On Wed, Mar 15, 2017 at 04:07:27PM +0000, James Morse wrote:
> >> >> >>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
> >> >> >>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> >> >> >>> SIGBUS to any user space process using the page, and notify all the
> >> >> >>> in-kernel users.
> >> >> >>>
> >> >> >>> If the page corresponded with guest memory, KVM will unmap this page
> >> >> >>> from its stage2 page tables. The user space process that allocated
> >> >> >>> this memory may have never touched this page in which case it may not
> >> >> >>> be mapped meaning SIGBUS won't be delivered.
> >> >> >>>
> >> >> >>> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
> >> >> >>> comes to process the stage2 fault.
> >> >> >>>
> >> >> >>> Do as x86 does, and deliver the SIGBUS when we discover
> >> >> >>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> >> >> >>> as this matches the user space mapping size.
> >> >> 
> >> >> >>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >> >> >>> index 962616fd4ddd..9d1aa294e88f 100644
> >> >> >>> --- a/arch/arm/kvm/mmu.c
> >> >> >>> +++ b/arch/arm/kvm/mmu.c
> >> >> >>> @@ -20,8 +20,10 @@
> >> >> >>>  #include <linux/kvm_host.h>
> >> >> >>>  #include <linux/io.h>
> >> >> >>>  #include <linux/hugetlb.h>
> >> >> >>> +#include <linux/sched/signal.h>
> >> >> >>>  #include <trace/events/kvm.h>
> >> >> >>>  #include <asm/pgalloc.h>
> >> >> >>> +#include <asm/siginfo.h>
> >> >> >>>  #include <asm/cacheflush.h>
> >> >> >>>  #include <asm/kvm_arm.h>
> >> >> >>>  #include <asm/kvm_mmu.h>
> >> >> >>> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
> >> >> >>>  	__coherent_cache_guest_page(vcpu, pfn, size);
> >> >> >>>  }
> >> >> >>>  
> >> >> >>> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
> >> >> >>> +{
> >> >> >>> +	siginfo_t info;
> >> >> >>> +
> >> >> >>> +	info.si_signo   = SIGBUS;
> >> >> >>> +	info.si_errno   = 0;
> >> >> >>> +	info.si_code    = BUS_MCEERR_AR;
> >> >> >>> +	info.si_addr    = (void __user *)address;
> >> >> >>> +
> >> >> >>> +	if (hugetlb)
> >> >> >>> +		info.si_addr_lsb = PMD_SHIFT;
> >> >> >>> +	else
> >> >> >>> +		info.si_addr_lsb = PAGE_SHIFT;
> >> >> >>> +
> >> >> >>> +	send_sig_info(SIGBUS, &info, current);
> >> >> >>> +}
> >> >> >>> +
> >> >> >>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> >> >>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> >> >> >>>  			  unsigned long fault_status)
> >> >> >>> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> >> >>>  	smp_rmb();
> >> >> >>>  
> >> >> >>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> >> >> >>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> >> >> >>> +		kvm_send_hwpoison_signal(hva, hugetlb);
> >> >> >>
> >> >> >> The way this is called means that we'll only notify userspace of a huge
> >> >> >> mapping if userspace is mapping hugetlbfs, and not because the stage2
> >> >> >> mapping may or may not have used transparent huge pages when the error
> >> >> >> was discovered.  Is this the desired semantics?
> >> >> 
> >> >> No,
> >> >> 
> >> >> 
> >> >> > I think so.
> >> >> >
> >> >> > AFAIUI, transparent hugepages are split before being poisoned while all
> >> >> > the underlying pages of a hugepage are poisoned together, i.e., no
> >> >> > splitting.
> >> >> 
> >> >> In which case I need to look into this some more!
> >> >> 
> >> >> My thinking was we should report the size that was knocked out of the stage2 to
> >> >> avoid the guest repeatedly faulting until it has touched every guest-page-size
> >> >> in the stage2 hole.
> >> >
> >> > By signaling something at the fault path, I think it's going to be very
> >> > hard to backtrack how the stage 2 page tables looked like when faults
> >> > started happening, because I think these are completely decoupled events
> >> > (the mmu notifier and the later fault).
> >> >
> >> >> 
> >> >> Reading the code in that kvm/mmu.c it looked like the mapping sizes would always
> >> >> be the same as those used by userspace.
> >> >
> >> > I think the mapping sizes should be the same between userspace and KVM,
> >> > but the mapping size of a particular page (and associated pages) may
> >> > vary over time.
> >> 
> >> Stage 1 and Stage 2 support different hugepage sizes. A larger size
> >> stage 1 page maps to multiple stage 2 page table entries. For stage 1,
> >> we support PUD_SIZE, CONT_PMD_SIZE, PMD_SIZE and CONT_PTE_SIZE while
> >> only PMD_SIZE is supported for Stage 2.
> >> 
> >> >
> >> >> 
> >> >> If the page was split before KVM could have taken this fault I assumed it would
> >> >> fault on the page-size mapping and hugetlb would be false.
> >> >
> >> > I think you could have a huge page, which gets unmapped as a result on
> >> > it getting split (perhaps because there was a failure on one page) and
> >> > later as you fault, you can discover a range which can be a hugetlbfs or
> >> > transparent huge pages.
> >> >
> >> > The question that I don't know is how Linux behaves if a page is marked
> >> > with hwpoison, in that case, if Linux never supports THP and always
> >> > marks an entire huge page in a hugetlbfs with the poison, then I think
> >> > we're mostly good here.  If not, we should make sure we align with
> >> > whatever the rest of the kernel does.
> >> 
> >> AFAICT, a hugetlbfs page is poisoned as a whole while thp is split
> >> before poisoning. Quoting comment near the top of memory_failure() in
> >> mm/memory_failure.c.
> >> 
> >>     /*
> >>      * Currently errors on hugetlbfs pages are measured in hugepage units,
> >>      * so nr_pages should be 1 << compound_order.  OTOH when errors are on
> >>      * transparent hugepages, they are supposed to be split and error
> >>      * measurement is done in normal page units.  So nr_pages should be one
> >>      * in this case.
> >>      */
> >> 
> >> >
> >> >> (which is already
> >> >> wrong for another reason, looks like I grabbed the variable before
> >> >> transparent_hugepage_adjust() has had a go a it.).
> >> >> 
> >> >
> >> > yes, which is why I asked if you only care about hugetlbfs.
> >> >
> >> 
> >> Based on the comment above, we should never get a poisoned page that is
> >> part of a transparent hugepage.
> >> 
> >> >> 
> >> >> >> Also notice that the hva is not necessarily aligned to the beginning of
> >> >> >> the huge page, so can we be giving userspace wrong information by
> >> >> >> pointing in the middle of a huge page and telling it there was an
> >> >> >> address error in the size of the PMD ?
> >> >> >>
> >> >> > 
> >> >> > I could be reading it wrong but I think we are fine here - the address
> >> >> > (hva) is the location that faulted. And the lsb indicates the least
> >> >> > significant bit of the faulting address (See man sigaction(2)). The
> >> >> > receiver of the signal is expected to use the address and lsb to workout
> >> >> > the extent of corruption.
> >> >> 
> >> >> kill_proc() in mm/memory-failure.c does this too, but the address is set by
> >> >> page_address_in_vma() in add_to_kill() of the same file. (I'll chat with Punit
> >> >> off list.)
> >> >> 
> >> >> 
> >> >> > Though I missed a subtlety while reviewing the patch before. The
> >> >> > reported lsb should be for the userspace hugepage mapping (i.e., hva)
> >> >> > and not for the stage 2.
> >> >> 
> >> >> I thought these were always supposed to be the same, and using hugetlb was a bug
> >> >> because I didn't look closely enough at what is_vm_hugetlb_page() does.
> >> 
> >> See above.
> >> 
> >> >> 
> >> >> 
> >> >> > In light of this, I'd like to retract my Reviewed-by tag for this
> >> >> > version of the patch as I believe we'll need to change the lsb
> >> >> > reporting.
> >> >> 
> >> >> Sure, lets work out what this should be doing. I'm beginning to suspect x86's
> >> >> 'always page size' was correct to begin with!
> >> >> 
> >> >
> >> > I had a sense of that too, but it would be good to understand how you
> >> > mark and individual page within a hugetlbfs huge page with hwpoison...
> >> 
> >> I don't think it is possible to mark an individual page in a hugetlbfs
> >> page - it's all or nothing.
> >> 
> >> AFAICT, the SIGBUS report is for user mappings and doesn't have to care
> >> whether it's Stage 2 hugetlb page or thp. And the lsb determination should
> >> take the Stage 1 hugepage size into account - something along the lines
> >> of the snippet from previous email.
> >> 
> >
> > I think the lsb should indicate the size of the memory region known to
> > be broken by the kernel - however that whole mechanism works.
> 
> Agreed.
> 
> To re-iterate and confirm that we are on the same page 

Haha, good one.

> (this time using
> correct terminology) -
> 
> * the kernel poisons pages in PAGE_SIZEed or hugepage sized units
>   depending on where the poisoned address maps to. If the affected
>   location maps to a transparent hugepage, the thp is split and the
>   PAGE_SIZed unit corresponding to the location is poisoned.
> 
> * When sending a SIGBUS on encountering a poisoned pfn, the lsb should
>   be -
>   
>   - PAGE_SHIFT, if the hva does not map to a hugepage
>   - huge_page_shift(hstate_vma(vma)), if the hva belongs to a hugepage
> 
> Hopefully that makes sense.
> 

Yes, sounds reasonable to me.  The only thing we could argue is that if
userspace knows it's dealing with hugetlbfs, it should already know the
minimal granularity of the memory it can deal with, so maybe that's why
x86 doesn't bother and always just uses PAGE_SHIFT.

In both cases though, the patch was mostly fine, and probably gets the
job done regardless of which lsb is reported.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-03-15 16:07 ` James Morse
@ 2017-04-04 23:05   ` gengdongjiu
  -1 siblings, 0 replies; 34+ messages in thread
From: gengdongjiu @ 2017-04-04 23:05 UTC (permalink / raw)
  To: James Morse
  Cc: Punit Agrawal, Marc Zyngier, Tyler Baicar, gengdongjiu,
	wuquanming, linux-arm-kernel, kvmarm

Hi James,
   thanks for the patch. Have you considered telling Qemu or the KVM tools
the reason for this bus error (SEA or SEI)?

When Qemu or the KVM tools get this SIGBUS signal, they do not know whether
the SIGBUS was received due to an SEA or an SEI.
Or does KVM only send this SIGBUS on an SEA? If so, for the SEI case, how
should Qemu simulate it and generate the CPER for the guest OS SEI?


2017-03-16 0:07 GMT+08:00 James Morse <james.morse@arm.com>:
> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64[0], notifications for
> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> SIGBUS to any user space process using the page, and notify all the
> in-kernel users.
>
> If the page corresponded with guest memory, KVM will unmap this page
> from its stage2 page tables. The user space process that allocated
> this memory may have never touched this page in which case it may not
> be mapped meaning SIGBUS won't be delivered.
>
> When this happens KVM discovers pfn == KVM_PFN_ERR_HWPOISON when it
> comes to process the stage2 fault.
>
> Do as x86 does, and deliver the SIGBUS when we discover
> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> as this matches the user space mapping size.
>
> Signed-off-by: James Morse <james.morse@arm.com>
> CC: gengdongjiu <gengdj.1984@gmail.com>
> ---
>  Without this patch both kvmtool and Qemu exit as the KVM_RUN ioctl() returns
>  EFAULT.
>  QEMU: error: kvm run failed Bad address
>  LVKM: KVM_RUN failed: Bad address
>
>  With this patch both kvmtool and Qemu receive SIGBUS ... and then exit.
>  In the future Qemu can use this signal to notify the guest, for more details
>  see hwpoison[1].
>
>  [0] https://www.spinics.net/lists/arm-kernel/msg560009.html
>  [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/hwpoison.txt
>
>
>  arch/arm/kvm/mmu.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 962616fd4ddd..9d1aa294e88f 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -20,8 +20,10 @@
>  #include <linux/kvm_host.h>
>  #include <linux/io.h>
>  #include <linux/hugetlb.h>
> +#include <linux/sched/signal.h>
>  #include <trace/events/kvm.h>
>  #include <asm/pgalloc.h>
> +#include <asm/siginfo.h>
>  #include <asm/cacheflush.h>
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_mmu.h>
> @@ -1237,6 +1239,23 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>         __coherent_cache_guest_page(vcpu, pfn, size);
>  }
>
> +static void kvm_send_hwpoison_signal(unsigned long address, bool hugetlb)
> +{
> +       siginfo_t info;
> +
> +       info.si_signo   = SIGBUS;
> +       info.si_errno   = 0;
> +       info.si_code    = BUS_MCEERR_AR;
> +       info.si_addr    = (void __user *)address;
> +
> +       if (hugetlb)
> +               info.si_addr_lsb = PMD_SHIFT;
> +       else
> +               info.si_addr_lsb = PAGE_SHIFT;
> +
> +       send_sig_info(SIGBUS, &info, current);
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                           struct kvm_memory_slot *memslot, unsigned long hva,
>                           unsigned long fault_status)
> @@ -1306,6 +1325,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         smp_rmb();
>
>         pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +       if (pfn == KVM_PFN_ERR_HWPOISON) {
> +               kvm_send_hwpoison_signal(hva, hugetlb);
> +               return 0;
> +       }
>         if (is_error_noslot_pfn(pfn))
>                 return -EFAULT;
>
> --
> 2.10.1
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-04-04 23:05   ` gengdongjiu
@ 2017-04-06  9:25     ` James Morse
  -1 siblings, 0 replies; 34+ messages in thread
From: James Morse @ 2017-04-06  9:25 UTC (permalink / raw)
  To: gengdongjiu
  Cc: Punit Agrawal, Marc Zyngier, Tyler Baicar, gengdongjiu,
	wuquanming, linux-arm-kernel, kvmarm

Hi gengdongjiu,

On 05/04/17 00:05, gengdongjiu wrote:
> thanks for the patch, have you consider to told Qemu or KVM tools
> the reason for this bus error(SEA/SEI)?

They should never need to know. We should treat Qemu/kvmtool like any other
program. Programs should only need to know about the effect on them, not the
underlying reason or mechanism.


> when Qemu or KVM tools get this SIGBUS signal, it do not know receive
> this SIGBUS due to SEA or SEI.

Why would this matter?

Firmware signalled Linux that something bad happened. Linux handles the problem
and everything keeps running.

The interface with firmware has to be architecture specific. When signalling
user-space it should be architecture agnostic, otherwise we can't write portable
user space code.


If Qemu was affected by the error (currently only if some of its memory was
hwpoisoned) we send it SIGBUS as we would for any other program. Qemu can choose
if and how to signal the guest about this error; it doesn't have to use the same
interface as firmware and the host used. With TCG Qemu may be emulating a
totally different architecture!
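
As a rough illustration (not part of the patch): a minimal example of how a
VMM-like process could consume that SIGBUS, using only the siginfo fields
documented in sigaction(2); what it then does with the poisoned range is
entirely its own policy:

#define _GNU_SOURCE	/* for BUS_MCEERR_AR / si_addr_lsb on some libcs */
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	if (si->si_code == BUS_MCEERR_AR) {
		/* The poisoned range is [si_addr, si_addr + (1 << si_addr_lsb)) */
		void *addr = si->si_addr;
		size_t len = (size_t)1 << si->si_addr_lsb;

		/* a real VMM would record this and decide how (or whether)
		 * to report it to the guest; this sketch just gives up */
		(void)addr;
		(void)len;
	}
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction	= sigbus_handler,
		.sa_flags	= SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... create the VM and call KVM_RUN ... */
	return 0;
}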


Looking at the list of errors in table 250 of UEFI 2.6, cache-errors are the
only case I can imagine we would want to report to a guest, these are
effectively transient memory errors. SIGBUS is still appropriate here, but we
probably need a new si_code value to indicate the error can be cleared. (unlike
hwpoison which appears to never re-use the affected page).


Thanks,

James

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-04-06  9:25     ` James Morse
@ 2017-04-06 15:06       ` gengdongjiu
  -1 siblings, 0 replies; 34+ messages in thread
From: gengdongjiu @ 2017-04-06 15:06 UTC (permalink / raw)
  To: James Morse
  Cc: Punit Agrawal, Marc Zyngier, Tyler Baicar, gengdongjiu,
	wuquanming, linux-arm-kernel, kvmarm

Hi James,
  thanks for the mail.

2017-04-06 17:25 GMT+08:00, James Morse <james.morse@arm.com>:
> Hi gengdongjiu,
>
> On 05/04/17 00:05, gengdongjiu wrote:
>> thanks for the patch, have you considered telling Qemu or KVM tools
>> the reason for this bus error (SEA/SEI)?
>
> They should never need to know. We should treat Qemu/kvmtool like any other
> program. Programs should only need to know about the effect on them, not the
> underlying reason or mechanism.
>
>
>> when Qemu or KVM tools get this SIGBUS signal, they do not know whether
>> the SIGBUS was received due to an SEA or an SEI.
>
> Why would this matter?
>
> Firmware signalled Linux that something bad happened. Linux handles the
> problem and everything keeps running.
>
> The interface with firmware has to be architecture specific. When signalling
> user space it should be architecture agnostic; otherwise we can't write
> portable user-space code.
>
>
> If Qemu was affected by the error (currently only if some of its memory was
> hwpoisoned) we send it SIGBUS as we would for any other program. Qemu can
> choose if and how to signal the guest about this error; it doesn't have to
> use the same interface that firmware and the host used. With TCG, Qemu may
> be emulating a totally different architecture!
>
>
> Looking at the list of errors in table 250 of UEFI 2.6, cache errors are the
> only case I can imagine we would want to report to a guest; these are
> effectively transient memory errors. SIGBUS is still appropriate here, but
> we probably need a new si_code value to indicate that the error can be
> cleared (unlike hwpoison, which appears never to re-use the affected page).

James,

I understand your idea.

Below is my previous idea:
  When Qemu is signalled, it generates the CPER/GHES records, and then
Qemu/kvmtool injects the SEA or SEI into the guest OS. Depending on the
reason for the error, Qemu/kvmtool injects a different notification type.
Whether an SEA, SEI or IRQ is injected, the guest OS handles it with
different software logic: for example, the guest OS calls
"ghes_notify_sea" when an SEA happens and "ghes_notify_sei" when an SEI
happens.

So what is your suggested way to notify the guest OS after Qemu has
generated the CPER records?

On x86, Qemu uses the method below (sending an ACPI event) to notify the
guest OS of a runtime CPER modification; this method does not go through KVM:
            /* Send _GPE.E05 event */
            acpi_send_event(DEVICE(obj), ACPI_VMGENID_CHANGE_STATUS);

So for notifying the guest OS, I think it may be better for Qemu to use an
ioctl() to let KVM inject the error.
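
To make the shape of that concrete, below is a sketch of the kind of call Qemu
might make. Neither KVM_ARM_INJECT_SEA nor struct kvm_arm_sea_event exists in
KVM today; both names (and the ioctl number) are invented purely for
illustration.

    /* Hypothetical only: KVM_ARM_INJECT_SEA and struct kvm_arm_sea_event do
     * not exist; they stand in for whatever interface KVM ends up providing. */
    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    struct kvm_arm_sea_event {              /* invented for illustration */
            uint64_t fault_ipa;             /* guest address the CPER describes */
    };

    #define KVM_ARM_INJECT_SEA _IOW(KVMIO, 0xb0, struct kvm_arm_sea_event)

    static int inject_guest_sea(int vcpu_fd, uint64_t fault_ipa)
    {
            struct kvm_arm_sea_event ev = { .fault_ipa = fault_ipa };

            /* Qemu has already written the CPER records into the guest's
             * GHES memory; this call would only deliver the notification. */
            return ioctl(vcpu_fd, KVM_ARM_INJECT_SEA, &ev);
    }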



>
>
> Thanks,
>
> James
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-04-06 15:06       ` gengdongjiu
@ 2017-04-07 16:12         ` James Morse
  -1 siblings, 0 replies; 34+ messages in thread
From: James Morse @ 2017-04-07 16:12 UTC (permalink / raw)
  To: gengdongjiu
  Cc: Punit Agrawal, Marc Zyngier, Tyler Baicar, gengdongjiu,
	wuquanming, linux-arm-kernel, kvmarm

Hi gengdongjiu,

On 06/04/17 16:06, gengdongjiu wrote:
> Below is my previous idea:
>   When Qemu is signalled, it generates the CPER/GHES records, and then
> Qemu/kvmtool injects the SEA or SEI into the guest OS. Depending on the
> reason for the error, Qemu/kvmtool injects a different notification type.
> Whether an SEA, SEI or IRQ is injected, the guest OS handles it with
> different software logic: for example, the guest OS calls
> "ghes_notify_sea" when an SEA happens and "ghes_notify_sei" when an SEI
> happens.

Sounds reasonable. Qemu shouldn't have to care what the guest OS is, so
injecting these notifications should stick to KVM APIs.


> So what is your suggested way to notify the guest OS after Qemu has
> generated the CPER records?
[...]
> So for notifying the guest OS, I think it may be better for Qemu to use an
> ioctl() to let KVM inject the error.

I agree.

Synchronous External Abort is something that can always be delivered; Qemu can
make it look like an SEA was taken on a vcpu by modifying the registers using
KVM's KVM_SET_ONE_REG ioctl(). The pseudocode for what is required is in the
ARM-ARM's 'AArch64.TakeException'.
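
As a sketch of the mechanics only (the register list and the values to write
have to come from that pseudocode, so they are left to the caller here), each
register update is one KVM_SET_ONE_REG call:

    /* Minimal KVM_SET_ONE_REG wrapper. Emulating SEA entry means calling
     * this for each register AArch64.TakeException says to update (e.g.
     * PSTATE, PC, SPSR_EL1, ELR_EL1, ESR_EL1, FAR_EL1); the register IDs
     * and values are deliberately not filled in. */
    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <stdint.h>

    static int set_one_reg(int vcpu_fd, uint64_t id, uint64_t value)
    {
            struct kvm_one_reg reg = {
                    .id   = id,
                    .addr = (uintptr_t)&value,
            };

            return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
    }

Qemu and kvmtool already drive this ioctl() for ordinary register access; the
new part is only the choice of registers and values.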


SError Interrupt is more complicated as it can be masked. Fortunately the
architecture has a way to inject an SError into a guest using HCR_EL2.VSE, and
I think KVM should allow user space to inject an SError with this. Marc and
Christoffer will have the best idea of how such an API should work. To be
useful for injecting SEI we need to be able to set VSESR_EL2 along with the
HCR_EL2.VSE bit. KVM will also need to know about the RAS extensions to
save/restore some of the new registers listed in A1.7.5 of the new ARM-ARM.
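
For reference, the host side could look roughly like the sketch below,
modelled on the existing kvm_inject_vabt(); the VSESR_EL2 write and the
user-space API to request it are exactly the open questions here, so they
appear only as a comment.

    /* Sketch: pend a virtual SError for the vcpu by setting HCR_EL2.VSE,
     * as kvm_inject_vabt() already does. With the RAS extensions the
     * syndrome would also need to be written to VSESR_EL2 before entry,
     * which needs new KVM support. */
    #include <linux/kvm_host.h>
    #include <asm/kvm_emulate.h>
    #include <asm/kvm_arm.h>

    static void kvm_pend_guest_serror(struct kvm_vcpu *vcpu)
    {
            vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
            /* + write the wanted syndrome to VSESR_EL2 */
    }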


Thanks,

James

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-04-07 16:12 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-15 16:07 [PATCH] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory James Morse
2017-03-17 15:06 ` Punit Agrawal
2017-03-17 15:48   ` James Morse
2017-03-24 18:30 ` Christoffer Dall
2017-03-27 11:20   ` Punit Agrawal
2017-03-27 12:00     ` James Morse
2017-03-27 12:44       ` Christoffer Dall
2017-03-27 13:31         ` Punit Agrawal
2017-03-27 13:38           ` Marc Zyngier
2017-03-27 14:04             ` Punit Agrawal
2017-03-27 14:47           ` Christoffer Dall
2017-03-28 14:50             ` Punit Agrawal
2017-03-28 15:12               ` Christoffer Dall
2017-04-04 23:05 ` gengdongjiu
2017-04-06  9:25   ` James Morse
2017-04-06 15:06     ` gengdongjiu
2017-04-07 16:12       ` James Morse
