[PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-05-24 16:32 ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-05-24 16:32 UTC (permalink / raw)
  To: kvmarm; +Cc: Marc Zyngier, Punit Agrawal, linux-arm-kernel

Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
broken memory can call memory_failure() in mm/memory-failure.c to deliver
SIGBUS to any user space process using the page, and notify all the
in-kernel users.

If the page corresponded with guest memory, KVM will unmap this page
from its stage2 page tables. The user space process that allocated
this memory may have never touched this page in which case it may not
be mapped meaning SIGBUS won't be delivered.

This works well until a guest accesses that page, and KVM discovers
pfn == KVM_PFN_ERR_HWPOISON when it comes to process the stage2 fault.

Do as x86 does, and deliver the SIGBUS when we discover
KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
as this matches the user space mapping size.

Cc: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>

---
This will be needed once we enable ARCH_SUPPORTS_MEMORY_FAILURE for
arm64[0]. It is harmless until then as KVM_PFN_ERR_HWPOISON will
never be seen.

Changes since v1:
 * Pass the vma to kvm_send_hwpoison_signal(), used Punit's huge_page_shift()
   calculation to find the block size.
 * ... tested against hugepage not transparent huge page ...

Today we will inherit some existing breakage between KVM, hugepages
and hwpoison. Patch at [1].

[0] https://www.spinics.net/lists/arm-kernel/msg581657.html
[1] https://marc.info/?l=linux-mm&m=149564219918427&w=2

 virt/kvm/arm/mmu.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 313ee646480f..eaa29aeb7c5b 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -20,6 +20,7 @@
 #include <linux/kvm_host.h>
 #include <linux/io.h>
 #include <linux/hugetlb.h>
+#include <linux/sched/signal.h>
 #include <trace/events/kvm.h>
 #include <asm/pgalloc.h>
 #include <asm/cacheflush.h>
@@ -1249,6 +1250,24 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
 	__coherent_cache_guest_page(vcpu, pfn, size);
 }

+static void kvm_send_hwpoison_signal(unsigned long address,
+				     struct vm_area_struct *vma)
+{
+	siginfo_t info;
+
+	info.si_signo   = SIGBUS;
+	info.si_errno   = 0;
+	info.si_code    = BUS_MCEERR_AR;
+	info.si_addr    = (void __user *)address;
+
+	if (is_vm_hugetlb_page(vma))
+		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
+	else
+		info.si_addr_lsb = PAGE_SHIFT;
+
+	send_sig_info(SIGBUS, &info, current);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
@@ -1318,6 +1337,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	smp_rmb();

 	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
+	if (pfn == KVM_PFN_ERR_HWPOISON) {
+		kvm_send_hwpoison_signal(hva, vma);
+		return 0;
+	}
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;

-- 
2.11.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-05-24 16:32 ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-05-24 16:32 UTC (permalink / raw)
  To: linux-arm-kernel

Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
broken memory can call memory_failure() in mm/memory-failure.c to deliver
SIGBUS to any user space process using the page, and notify all the
in-kernel users.

If the page corresponded with guest memory, KVM will unmap this page
from its stage2 page tables. The user space process that allocated
this memory may have never touched this page in which case it may not
be mapped meaning SIGBUS won't be delivered.

This works well until a guest accesses that page, and KVM discovers
pfn == KVM_PFN_ERR_HWPOISON when it comes to process the stage2 fault.

Do as x86 does, and deliver the SIGBUS when we discover
KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
as this matches the user space mapping size.

Cc: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>

---
This will be needed once we enable ARCH_SUPPORTS_MEMORY_FAILURE for
arm64[0]. It is harmless until then as KVM_PFN_ERR_HWPOISON will
never be seen.

Changes since v1:
 * Pass the vma to kvm_send_hwpoison_signal(), used Punit's huge_page_shift()
   calculation to find the block size.
 * ... tested against hugepage not transparent huge page ...

Today we will inherit some existing breakage between KVM, hugepages
and hwpoison. Patch at [1].

[0] https://www.spinics.net/lists/arm-kernel/msg581657.html
[1] https://marc.info/?l=linux-mm&m=149564219918427&w=2

 virt/kvm/arm/mmu.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 313ee646480f..eaa29aeb7c5b 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -20,6 +20,7 @@
 #include <linux/kvm_host.h>
 #include <linux/io.h>
 #include <linux/hugetlb.h>
+#include <linux/sched/signal.h>
 #include <trace/events/kvm.h>
 #include <asm/pgalloc.h>
 #include <asm/cacheflush.h>
@@ -1249,6 +1250,24 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
 	__coherent_cache_guest_page(vcpu, pfn, size);
 }

+static void kvm_send_hwpoison_signal(unsigned long address,
+				     struct vm_area_struct *vma)
+{
+	siginfo_t info;
+
+	info.si_signo   = SIGBUS;
+	info.si_errno   = 0;
+	info.si_code    = BUS_MCEERR_AR;
+	info.si_addr    = (void __user *)address;
+
+	if (is_vm_hugetlb_page(vma))
+		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
+	else
+		info.si_addr_lsb = PAGE_SHIFT;
+
+	send_sig_info(SIGBUS, &info, current);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
@@ -1318,6 +1337,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	smp_rmb();

 	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
+	if (pfn == KVM_PFN_ERR_HWPOISON) {
+		kvm_send_hwpoison_signal(hva, vma);
+		return 0;
+	}
 	if (is_error_noslot_pfn(pfn))
 		return -EFAULT;

-- 
2.11.0

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-05-24 16:32 ` James Morse
@ 2017-06-01 22:22   ` Christoffer Dall
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoffer Dall @ 2017-06-01 22:22 UTC (permalink / raw)
  To: James Morse; +Cc: Marc Zyngier, Punit Agrawal, kvmarm, linux-arm-kernel

On Wed, May 24, 2017 at 05:32:50PM +0100, James Morse wrote:
> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> SIGBUS to any user space process using the page, and notify all the
> in-kernel users.
> 
> If the page corresponded with guest memory, KVM will unmap this page
> from its stage2 page tables. The user space process that allocated
> this memory may have never touched this page in which case it may not
> be mapped meaning SIGBUS won't be delivered.

Sorry, I don't remember, what is the scenario where KVM can have a
mapping in stage 2 without there being a corresponding mapping for user
space?

> 
> This works well until a guest accesses that page, and KVM discovers
> pfn == KVM_PFN_ERR_HWPOISON when it comes to process the stage2 fault.
> 
> Do as x86 does, and deliver the SIGBUS when we discover
> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb

But this part about the stage 2 mapping size is not what the code does.
It uses the granularity of the mmap region, if I'm not mistaken.

I lost track of what the right thing was, can you remind me?

Thanks,
-Christoffer


> as this matches the user space mapping size.
> 
> Cc: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> 
> ---
> This will be needed once we enable ARCH_SUPPORTS_MEMORY_FAILURE for
> arm64[0]. It is harmless until then as KVM_PFN_ERR_HWPOISON will
> never be seen.
> 
> Changes since v1:
>  * Pass the vma to kvm_send_hwpoison_signal(), used Punit's huge_page_shift()
>    calculation to find the block size.
>  * ... tested against hugepage not transparent huge page ...
> 
> Today we will inherit some existing breakage between KVM, hugepages
> and hwpoison. Patch at [1].
> 
> 
> [0] https://www.spinics.net/lists/arm-kernel/msg581657.html
> [1] https://marc.info/?l=linux-mm&m=149564219918427&w=2
> 
>  virt/kvm/arm/mmu.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 313ee646480f..eaa29aeb7c5b 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -20,6 +20,7 @@
>  #include <linux/kvm_host.h>
>  #include <linux/io.h>
>  #include <linux/hugetlb.h>
> +#include <linux/sched/signal.h>
>  #include <trace/events/kvm.h>
>  #include <asm/pgalloc.h>
>  #include <asm/cacheflush.h>
> @@ -1249,6 +1250,24 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>  	__coherent_cache_guest_page(vcpu, pfn, size);
>  }
>  
> +static void kvm_send_hwpoison_signal(unsigned long address,
> +				     struct vm_area_struct *vma)
> +{
> +	siginfo_t info;
> +
> +	info.si_signo   = SIGBUS;
> +	info.si_errno   = 0;
> +	info.si_code    = BUS_MCEERR_AR;
> +	info.si_addr    = (void __user *)address;
> +
> +	if (is_vm_hugetlb_page(vma))
> +		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
> +	else
> +		info.si_addr_lsb = PAGE_SHIFT;
> +
> +	send_sig_info(SIGBUS, &info, current);
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1318,6 +1337,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	smp_rmb();
>  
>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> +		kvm_send_hwpoison_signal(hva, vma);
> +		return 0;
> +	}
>  	if (is_error_noslot_pfn(pfn))
>  		return -EFAULT;
>  
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-01 22:22   ` Christoffer Dall
  0 siblings, 0 replies; 38+ messages in thread
From: Christoffer Dall @ 2017-06-01 22:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, May 24, 2017 at 05:32:50PM +0100, James Morse wrote:
> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> SIGBUS to any user space process using the page, and notify all the
> in-kernel users.
> 
> If the page corresponded with guest memory, KVM will unmap this page
> from its stage2 page tables. The user space process that allocated
> this memory may have never touched this page in which case it may not
> be mapped meaning SIGBUS won't be delivered.

Sorry, I don't remember, what is the scenario where KVM can have a
mapping in stage 2 without there being a corresponding mapping for user
space?

> 
> This works well until a guest accesses that page, and KVM discovers
> pfn == KVM_PFN_ERR_HWPOISON when it comes to process the stage2 fault.
> 
> Do as x86 does, and deliver the SIGBUS when we discover
> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb

But this part about the stage 2 mapping size is not what the code does.
It uses the granularity of the mmap region, if I'm not mistaken.

I lost track of what the right thing was, can you remind me?

Thanks,
-Christoffer


> as this matches the user space mapping size.
> 
> Cc: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> 
> ---
> This will be needed once we enable ARCH_SUPPORTS_MEMORY_FAILURE for
> arm64[0]. It is harmless until then as KVM_PFN_ERR_HWPOISON will
> never be seen.
> 
> Changes since v1:
>  * Pass the vma to kvm_send_hwpoison_signal(), used Punit's huge_page_shift()
>    calculation to find the block size.
>  * ... tested against hugepage not transparent huge page ...
> 
> Today we will inherit some existing breakage between KVM, hugepages
> and hwpoison. Patch at [1].
> 
> 
> [0] https://www.spinics.net/lists/arm-kernel/msg581657.html
> [1] https://marc.info/?l=linux-mm&m=149564219918427&w=2
> 
>  virt/kvm/arm/mmu.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 313ee646480f..eaa29aeb7c5b 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -20,6 +20,7 @@
>  #include <linux/kvm_host.h>
>  #include <linux/io.h>
>  #include <linux/hugetlb.h>
> +#include <linux/sched/signal.h>
>  #include <trace/events/kvm.h>
>  #include <asm/pgalloc.h>
>  #include <asm/cacheflush.h>
> @@ -1249,6 +1250,24 @@ static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, kvm_pfn_t pfn,
>  	__coherent_cache_guest_page(vcpu, pfn, size);
>  }
>  
> +static void kvm_send_hwpoison_signal(unsigned long address,
> +				     struct vm_area_struct *vma)
> +{
> +	siginfo_t info;
> +
> +	info.si_signo   = SIGBUS;
> +	info.si_errno   = 0;
> +	info.si_code    = BUS_MCEERR_AR;
> +	info.si_addr    = (void __user *)address;
> +
> +	if (is_vm_hugetlb_page(vma))
> +		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
> +	else
> +		info.si_addr_lsb = PAGE_SHIFT;
> +
> +	send_sig_info(SIGBUS, &info, current);
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1318,6 +1337,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	smp_rmb();
>  
>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> +		kvm_send_hwpoison_signal(hva, vma);
> +		return 0;
> +	}
>  	if (is_error_noslot_pfn(pfn))
>  		return -EFAULT;
>  
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-01 22:22   ` Christoffer Dall
@ 2017-06-02 10:16     ` James Morse
  -1 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-02 10:16 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Punit Agrawal, kvmarm, linux-arm-kernel

Hi Christoffer,

On 01/06/17 23:22, Christoffer Dall wrote:
> On Wed, May 24, 2017 at 05:32:50PM +0100, James Morse wrote:
>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>> SIGBUS to any user space process using the page, and notify all the
>> in-kernel users.
>>
>> If the page corresponded with guest memory, KVM will unmap this page
>> from its stage2 page tables. The user space process that allocated
>> this memory may have never touched this page in which case it may not
>> be mapped meaning SIGBUS won't be delivered.

> Sorry, I don't remember, what is the scenario where KVM can have a
> mapping in stage 2 without there being a corresponding mapping for user
> space?

(looks like I mean more than one thing by mapping... oops.)

Mapping in that there is a physical page for this user-space address, but when
qemu first touches it, it will be faulted in as its not been touched before.

(rambling on:)
When mm/memory_failure.c:memory_failure() walks the rmap for the poisoned page
it doesn't find qemu, so no user-space processes gets signalled.

If qemu were to touch the page, fixup_user_fault() will return -EHWPOISON and we
deliver sigbus to qemu.

This doesn't happen when the guest touches the page as kvm's core code maps
-EHWPOISON to KVM_PFN_ERR_HWPOISON, which the arm/arm64 kvm code ignores and
drops -EFAULT on userspace instead.


>> This works well until a guest accesses that page, and KVM discovers
>> pfn == KVM_PFN_ERR_HWPOISON when it comes to process the stage2 fault.
>>
>> Do as x86 does, and deliver the SIGBUS when we discover
>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> 
> But this part about the stage 2 mapping size is not what the code does.
> It uses the granularity of the mmap region, if I'm not mistaken.

Yes. This is what I got wrong last time, Punit put me right, but I forgot to
update the commit message.

I will change this 'Use the'... sentence in v3 to:
> Use the hugepage size as si_addr_lsb if this vma was allocated as a hugepage.
> Transparent hugepages will be split by memory_failure() before we see them
> here.


Thanks,

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-02 10:16     ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-02 10:16 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Christoffer,

On 01/06/17 23:22, Christoffer Dall wrote:
> On Wed, May 24, 2017 at 05:32:50PM +0100, James Morse wrote:
>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>> SIGBUS to any user space process using the page, and notify all the
>> in-kernel users.
>>
>> If the page corresponded with guest memory, KVM will unmap this page
>> from its stage2 page tables. The user space process that allocated
>> this memory may have never touched this page in which case it may not
>> be mapped meaning SIGBUS won't be delivered.

> Sorry, I don't remember, what is the scenario where KVM can have a
> mapping in stage 2 without there being a corresponding mapping for user
> space?

(looks like I mean more than one thing by mapping... oops.)

Mapping in that there is a physical page for this user-space address, but when
qemu first touches it, it will be faulted in as its not been touched before.

(rambling on:)
When mm/memory_failure.c:memory_failure() walks the rmap for the poisoned page
it doesn't find qemu, so no user-space processes gets signalled.

If qemu were to touch the page, fixup_user_fault() will return -EHWPOISON and we
deliver sigbus to qemu.

This doesn't happen when the guest touches the page as kvm's core code maps
-EHWPOISON to KVM_PFN_ERR_HWPOISON, which the arm/arm64 kvm code ignores and
drops -EFAULT on userspace instead.


>> This works well until a guest accesses that page, and KVM discovers
>> pfn == KVM_PFN_ERR_HWPOISON when it comes to process the stage2 fault.
>>
>> Do as x86 does, and deliver the SIGBUS when we discover
>> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> 
> But this part about the stage 2 mapping size is not what the code does.
> It uses the granularity of the mmap region, if I'm not mistaken.

Yes. This is what I got wrong last time, Punit put me right, but I forgot to
update the commit message.

I will change this 'Use the'... sentence in v3 to:
> Use the hugepage size as si_addr_lsb if this vma was allocated as a hugepage.
> Transparent hugepages will be split by memory_failure() before we see them
> here.


Thanks,

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-02 10:16     ` James Morse
@ 2017-06-02 10:43       ` Christoffer Dall
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoffer Dall @ 2017-06-02 10:43 UTC (permalink / raw)
  To: James Morse; +Cc: Marc Zyngier, Punit Agrawal, kvmarm, linux-arm-kernel

On Fri, Jun 02, 2017 at 11:16:10AM +0100, James Morse wrote:
> Hi Christoffer,
> 
> On 01/06/17 23:22, Christoffer Dall wrote:
> > On Wed, May 24, 2017 at 05:32:50PM +0100, James Morse wrote:
> >> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
> >> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> >> SIGBUS to any user space process using the page, and notify all the
> >> in-kernel users.
> >>
> >> If the page corresponded with guest memory, KVM will unmap this page
> >> from its stage2 page tables. The user space process that allocated
> >> this memory may have never touched this page in which case it may not
> >> be mapped meaning SIGBUS won't be delivered.
> 
> > Sorry, I don't remember, what is the scenario where KVM can have a
> > mapping in stage 2 without there being a corresponding mapping for user
> > space?
> 
> (looks like I mean more than one thing by mapping... oops.)
> 
> Mapping in that there is a physical page for this user-space address, but when
> qemu first touches it, it will be faulted in as its not been touched before.

So that's where I still get confused.  When KVM faults in the page on
behalf of an initial VM access, it calls get_user_pages() (or something
equivalent through a series of indirections like gfn_to_pfn_prot()), and
wouldn't that create the equivalent mapping in the normal user space
page table?

(I may be forgetting some important detail here though).

> 
> (rambling on:)
> When mm/memory_failure.c:memory_failure() walks the rmap for the poisoned page
> it doesn't find qemu, so no user-space processes gets signalled.
> 
> If qemu were to touch the page, fixup_user_fault() will return -EHWPOISON and we
> deliver sigbus to qemu.
> 
> This doesn't happen when the guest touches the page as kvm's core code maps
> -EHWPOISON to KVM_PFN_ERR_HWPOISON, which the arm/arm64 kvm code ignores and
> drops -EFAULT on userspace instead.
> 

Assuming I'm wrong about the above, or there's just some subtlety I'm
missing, then the rest of your explanation makes perfect sense.

> 
> >> This works well until a guest accesses that page, and KVM discovers
> >> pfn == KVM_PFN_ERR_HWPOISON when it comes to process the stage2 fault.
> >>
> >> Do as x86 does, and deliver the SIGBUS when we discover
> >> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> > 
> > But this part about the stage 2 mapping size is not what the code does.
> > It uses the granularity of the mmap region, if I'm not mistaken.
> 
> Yes. This is what I got wrong last time, Punit put me right, but I forgot to
> update the commit message.
> 
> I will change this 'Use the'... sentence in v3 to:
> > Use the hugepage size as si_addr_lsb if this vma was allocated as a hugepage.
> > Transparent hugepages will be split by memory_failure() before we see them
> > here.

Right, I figured it was just the commit message that needed updating.

Sorry about being pedantic, but having a misleading record of this
change is guaranteed to confuse me in the future.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-02 10:43       ` Christoffer Dall
  0 siblings, 0 replies; 38+ messages in thread
From: Christoffer Dall @ 2017-06-02 10:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 02, 2017 at 11:16:10AM +0100, James Morse wrote:
> Hi Christoffer,
> 
> On 01/06/17 23:22, Christoffer Dall wrote:
> > On Wed, May 24, 2017 at 05:32:50PM +0100, James Morse wrote:
> >> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
> >> broken memory can call memory_failure() in mm/memory-failure.c to deliver
> >> SIGBUS to any user space process using the page, and notify all the
> >> in-kernel users.
> >>
> >> If the page corresponded with guest memory, KVM will unmap this page
> >> from its stage2 page tables. The user space process that allocated
> >> this memory may have never touched this page in which case it may not
> >> be mapped meaning SIGBUS won't be delivered.
> 
> > Sorry, I don't remember, what is the scenario where KVM can have a
> > mapping in stage 2 without there being a corresponding mapping for user
> > space?
> 
> (looks like I mean more than one thing by mapping... oops.)
> 
> Mapping in that there is a physical page for this user-space address, but when
> qemu first touches it, it will be faulted in as its not been touched before.

So that's where I still get confused.  When KVM faults in the page on
behalf of an initial VM access, it calls get_user_pages() (or something
equivalent through a series of indirections like gfn_to_pfn_prot()), and
wouldn't that create the equivalent mapping in the normal user space
page table?

(I may be forgetting some important detail here though).

> 
> (rambling on:)
> When mm/memory_failure.c:memory_failure() walks the rmap for the poisoned page
> it doesn't find qemu, so no user-space processes gets signalled.
> 
> If qemu were to touch the page, fixup_user_fault() will return -EHWPOISON and we
> deliver sigbus to qemu.
> 
> This doesn't happen when the guest touches the page as kvm's core code maps
> -EHWPOISON to KVM_PFN_ERR_HWPOISON, which the arm/arm64 kvm code ignores and
> drops -EFAULT on userspace instead.
> 

Assuming I'm wrong about the above, or there's just some subtlety I'm
missing, then the rest of your explanation makes perfect sense.

> 
> >> This works well until a guest accesses that page, and KVM discovers
> >> pfn == KVM_PFN_ERR_HWPOISON when it comes to process the stage2 fault.
> >>
> >> Do as x86 does, and deliver the SIGBUS when we discover
> >> KVM_PFN_ERR_HWPOISON. Use the stage2 mapping size as the si_addr_lsb
> > 
> > But this part about the stage 2 mapping size is not what the code does.
> > It uses the granularity of the mmap region, if I'm not mistaken.
> 
> Yes. This is what I got wrong last time, Punit put me right, but I forgot to
> update the commit message.
> 
> I will change this 'Use the'... sentence in v3 to:
> > Use the hugepage size as si_addr_lsb if this vma was allocated as a hugepage.
> > Transparent hugepages will be split by memory_failure() before we see them
> > here.

Right, I figured it was just the commit message that needed updating.

Sorry about being pedantic, but having a misleading record of this
change is guaranteed to confuse me in the future.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-02 10:43       ` Christoffer Dall
@ 2017-06-07  9:41         ` James Morse
  -1 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-07  9:41 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Punit Agrawal, kvmarm, linux-arm-kernel

Hi Christoffer,

On 02/06/17 11:43, Christoffer Dall wrote:
> On Fri, Jun 02, 2017 at 11:16:10AM +0100, James Morse wrote:
>> On 01/06/17 23:22, Christoffer Dall wrote:
>>> On Wed, May 24, 2017 at 05:32:50PM +0100, James Morse wrote:
>>>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
>>>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>>>> SIGBUS to any user space process using the page, and notify all the
>>>> in-kernel users.
>>>>
>>>> If the page corresponded with guest memory, KVM will unmap this page
>>>> from its stage2 page tables. The user space process that allocated
>>>> this memory may have never touched this page in which case it may not
>>>> be mapped meaning SIGBUS won't be delivered.
>>
>>> Sorry, I don't remember, what is the scenario where KVM can have a
>>> mapping in stage 2 without there being a corresponding mapping for user
>>> space?
>>
>> (looks like I mean more than one thing by mapping... oops.)
>>
>> Mapping in that there is a physical page for this user-space address, but when
>> qemu first touches it, it will be faulted in as its not been touched before.
> 
> So that's where I still get confused.  When KVM faults in the page on
> behalf of an initial VM access, it calls get_user_pages() (or something
> equivalent through a series of indirections like gfn_to_pfn_prot()), and
> wouldn't that create the equivalent mapping in the normal user space
> page table?
> 
> (I may be forgetting some important detail here though).

I evidently stopped before I got to the bottom of this, the commit message is
based on the way I first hit this (physical address not found in any rmaps:
nothing signalled). Maybe there is another bug, or I hit a bug that has been
fixed. I will go back to reproducing this.
(I think we still need the patch if a guest is allowed to keep running while
parts of its memory are hwpoisoned)

You're right this should behave the same as a fault from user-space, on arm64
this path is roughly:
> do_page_fault()
> __do_page_fault()
> handle_mm_fault()
> __handle_mm_fault()
> handle_pte_fault()

__handle_mm_fault() then allocates pud/pmd and calls handle_pte_fault() .. which
disappears back into the jungle.

On a stage2 fault from a guest, the path is roughly:
> kvm_handle_guest_abort()
> user_mem_abort()
> gfn_to_pfn_prot()
> hva_to_pfn()
> hva_to_pfn_slow()
> get_user_pages_unlocked()
> __get_user_pages_unlocked()
> __get_user_pages_locked()
> __get_user_pages()
> faultin_page()
> handle_mm_fault()

>From where the path/behaviour should be the same as the fault through arch code.

FOLL_POPULATE is a red-herring, it doesn't seem to mean anything unless
FOLL_MLOCK is set too.


> Right, I figured it was just the commit message that needed updating.
> 
> Sorry about being pedantic, but having a misleading record of this
> change is guaranteed to confuse me in the future.

If it's wrong it needs fixing. I'm a fan of pedantry in all its forms!



Thanks,

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-07  9:41         ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-07  9:41 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Christoffer,

On 02/06/17 11:43, Christoffer Dall wrote:
> On Fri, Jun 02, 2017 at 11:16:10AM +0100, James Morse wrote:
>> On 01/06/17 23:22, Christoffer Dall wrote:
>>> On Wed, May 24, 2017 at 05:32:50PM +0100, James Morse wrote:
>>>> Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
>>>> broken memory can call memory_failure() in mm/memory-failure.c to deliver
>>>> SIGBUS to any user space process using the page, and notify all the
>>>> in-kernel users.
>>>>
>>>> If the page corresponded with guest memory, KVM will unmap this page
>>>> from its stage2 page tables. The user space process that allocated
>>>> this memory may have never touched this page in which case it may not
>>>> be mapped meaning SIGBUS won't be delivered.
>>
>>> Sorry, I don't remember, what is the scenario where KVM can have a
>>> mapping in stage 2 without there being a corresponding mapping for user
>>> space?
>>
>> (looks like I mean more than one thing by mapping... oops.)
>>
>> Mapping in that there is a physical page for this user-space address, but when
>> qemu first touches it, it will be faulted in as its not been touched before.
> 
> So that's where I still get confused.  When KVM faults in the page on
> behalf of an initial VM access, it calls get_user_pages() (or something
> equivalent through a series of indirections like gfn_to_pfn_prot()), and
> wouldn't that create the equivalent mapping in the normal user space
> page table?
> 
> (I may be forgetting some important detail here though).

I evidently stopped before I got to the bottom of this, the commit message is
based on the way I first hit this (physical address not found in any rmaps:
nothing signalled). Maybe there is another bug, or I hit a bug that has been
fixed. I will go back to reproducing this.
(I think we still need the patch if a guest is allowed to keep running while
parts of its memory are hwpoisoned)

You're right this should behave the same as a fault from user-space, on arm64
this path is roughly:
> do_page_fault()
> __do_page_fault()
> handle_mm_fault()
> __handle_mm_fault()
> handle_pte_fault()

__handle_mm_fault() then allocates pud/pmd and calls handle_pte_fault() .. which
disappears back into the jungle.

On a stage2 fault from a guest, the path is roughly:
> kvm_handle_guest_abort()
> user_mem_abort()
> gfn_to_pfn_prot()
> hva_to_pfn()
> hva_to_pfn_slow()
> get_user_pages_unlocked()
> __get_user_pages_unlocked()
> __get_user_pages_locked()
> __get_user_pages()
> faultin_page()
> handle_mm_fault()

>From where the path/behaviour should be the same as the fault through arch code.

FOLL_POPULATE is a red-herring, it doesn't seem to mean anything unless
FOLL_MLOCK is set too.


> Right, I figured it was just the commit message that needed updating.
> 
> Sorry about being pedantic, but having a misleading record of this
> change is guaranteed to confuse me in the future.

If it's wrong it needs fixing. I'm a fan of pedantry in all its forms!



Thanks,

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-07  9:41         ` James Morse
@ 2017-06-16 11:32           ` James Morse
  -1 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-16 11:32 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, Punit Agrawal, kvmarm, linux-arm-kernel

Hi Christoffer,

On 07/06/17 10:41, James Morse wrote:
> I evidently stopped before I got to the bottom of this, the commit message is
> based on the way I first hit this

I've worked out where I went wrong with this. memory_failure()/hwpoison has two
'modes', early and late. My testing was broken for 'early', but caused both to
happen at the same time, leading to this confusion.

The affected page was mapped and found in the rmap, it then gets unmapped by
memory_failure(), which then skipped the early notification because the flags
were wrong. Meanwhile the late notification fires at the same time on another CPU.

So, from the top:
-----%<-----
memory_failure() has two modes, early and late. Early is used by
machine-managers like Qemu to receive a notification when a memory error is
notified to the host. These can then be relayed to the guest before the affected
page is accessed. To enable this, the process must set PR_MCE_KILL_EARLY in
PR_MCE_KILL_SET using the prctl() syscall.

Once the early notification has been handled, nothing stops the machine-manager
or guest from accessing the affected page. If the machine-manager does this the
page will fail to be mapped and SIGBUS will be sent. This patch adds the
equivalent path for when the guest accesses the page, sending SIGBUS to the
machine-manager.

These two signals can be distinguished by the machine-manager using their
si_code: BUS_MCEERR_AO for 'action optional' early notifications, and
BUS_MCEERR_AR for 'action required' synchronous/late notifications.
-----%<-----

If this clears everything up I will post a v3 with the above as the commit message.

Thanks!

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-16 11:32           ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-16 11:32 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Christoffer,

On 07/06/17 10:41, James Morse wrote:
> I evidently stopped before I got to the bottom of this, the commit message is
> based on the way I first hit this

I've worked out where I went wrong with this. memory_failure()/hwpoison has two
'modes', early and late. My testing was broken for 'early', but caused both to
happen at the same time, leading to this confusion.

The affected page was mapped and found in the rmap, it then gets unmapped by
memory_failure(), which then skipped the early notification because the flags
were wrong. Meanwhile the late notification fires at the same time on another CPU.

So, from the top:
-----%<-----
memory_failure() has two modes, early and late. Early is used by
machine-managers like Qemu to receive a notification when a memory error is
notified to the host. These can then be relayed to the guest before the affected
page is accessed. To enable this, the process must set PR_MCE_KILL_EARLY in
PR_MCE_KILL_SET using the prctl() syscall.

Once the early notification has been handled, nothing stops the machine-manager
or guest from accessing the affected page. If the machine-manager does this the
page will fail to be mapped and SIGBUS will be sent. This patch adds the
equivalent path for when the guest accesses the page, sending SIGBUS to the
machine-manager.

These two signals can be distinguished by the machine-manager using their
si_code: BUS_MCEERR_AO for 'action optional' early notifications, and
BUS_MCEERR_AR for 'action required' synchronous/late notifications.
-----%<-----

If this clears everything up I will post a v3 with the above as the commit message.

Thanks!

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-05-24 16:32 ` James Morse
@ 2017-06-21  7:42   ` gengdongjiu
  -1 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-21  7:42 UTC (permalink / raw)
  To: James Morse, kvmarm
  Cc: wuquanming, Marc Zyngier, Punit Agrawal, Huangshaoyu, linux-arm-kernel

Hi James,
  I have a comment here:

On 2017/5/25 0:32, James Morse wrote:
> +static void kvm_send_hwpoison_signal(unsigned long address,
> +				     struct vm_area_struct *vma)
> +{
> +	siginfo_t info;
> +
> +	info.si_signo   = SIGBUS;
> +	info.si_errno   = 0;
> +	info.si_code    = BUS_MCEERR_AR;
> +	info.si_addr    = (void __user *)address;
> +
> +	if (is_vm_hugetlb_page(vma))
> +		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
> +	else
> +		info.si_addr_lsb = PAGE_SHIFT;
> +
> +	send_sig_info(SIGBUS, &info, current);
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1318,6 +1337,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	smp_rmb();
>  
>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> +		kvm_send_hwpoison_signal(hva, vma);
> +		return 0;
> +	}
I heard from our CPU hardware team, when happen HWpoison, CPU hardware does not record the IPA address in the HPFAR_EL2.
Only when the SEA error is related to the page table walk, the HPFAR_EL2 register is updated.
here we got the pfn/gfn from the register HPFAR_EL2, if CPU does not update the HPFAR_EL2 register, we may can not use this method to get the pfn/gfn.
could you confirm arm's armv8.0/armv8.2 standard CPU also use such design? if so, we may need to use other method to get the gfn/pfn/hva address.


>  	if (is_error_noslot_pfn(pfn))
>  		return -EFAULT;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-21  7:42   ` gengdongjiu
  0 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-21  7:42 UTC (permalink / raw)
  To: linux-arm-kernel

Hi James,
  I have a comment here:

On 2017/5/25 0:32, James Morse wrote:
> +static void kvm_send_hwpoison_signal(unsigned long address,
> +				     struct vm_area_struct *vma)
> +{
> +	siginfo_t info;
> +
> +	info.si_signo   = SIGBUS;
> +	info.si_errno   = 0;
> +	info.si_code    = BUS_MCEERR_AR;
> +	info.si_addr    = (void __user *)address;
> +
> +	if (is_vm_hugetlb_page(vma))
> +		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
> +	else
> +		info.si_addr_lsb = PAGE_SHIFT;
> +
> +	send_sig_info(SIGBUS, &info, current);
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1318,6 +1337,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	smp_rmb();
>  
>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> +		kvm_send_hwpoison_signal(hva, vma);
> +		return 0;
> +	}
I heard from our CPU hardware team, when happen HWpoison, CPU hardware does not record the IPA address in the HPFAR_EL2.
Only when the SEA error is related to the page table walk, the HPFAR_EL2 register is updated.
here we got the pfn/gfn from the register HPFAR_EL2, if CPU does not update the HPFAR_EL2 register, we may can not use this method to get the pfn/gfn.
could you confirm arm's armv8.0/armv8.2 standard CPU also use such design? if so, we may need to use other method to get the gfn/pfn/hva address.


>  	if (is_error_noslot_pfn(pfn))
>  		return -EFAULT;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-21  7:42   ` gengdongjiu
@ 2017-06-21  9:53     ` James Morse
  -1 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-21  9:53 UTC (permalink / raw)
  To: gengdongjiu
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	Huangshaoyu, linux-arm-kernel, kvmarm

Hi gengdongjiu,

(+CC: Achin)

On 21/06/17 08:42, gengdongjiu wrote:
> On 2017/5/25 0:32, James Morse wrote:
>> +static void kvm_send_hwpoison_signal(unsigned long address,
>> +				     struct vm_area_struct *vma)
>> +{
>> +	siginfo_t info;
>> +
>> +	info.si_signo   = SIGBUS;
>> +	info.si_errno   = 0;
>> +	info.si_code    = BUS_MCEERR_AR;
>> +	info.si_addr    = (void __user *)address;
>> +
>> +	if (is_vm_hugetlb_page(vma))
>> +		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
>> +	else
>> +		info.si_addr_lsb = PAGE_SHIFT;
>> +
>> +	send_sig_info(SIGBUS, &info, current);
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>>  			  unsigned long fault_status)
>> @@ -1318,6 +1337,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	smp_rmb();
>>  
>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> +		kvm_send_hwpoison_signal(hva, vma);
>> +		return 0;
>> +	}

> I heard from our CPU hardware team, when happen HWpoison, CPU hardware does not record the IPA address in the HPFAR_EL2.

I think we discussed this before[0], your CPU has a feature called 'hwpoison'
that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
which handles the offline-ing of memory pages when it receives a notification
through APEI. I've tried to call this memory_failure() to avoid this confusion.

This patch is to handle stage2 faults when the page was removed from the stage2
mapping by the memory_failure() code. v3 of this patch[3] does a much better job
of describing this.

(... I don't think your question is related to this patch ...)

> Only when the SEA error is related to the page table walk, the HPFAR_EL2 register is updated.
> here we got the pfn/gfn from the register HPFAR_EL2, if CPU does not update the HPFAR_EL2 register,
> we may can not use this method to get the pfn/gfn.

This patch is only concerned with ordinary stage2 faults. Linux doesn't want to
give the page to the guest as it has been marked with the PG_HWpoison flag.
Instead we trigger memory_failure()'s 'late' notification.

To the CPU this will look like a normal stage2 fault.

> could you confirm arm's armv8.0/armv8.2 standard CPU also use such design?
> if so, we may need to use other method to get the gfn/pfn/hva address.

Your question is what happens when a guest accesses a location your CPU has
marked with its 'hwpoison'. From the RAS spec[4] I would expect this to be
reported as a Synchronous External Abort. For firmware-first error handling this
should be taken to EL3.

>From here firmware should generate CPER records and then notify the OS via one
of the APEI mechanisms. Because a guest was running this notification must reach
EL2. KVM can then switch back to the host and invoke the APEI GHES handler,
which for this kind of error will probably call memory_failure().

What happens if HPFAR_EL2 isn't set when this kind of error occurs?
Provided EL3 can learn the physical address that triggered the exception then
for firmware-first this isn't a problem. These errors should be taken to EL3 and
then described by CPER records to the OS. The OS should process the CPER records
in preference to attempting 'kernel first error handling'.

memory_failure() will unmap the affected pages from user space (and KVMs stage2)
and deliver signals (if Qemu/kvmtool registered for them via prctl()). KVM won't
re-enter the guest if there is a signal pending, from here Qemu/kvmtool get the
affected VA which they can use to notify the guest.

I hope this helps,

James

[0] https://lkml.org/lkml/2017/5/12/492
[1] https://lwn.net/Articles/348886/
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/hwpoison.txt
[3] https://www.spinics.net/lists/arm-kernel/msg589515.html
[4]
https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-21  9:53     ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-21  9:53 UTC (permalink / raw)
  To: linux-arm-kernel

Hi gengdongjiu,

(+CC: Achin)

On 21/06/17 08:42, gengdongjiu wrote:
> On 2017/5/25 0:32, James Morse wrote:
>> +static void kvm_send_hwpoison_signal(unsigned long address,
>> +				     struct vm_area_struct *vma)
>> +{
>> +	siginfo_t info;
>> +
>> +	info.si_signo   = SIGBUS;
>> +	info.si_errno   = 0;
>> +	info.si_code    = BUS_MCEERR_AR;
>> +	info.si_addr    = (void __user *)address;
>> +
>> +	if (is_vm_hugetlb_page(vma))
>> +		info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
>> +	else
>> +		info.si_addr_lsb = PAGE_SHIFT;
>> +
>> +	send_sig_info(SIGBUS, &info, current);
>> +}
>> +
>>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>>  			  unsigned long fault_status)
>> @@ -1318,6 +1337,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  	smp_rmb();
>>  
>>  	pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> +		kvm_send_hwpoison_signal(hva, vma);
>> +		return 0;
>> +	}

> I heard from our CPU hardware team, when happen HWpoison, CPU hardware does not record the IPA address in the HPFAR_EL2.

I think we discussed this before[0], your CPU has a feature called 'hwpoison'
that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
which handles the offline-ing of memory pages when it receives a notification
through APEI. I've tried to call this memory_failure() to avoid this confusion.

This patch is to handle stage2 faults when the page was removed from the stage2
mapping by the memory_failure() code. v3 of this patch[3] does a much better job
of describing this.

(... I don't think your question is related to this patch ...)

> Only when the SEA error is related to the page table walk, the HPFAR_EL2 register is updated.
> here we got the pfn/gfn from the register HPFAR_EL2, if CPU does not update the HPFAR_EL2 register,
> we may can not use this method to get the pfn/gfn.

This patch is only concerned with ordinary stage2 faults. Linux doesn't want to
give the page to the guest as it has been marked with the PG_HWpoison flag.
Instead we trigger memory_failure()'s 'late' notification.

To the CPU this will look like a normal stage2 fault.

> could you confirm arm's armv8.0/armv8.2 standard CPU also use such design?
> if so, we may need to use other method to get the gfn/pfn/hva address.

Your question is what happens when a guest accesses a location your CPU has
marked with its 'hwpoison'. From the RAS spec[4] I would expect this to be
reported as a Synchronous External Abort. For firmware-first error handling this
should be taken to EL3.

>From here firmware should generate CPER records and then notify the OS via one
of the APEI mechanisms. Because a guest was running this notification must reach
EL2. KVM can then switch back to the host and invoke the APEI GHES handler,
which for this kind of error will probably call memory_failure().

What happens if HPFAR_EL2 isn't set when this kind of error occurs?
Provided EL3 can learn the physical address that triggered the exception then
for firmware-first this isn't a problem. These errors should be taken to EL3 and
then described by CPER records to the OS. The OS should process the CPER records
in preference to attempting 'kernel first error handling'.

memory_failure() will unmap the affected pages from user space (and KVMs stage2)
and deliver signals (if Qemu/kvmtool registered for them via prctl()). KVM won't
re-enter the guest if there is a signal pending, from here Qemu/kvmtool get the
affected VA which they can use to notify the guest.

I hope this helps,

James

[0] https://lkml.org/lkml/2017/5/12/492
[1] https://lwn.net/Articles/348886/
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/vm/hwpoison.txt
[3] https://www.spinics.net/lists/arm-kernel/msg589515.html
[4]
https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-21  9:53     ` James Morse
@ 2017-06-21 10:59       ` gengdongjiu
  -1 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-21 10:59 UTC (permalink / raw)
  To: James Morse
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	Huangshaoyu, linux-arm-kernel, kvmarm

Hi James,

On 2017/6/21 17:53, James Morse wrote:
> I think we discussed this before[0], your CPU has a feature called 'hwpoison'
> that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
> which handles the offline-ing of memory pages when it receives a notification
> through APEI. I've tried to call this memory_failure() to avoid this confusion.
> 
> This patch is to handle stage2 faults when the page was removed from the stage2
> mapping by the memory_failure() code. v3 of this patch[3] does a much better job
> of describing this.
> 
> (... I don't think your question is related to this patch ...)

I know your meaning about the Linux 'hwpoison' feature.
Let see the code that how to get the "pfn"

///get the pfn
fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
gfn = fault_ipa >> PAGE_SHIFT;
pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);

As shown in above code, when happen SEA, the fault_ipa is got from the HPFAR_EL2 register.
if the HPFAR_EL2 does not record the IPA. the fault_ipa is zero, then gfn is zero, so the pfn is unknown.

so below judgement always false although firmware notify the memory_failure through APEI, because we do not get the right fault memory page.
using this API "kvm_vcpu_get_fault_ipa" can not get the right fault memory page if cpu does not update the HPFAR_EL2.

+	if (pfn == KVM_PFN_ERR_HWPOISON) {
+		kvm_send_hwpoison_signal(hva, vma);
+		return 0;
+	}

so may be you need to double confirm that whether armv8.0/armv8.2  standard CPU can always update the  HPFAR_EL2 registers.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-21 10:59       ` gengdongjiu
  0 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-21 10:59 UTC (permalink / raw)
  To: linux-arm-kernel

Hi James,

On 2017/6/21 17:53, James Morse wrote:
> I think we discussed this before[0], your CPU has a feature called 'hwpoison'
> that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
> which handles the offline-ing of memory pages when it receives a notification
> through APEI. I've tried to call this memory_failure() to avoid this confusion.
> 
> This patch is to handle stage2 faults when the page was removed from the stage2
> mapping by the memory_failure() code. v3 of this patch[3] does a much better job
> of describing this.
> 
> (... I don't think your question is related to this patch ...)

I know your meaning about the Linux 'hwpoison' feature.
Let see the code that how to get the "pfn"

///get the pfn
fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
gfn = fault_ipa >> PAGE_SHIFT;
pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);

As shown in above code, when happen SEA, the fault_ipa is got from the HPFAR_EL2 register.
if the HPFAR_EL2 does not record the IPA. the fault_ipa is zero, then gfn is zero, so the pfn is unknown.

so below judgement always false although firmware notify the memory_failure through APEI, because we do not get the right fault memory page.
using this API "kvm_vcpu_get_fault_ipa" can not get the right fault memory page if cpu does not update the HPFAR_EL2.

+	if (pfn == KVM_PFN_ERR_HWPOISON) {
+		kvm_send_hwpoison_signal(hva, vma);
+		return 0;
+	}

so may be you need to double confirm that whether armv8.0/armv8.2  standard CPU can always update the  HPFAR_EL2 registers.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-21  9:53     ` James Morse
@ 2017-06-21 11:12       ` gengdongjiu
  -1 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-21 11:12 UTC (permalink / raw)
  To: James Morse
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	Huangshaoyu, linux-arm-kernel, kvmarm

Hi james

On 2017/6/21 17:53, James Morse wrote:
> What happens if HPFAR_EL2 isn't set when this kind of error occurs?
> Provided EL3 can learn the physical address that triggered the exception then
> for firmware-first this isn't a problem. These errors should be taken to EL3 and
> then described by CPER records to the OS. The OS should process the CPER records
> in preference to attempting 'kernel first error handling'

here the pfn is from the HPFAR_EL2, not from the CPER record, which does not related with the CPER records.
Although OS can process the CPER records, but here the pfn may be always zero.
+	if (pfn == KVM_PFN_ERR_HWPOISON) {
+		kvm_send_hwpoison_signal(hva, vma);
+		return 0;
+	}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-21 11:12       ` gengdongjiu
  0 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-21 11:12 UTC (permalink / raw)
  To: linux-arm-kernel

Hi james

On 2017/6/21 17:53, James Morse wrote:
> What happens if HPFAR_EL2 isn't set when this kind of error occurs?
> Provided EL3 can learn the physical address that triggered the exception then
> for firmware-first this isn't a problem. These errors should be taken to EL3 and
> then described by CPER records to the OS. The OS should process the CPER records
> in preference to attempting 'kernel first error handling'

here the pfn is from the HPFAR_EL2, not from the CPER record, which does not related with the CPER records.
Although OS can process the CPER records, but here the pfn may be always zero.
+	if (pfn == KVM_PFN_ERR_HWPOISON) {
+		kvm_send_hwpoison_signal(hva, vma);
+		return 0;
+	}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-21 10:59       ` gengdongjiu
@ 2017-06-21 12:44         ` James Morse
  -1 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-21 12:44 UTC (permalink / raw)
  To: gengdongjiu
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	Huangshaoyu, linux-arm-kernel, kvmarm

Hi gengdongjiu,

On 21/06/17 11:59, gengdongjiu wrote:
> On 2017/6/21 17:53, James Morse wrote:
>> I think we discussed this before[0], your CPU has a feature called 'hwpoison'
>> that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
>> which handles the offline-ing of memory pages when it receives a notification
>> through APEI. I've tried to call this memory_failure() to avoid this confusion.
>>
>> This patch is to handle stage2 faults when the page was removed from the stage2
>> mapping by the memory_failure() code. v3 of this patch[3] does a much better job
>> of describing this.
>>
>> (... I don't think your question is related to this patch ...)
> 
> I know your meaning about the Linux 'hwpoison' feature.

Okay, I assume we are also talking about firmware-first RAS events and your APEI
notifications use SEA.

I think we are looking at different parts of the code, here is what I see should
happen:

For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
handle the fault.

Tyler's RAS series added an earlier check:
> /*
>  * The host kernel will handle the synchronous external abort. There
>  * is no need to pass the error into the guest.
>  */
> if (is_abort_sea(fault_status)) {
> 	if (!handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
> 		return 1;
> }
>

This goes on to call ghes_notify_sea() which will handle the error and cause KVM
to exit this function. KVM makes no further attempt to handle the fault as APEI
should have done everything necessary. KVM will re-enter the guest, unless there
are signals pending.

(You're right that here the fault_ipa is the wrong thing to pass, but
handle_guest_sea() doesn't use it...)

We do need to enable HCR_EL2.TEA which was added with v8.2s RAS extensions, but
the cpufeature patch is a pre-requisite.

> Let see the code that how to get the "pfn"
> 
> ///get the pfn
> fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> gfn = fault_ipa >> PAGE_SHIFT;
> pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);

> As shown in above code, when happen SEA, the fault_ipa is got from the HPFAR_EL2 register.
> if the HPFAR_EL2 does not record the IPA. the fault_ipa is zero, then gfn is zero, so the pfn is unknown.

> so below judgement always false although firmware notify the memory_failure through APEI, because we do not get the right fault memory page.
> using this API "kvm_vcpu_get_fault_ipa" can not get the right fault memory page if cpu does not update the HPFAR_EL2.

The path to the below code that uses fault_ipa->gfn->pfn->hva is:
kvm_handle_guest_abort()
user_mem_abort()
kvm_send_hwpoison_signal().

But for an external abort due to RAS we never get past kvm_handle_guest_abort()
as we call out to the APEI ghes to handle the RAS error instead.

For a data external abort that wasn't due to RAS we still don't get here as KVM
will hit the vcpu with kvm_inject_vabt() instead.

> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> +		kvm_send_hwpoison_signal(hva, vma);
> +		return 0;
> +	}

Are you seeing a guest repeatedly trigger external-abort on the same address?
If so, can you add debug messages to check if handle_guest_sea() is called? Does
it find work to do? If so kvm_handle_guest_abort() should exit.
Do your CPER records cause memory_failure() to be run?
Does try_to_unmap() in hwpoison_user_mappings() run and succeed? If so the
faulty page should be unmapped from stage2, from now on the guest should only
trigger normal stage2 faults for this address.

How does your firmware choose to route injected-External-Aborts for APEI
notifications? Do we need to enable HCR_EL2.TEA to make this work properly?

> so may be you need to double confirm that whether armv8.0/armv8.2  standard
> CPU can always update the  HPFAR_EL2 registers.

I don't know what the CPUs do, but the ARM-ARM allows the FAR to be not-valid
for some external aborts. This is indicated by the Far-not-Valid bit in the ESR.

With firmware-first RAS the only external aborts that Linux should see are SEA
APEI notifications. We shouldn't expect firmware to set much beyond the minimum
for these. The KVM code touched by this patch shouldn't run for an APEI
notification.

Thanks,

James

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-21 12:44         ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-21 12:44 UTC (permalink / raw)
  To: linux-arm-kernel

Hi gengdongjiu,

On 21/06/17 11:59, gengdongjiu wrote:
> On 2017/6/21 17:53, James Morse wrote:
>> I think we discussed this before[0], your CPU has a feature called 'hwpoison'
>> that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
>> which handles the offline-ing of memory pages when it receives a notification
>> through APEI. I've tried to call this memory_failure() to avoid this confusion.
>>
>> This patch is to handle stage2 faults when the page was removed from the stage2
>> mapping by the memory_failure() code. v3 of this patch[3] does a much better job
>> of describing this.
>>
>> (... I don't think your question is related to this patch ...)
> 
> I know your meaning about the Linux 'hwpoison' feature.

Okay, I assume we are also talking about firmware-first RAS events and your APEI
notifications use SEA.

I think we are looking at different parts of the code, here is what I see should
happen:

For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
handle the fault.

Tyler's RAS series added an earlier check:
> /*
>  * The host kernel will handle the synchronous external abort. There
>  * is no need to pass the error into the guest.
>  */
> if (is_abort_sea(fault_status)) {
> 	if (!handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
> 		return 1;
>?}
>

This goes on to call ghes_notify_sea() which will handle the error and cause KVM
to exit this function. KVM makes no further attempt to handle the fault as APEI
should have done everything necessary. KVM will re-enter the guest, unless there
are signals pending.

(You're right that here the fault_ipa is the wrong thing to pass, but
handle_guest_sea() doesn't use it...)

We do need to enable HCR_EL2.TEA which was added with v8.2s RAS extensions, but
the cpufeature patch is a pre-requisite.

> Let see the code that how to get the "pfn"
> 
> ///get the pfn
> fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> gfn = fault_ipa >> PAGE_SHIFT;
> pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);

> As shown in above code, when happen SEA, the fault_ipa is got from the HPFAR_EL2 register.
> if the HPFAR_EL2 does not record the IPA. the fault_ipa is zero, then gfn is zero, so the pfn is unknown.

> so below judgement always false although firmware notify the memory_failure through APEI, because we do not get the right fault memory page.
> using this API "kvm_vcpu_get_fault_ipa" can not get the right fault memory page if cpu does not update the HPFAR_EL2.

The path to the below code that uses fault_ipa->gfn->pfn->hva is:
kvm_handle_guest_abort()
user_mem_abort()
kvm_send_hwpoison_signal().

But for an external abort due to RAS we never get past kvm_handle_guest_abort()
as we call out to the APEI ghes to handle the RAS error instead.

For a data external abort that wasn't due to RAS we still don't get here as KVM
will hit the vcpu with kvm_inject_vabt() instead.

> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
> +		kvm_send_hwpoison_signal(hva, vma);
> +		return 0;
> +	}

Are you seeing a guest repeatedly trigger external-abort on the same address?
If so, can you add debug messages to check if handle_guest_sea() is called? Does
it find work to do? If so kvm_handle_guest_abort() should exit.
Do your CPER records cause memory_failure() to be run?
Does try_to_unmap() in hwpoison_user_mappings() run and succeed? If so the
faulty page should be unmapped from stage2, from now on the guest should only
trigger normal stage2 faults for this address.

How does your firmware choose to route injected-External-Aborts for APEI
notifications? Do we need to enable HCR_EL2.TEA to make this work properly?

> so may be you need to double confirm that whether armv8.0/armv8.2  standard
> CPU can always update the  HPFAR_EL2 registers.

I don't know what the CPUs do, but the ARM-ARM allows the FAR to be not-valid
for some external aborts. This is indicated by the Far-not-Valid bit in the ESR.

With firmware-first RAS the only external aborts that Linux should see are SEA
APEI notifications. We shouldn't expect firmware to set much beyond the minimum
for these. The KVM code touched by this patch shouldn't run for an APEI
notification.

Thanks,

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-21 12:44         ` James Morse
@ 2017-06-22  6:47           ` gengdongjiu
  -1 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-22  6:47 UTC (permalink / raw)
  To: James Morse
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	Tyler Baicar, Huangshaoyu, linux-arm-kernel, kvmarm

Hi James,

On 2017/6/21 20:44, James Morse wrote:
> Hi gengdongjiu,
> 
> On 21/06/17 11:59, gengdongjiu wrote:
>> On 2017/6/21 17:53, James Morse wrote:
>>> I think we discussed this before[0], your CPU has a feature called 'hwpoison'
>>> that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
>>> which handles the offline-ing of memory pages when it receives a notification
>>> through APEI. I've tried to call this memory_failure() to avoid this confusion.
>>>
>>> This patch is to handle stage2 faults when the page was removed from the stage2
>>> mapping by the memory_failure() code. v3 of this patch[3] does a much better job
>>> of describing this.
>>>
>>> (... I don't think your question is related to this patch ...)
>>
>> I know your meaning about the Linux 'hwpoison' feature.
> 
> Okay, I assume we are also talking about firmware-first RAS events and your APEI
> notifications use SEA.
> 
> 
> I think we are looking at different parts of the code, here is what I see should
> happen:
> 
> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
> indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
> handle the fault.
The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the kvm_vcpu_dabt_isextabt() depend on the CPU design.
In my platform the EA bit is zero for the RAS Synchronous External Abort, so will not matches the kvm_vcpu_dabt_isextabt(), so the code will continue handle the fault.


static inline bool kvm_vcpu_dabt_isextabt(struct kvm_vcpu *vcpu)
{
	return kvm_vcpu_get_hsr(vcpu) & HSR_DABT_EA;
}

EA, bit [9]
	External Abort. As described in the Architecture Reference Manual. Provides an IMPLEMENTATION
	DEFINED classification of the external abort.
DFSC, bits [5:0], when synchronous external Data Abort
	Data Fault Status Code. The possible values of this field are:
	0b010000 Synchronous External Abort on memory access.
	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
	table. DFSC[1:0] encode the level.
	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
	used and reserved in the RAS extension.
IFSC, bits [5:0], when synchronous external Instruction Abort
	Instruction Fault Status Code. The possible values of this field are:
	0b010000 Synchronous External Abort on memory access.
	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
	table. IFSC[1:0] encode the level.
	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
	used and reserved in the RAS extension.

> 
> Tyler's RAS series added an earlier check:
>> /*
>>  * The host kernel will handle the synchronous external abort. There
>>  * is no need to pass the error into the guest.
>>  */
>> if (is_abort_sea(fault_status)) {
>> 	if (!handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
>> 		return 1;
>>  }
>>
> 
> This goes on to call ghes_notify_sea() which will handle the error and cause KVM
> to exit this function. KVM makes no further attempt to handle the fault as APEI
> should have done everything necessary. KVM will re-enter the guest, unless there
> are signals pending.
In my code, there is not Tyler's such modification.
firstly, the poison error is related with RAS, and poison error match the is_abort_sea(fault_status)
so the KVM will exit,  and SIGBUS signal have no chance to be delivered.


I suggest it does not return in Tyler's patch or move kvm_send_hwpoison_signal function before return.


CC Tyler.


> 
> (You're right that here the fault_ipa is the wrong thing to pass, but
> handle_guest_sea() doesn't use it...)
> 
> We do need to enable HCR_EL2.TEA which was added with v8.2s RAS extensions, but
> the cpufeature patch is a pre-requisite.
HCR_EL2.TEA will Route synchronous External Abort exceptions from Non-secure EL0 and EL1 to EL2, if not routed to EL3.

in the firmware-first RAS solution, this bit can not control to route to El2 because the exception is route to EL3.

enabling HCR_EL2.TEA can be better for the non firmware-first RAS solution.
but can not solve this issue.

> 
> 
>> Let see the code that how to get the "pfn"
>>
>> ///get the pfn
>> fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
>> gfn = fault_ipa >> PAGE_SHIFT;
>> pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> 
>> As shown in above code, when happen SEA, the fault_ipa is got from the HPFAR_EL2 register.
>> if the HPFAR_EL2 does not record the IPA. the fault_ipa is zero, then gfn is zero, so the pfn is unknown.
> 
>> so below judgement always false although firmware notify the memory_failure through APEI, because we do not get the right fault memory page.
>> using this API "kvm_vcpu_get_fault_ipa" can not get the right fault memory page if cpu does not update the HPFAR_EL2.
> 
> The path to the below code that uses fault_ipa->gfn->pfn->hva is:
> kvm_handle_guest_abort()
> user_mem_abort()
> kvm_send_hwpoison_signal().
> 
> But for an external abort due to RAS we never get past kvm_handle_guest_abort()
> as we call out to the APEI ghes to handle the RAS error instead.
so this patch can not work.


> 
> For a data external abort that wasn't due to RAS we still don't get here as KVM
> will hit the vcpu with kvm_inject_vabt() instead.
poison error is only related with RAS, I think we can mainly consider the RAS.

> 
> 
>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> +		kvm_send_hwpoison_signal(hva, vma);
>> +		return 0;
>> +	}
> 
> Are you seeing a guest repeatedly trigger external-abort on the same address?
> If so, can you add debug messages to check if handle_guest_sea() is called? Does
> it find work to do? If so kvm_handle_guest_abort() should exit.

 If use Tyler's this modification, the kvm_handle_guest_abort will exit when triggering synchronous External Abort,
 in my verification, not have this modification, so the kvm_handle_guest_abort() does not exit
 this change is added recently by Tyler

> Do your CPER records cause memory_failure() to be run?
> Does try_to_unmap() in hwpoison_user_mappings() run and succeed? If so the
> faulty page should be unmapped from stage2, from now on the guest should only
> trigger normal stage2 faults for this address.
> 
> How does your firmware choose to route injected-External-Aborts for APEI
> notifications? Do we need to enable HCR_EL2.TEA to make this work properly?
my firmware will route the injected-External-Aborts to the hypervisor if the error come from guest OS.
enabling the CR_EL2.TEA can not solve this issue.
because the exception still route to EL3.  the SIGBUS signal have not chance to be sent
so I have two suggestions for this issue:
(1) modify Tyler's patch, not return.
(2) call the kvm_send_hwpoison_signal before return.


> 
> 
>> so may be you need to double confirm that whether armv8.0/armv8.2  standard
>> CPU can always update the  HPFAR_EL2 registers.
> 
> I don't know what the CPUs do, but the ARM-ARM allows the FAR to be not-valid
> for some external aborts. This is indicated by the Far-not-Valid bit in the ESR.
> 
> With firmware-first RAS the only external aborts that Linux should see are SEA
> APEI notifications. We shouldn't expect firmware to set much beyond the minimum
> for these. The KVM code touched by this patch shouldn't run for an APEI
> notification.
> 
> 
> Thanks,
> 
> James
> 
> 
> .
> 

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-22  6:47           ` gengdongjiu
  0 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-22  6:47 UTC (permalink / raw)
  To: linux-arm-kernel

Hi James,

On 2017/6/21 20:44, James Morse wrote:
> Hi gengdongjiu,
> 
> On 21/06/17 11:59, gengdongjiu wrote:
>> On 2017/6/21 17:53, James Morse wrote:
>>> I think we discussed this before[0], your CPU has a feature called 'hwpoison'
>>> that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
>>> which handles the offline-ing of memory pages when it receives a notification
>>> through APEI. I've tried to call this memory_failure() to avoid this confusion.
>>>
>>> This patch is to handle stage2 faults when the page was removed from the stage2
>>> mapping by the memory_failure() code. v3 of this patch[3] does a much better job
>>> of describing this.
>>>
>>> (... I don't think your question is related to this patch ...)
>>
>> I know your meaning about the Linux 'hwpoison' feature.
> 
> Okay, I assume we are also talking about firmware-first RAS events and your APEI
> notifications use SEA.
> 
> 
> I think we are looking at different parts of the code, here is what I see should
> happen:
> 
> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
> indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
> handle the fault.
The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the kvm_vcpu_dabt_isextabt() depend on the CPU design.
In my platform the EA bit is zero for the RAS Synchronous External Abort, so will not matches the kvm_vcpu_dabt_isextabt(), so the code will continue handle the fault.


static inline bool kvm_vcpu_dabt_isextabt(struct kvm_vcpu *vcpu)
{
	return kvm_vcpu_get_hsr(vcpu) & HSR_DABT_EA;
}

EA, bit [9]
	External Abort. As described in the Architecture Reference Manual. Provides an IMPLEMENTATION
	DEFINED classification of the external abort.
DFSC, bits [5:0], when synchronous external Data Abort
	Data Fault Status Code. The possible values of this field are:
	0b010000 Synchronous External Abort on memory access.
	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
	table. DFSC[1:0] encode the level.
	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
	used and reserved in the RAS extension.
IFSC, bits [5:0], when synchronous external Instruction Abort
	Instruction Fault Status Code. The possible values of this field are:
	0b010000 Synchronous External Abort on memory access.
	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
	table. IFSC[1:0] encode the level.
	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
	used and reserved in the RAS extension.

> 
> Tyler's RAS series added an earlier check:
>> /*
>>  * The host kernel will handle the synchronous external abort. There
>>  * is no need to pass the error into the guest.
>>  */
>> if (is_abort_sea(fault_status)) {
>> 	if (!handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
>> 		return 1;
>> ?}
>>
> 
> This goes on to call ghes_notify_sea() which will handle the error and cause KVM
> to exit this function. KVM makes no further attempt to handle the fault as APEI
> should have done everything necessary. KVM will re-enter the guest, unless there
> are signals pending.
In my code, there is not Tyler's such modification.
firstly, the poison error is related with RAS, and poison error match the is_abort_sea(fault_status)
so the KVM will exit,  and SIGBUS signal have no chance to be delivered.


I suggest it does not return in Tyler's patch or move kvm_send_hwpoison_signal function before return.


CC Tyler.


> 
> (You're right that here the fault_ipa is the wrong thing to pass, but
> handle_guest_sea() doesn't use it...)
> 
> We do need to enable HCR_EL2.TEA which was added with v8.2s RAS extensions, but
> the cpufeature patch is a pre-requisite.
HCR_EL2.TEA will Route synchronous External Abort exceptions from Non-secure EL0 and EL1 to EL2, if not routed to EL3.

in the firmware-first RAS solution, this bit can not control to route to El2 because the exception is route to EL3.

enabling HCR_EL2.TEA can be better for the non firmware-first RAS solution.
but can not solve this issue.

> 
> 
>> Let see the code that how to get the "pfn"
>>
>> ///get the pfn
>> fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
>> gfn = fault_ipa >> PAGE_SHIFT;
>> pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
> 
>> As shown in above code, when happen SEA, the fault_ipa is got from the HPFAR_EL2 register.
>> if the HPFAR_EL2 does not record the IPA. the fault_ipa is zero, then gfn is zero, so the pfn is unknown.
> 
>> so below judgement always false although firmware notify the memory_failure through APEI, because we do not get the right fault memory page.
>> using this API "kvm_vcpu_get_fault_ipa" can not get the right fault memory page if cpu does not update the HPFAR_EL2.
> 
> The path to the below code that uses fault_ipa->gfn->pfn->hva is:
> kvm_handle_guest_abort()
> user_mem_abort()
> kvm_send_hwpoison_signal().
> 
> But for an external abort due to RAS we never get past kvm_handle_guest_abort()
> as we call out to the APEI ghes to handle the RAS error instead.
so this patch can not work.


> 
> For a data external abort that wasn't due to RAS we still don't get here as KVM
> will hit the vcpu with kvm_inject_vabt() instead.
poison error is only related with RAS, I think we can mainly consider the RAS.

> 
> 
>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> +		kvm_send_hwpoison_signal(hva, vma);
>> +		return 0;
>> +	}
> 
> Are you seeing a guest repeatedly trigger external-abort on the same address?
> If so, can you add debug messages to check if handle_guest_sea() is called? Does
> it find work to do? If so kvm_handle_guest_abort() should exit.

 If use Tyler's this modification, the kvm_handle_guest_abort will exit when triggering synchronous External Abort,
 in my verification, not have this modification, so the kvm_handle_guest_abort() does not exit
 this change is added recently by Tyler

> Do your CPER records cause memory_failure() to be run?
> Does try_to_unmap() in hwpoison_user_mappings() run and succeed? If so the
> faulty page should be unmapped from stage2, from now on the guest should only
> trigger normal stage2 faults for this address.
> 
> How does your firmware choose to route injected-External-Aborts for APEI
> notifications? Do we need to enable HCR_EL2.TEA to make this work properly?
my firmware will route the injected-External-Aborts to the hypervisor if the error come from guest OS.
enabling the CR_EL2.TEA can not solve this issue.
because the exception still route to EL3.  the SIGBUS signal have not chance to be sent
so I have two suggestions for this issue:
(1) modify Tyler's patch, not return.
(2) call the kvm_send_hwpoison_signal before return.


> 
> 
>> so may be you need to double confirm that whether armv8.0/armv8.2  standard
>> CPU can always update the  HPFAR_EL2 registers.
> 
> I don't know what the CPUs do, but the ARM-ARM allows the FAR to be not-valid
> for some external aborts. This is indicated by the Far-not-Valid bit in the ESR.
> 
> With firmware-first RAS the only external aborts that Linux should see are SEA
> APEI notifications. We shouldn't expect firmware to set much beyond the minimum
> for these. The KVM code touched by this patch shouldn't run for an APEI
> notification.
> 
> 
> Thanks,
> 
> James
> 
> 
> .
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-21 12:44         ` James Morse
@ 2017-06-22 14:49           ` gengdongjiu
  -1 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-22 14:49 UTC (permalink / raw)
  To: James Morse
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	gengdongjiu, Huangshaoyu, kvmarm, linux-arm-kernel

Hi James,

2017-06-21 20:44 GMT+08:00, James Morse <james.morse@arm.com>:
> Hi gengdongjiu,
>
> On 21/06/17 11:59, gengdongjiu wrote:
>> On 2017/6/21 17:53, James Morse wrote:
>>> I think we discussed this before[0], your CPU has a feature called
>>> 'hwpoison'
>>> that is uses to support RAS. Linux also has a feature called 'hwpoison'
>>> [1][2],
>>> which handles the offline-ing of memory pages when it receives a
>>> notification
>>> through APEI. I've tried to call this memory_failure() to avoid this
>>> confusion.
>>>
>>> This patch is to handle stage2 faults when the page was removed from the
>>> stage2
>>> mapping by the memory_failure() code. v3 of this patch[3] does a much
>>> better job
>>> of describing this.
>>>
>>> (... I don't think your question is related to this patch ...)
>>
>> I know your meaning about the Linux 'hwpoison' feature.
>
> Okay, I assume we are also talking about firmware-first RAS events and your
> APEI
> notifications use SEA.
>
>
> I think we are looking at different parts of the code, here is what I see
> should
> happen:
>
> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range
> that
> indicates an external abort. For a data:external-abort
> kvm_handle_guest_abort()
> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt
> to
> handle the fault.
>
> Tyler's RAS series added an earlier check:
>> /*
>>  * The host kernel will handle the synchronous external abort. There
>>  * is no need to pass the error into the guest.
>>  */
>> if (is_abort_sea(fault_status)) {
>> 	if (!handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
>> 		return 1;
>> }
>>
>
> This goes on to call ghes_notify_sea() which will handle the error and cause
> KVM
> to exit this function. KVM makes no further attempt to handle the fault as
> APEI
> should have done everything necessary. KVM will re-enter the guest, unless
> there
> are signals pending.
>
> (You're right that here the fault_ipa is the wrong thing to pass, but
> handle_guest_sea() doesn't use it...)
>
> We do need to enable HCR_EL2.TEA which was added with v8.2s RAS extensions,
> but
> the cpufeature patch is a pre-requisite.
>
>
>> Let see the code that how to get the "pfn"
>>
>> ///get the pfn
>> fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
>> gfn = fault_ipa >> PAGE_SHIFT;
>> pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>
>> As shown in above code, when happen SEA, the fault_ipa is got from the
>> HPFAR_EL2 register.
>> if the HPFAR_EL2 does not record the IPA. the fault_ipa is zero, then gfn
>> is zero, so the pfn is unknown.
>
>> so below judgement always false although firmware notify the
>> memory_failure through APEI, because we do not get the right fault memory
>> page.
>> using this API "kvm_vcpu_get_fault_ipa" can not get the right fault memory
>> page if cpu does not update the HPFAR_EL2.
>
> The path to the below code that uses fault_ipa->gfn->pfn->hva is:
> kvm_handle_guest_abort()
> user_mem_abort()
> kvm_send_hwpoison_signal().
>
> But for an external abort due to RAS we never get past
> kvm_handle_guest_abort()
> as we call out to the APEI ghes to handle the RAS error instead.
>
> For a data external abort that wasn't due to RAS we still don't get here as
> KVM
> will hit the vcpu with kvm_inject_vabt() instead.
>
>
>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> +		kvm_send_hwpoison_signal(hva, vma);
>> +		return 0;
>> +	}
>
> Are you seeing a guest repeatedly trigger external-abort on the same
> address?
> If so, can you add debug messages to check if handle_guest_sea() is called?
> Does
> it find work to do? If so kvm_handle_guest_abort() should exit.
> Do your CPER records cause memory_failure() to be run?
> Does try_to_unmap() in hwpoison_user_mappings() run and succeed? If so the
> faulty page should be unmapped from stage2, from now on the guest should
> only
> trigger normal stage2 faults for this address.

in my local test, if guest OS happen Poison error. the fault_ipa is
zero, becuase the fault IPA does not recorded by the  HPFAR_EL2. this
is huawei's armv8.2 CPU behavior, I do not know whether other arm64
CPU have the same  behavior

>
> How does your firmware choose to route injected-External-Aborts for APEI
> notifications? Do we need to enable HCR_EL2.TEA to make this work properly?
>
>
>> so may be you need to double confirm that whether armv8.0/armv8.2
>> standard
>> CPU can always update the  HPFAR_EL2 registers.
>
> I don't know what the CPUs do, but the ARM-ARM allows the FAR to be
> not-valid
> for some external aborts. This is indicated by the Far-not-Valid bit in the
> ESR.
>
> With firmware-first RAS the only external aborts that Linux should see are
> SEA
> APEI notifications. We shouldn't expect firmware to set much beyond the
> minimum
> for these. The KVM code touched by this patch shouldn't run for an APEI
> notification.
>
>
> Thanks,
>
> James
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-22 14:49           ` gengdongjiu
  0 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-22 14:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi James,

2017-06-21 20:44 GMT+08:00, James Morse <james.morse@arm.com>:
> Hi gengdongjiu,
>
> On 21/06/17 11:59, gengdongjiu wrote:
>> On 2017/6/21 17:53, James Morse wrote:
>>> I think we discussed this before[0], your CPU has a feature called
>>> 'hwpoison'
>>> that is uses to support RAS. Linux also has a feature called 'hwpoison'
>>> [1][2],
>>> which handles the offline-ing of memory pages when it receives a
>>> notification
>>> through APEI. I've tried to call this memory_failure() to avoid this
>>> confusion.
>>>
>>> This patch is to handle stage2 faults when the page was removed from the
>>> stage2
>>> mapping by the memory_failure() code. v3 of this patch[3] does a much
>>> better job
>>> of describing this.
>>>
>>> (... I don't think your question is related to this patch ...)
>>
>> I know your meaning about the Linux 'hwpoison' feature.
>
> Okay, I assume we are also talking about firmware-first RAS events and your
> APEI
> notifications use SEA.
>
>
> I think we are looking at different parts of the code, here is what I see
> should
> happen:
>
> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range
> that
> indicates an external abort. For a data:external-abort
> kvm_handle_guest_abort()
> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt
> to
> handle the fault.
>
> Tyler's RAS series added an earlier check:
>> /*
>>  * The host kernel will handle the synchronous external abort. There
>>  * is no need to pass the error into the guest.
>>  */
>> if (is_abort_sea(fault_status)) {
>> 	if (!handle_guest_sea(fault_ipa, kvm_vcpu_get_hsr(vcpu)))
>> 		return 1;
>>?}
>>
>
> This goes on to call ghes_notify_sea() which will handle the error and cause
> KVM
> to exit this function. KVM makes no further attempt to handle the fault as
> APEI
> should have done everything necessary. KVM will re-enter the guest, unless
> there
> are signals pending.
>
> (You're right that here the fault_ipa is the wrong thing to pass, but
> handle_guest_sea() doesn't use it...)
>
> We do need to enable HCR_EL2.TEA which was added with v8.2s RAS extensions,
> but
> the cpufeature patch is a pre-requisite.
>
>
>> Let see the code that how to get the "pfn"
>>
>> ///get the pfn
>> fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
>> gfn = fault_ipa >> PAGE_SHIFT;
>> pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>
>> As shown in above code, when happen SEA, the fault_ipa is got from the
>> HPFAR_EL2 register.
>> if the HPFAR_EL2 does not record the IPA. the fault_ipa is zero, then gfn
>> is zero, so the pfn is unknown.
>
>> so below judgement always false although firmware notify the
>> memory_failure through APEI, because we do not get the right fault memory
>> page.
>> using this API "kvm_vcpu_get_fault_ipa" can not get the right fault memory
>> page if cpu does not update the HPFAR_EL2.
>
> The path to the below code that uses fault_ipa->gfn->pfn->hva is:
> kvm_handle_guest_abort()
> user_mem_abort()
> kvm_send_hwpoison_signal().
>
> But for an external abort due to RAS we never get past
> kvm_handle_guest_abort()
> as we call out to the APEI ghes to handle the RAS error instead.
>
> For a data external abort that wasn't due to RAS we still don't get here as
> KVM
> will hit the vcpu with kvm_inject_vabt() instead.
>
>
>> +	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> +		kvm_send_hwpoison_signal(hva, vma);
>> +		return 0;
>> +	}
>
> Are you seeing a guest repeatedly trigger external-abort on the same
> address?
> If so, can you add debug messages to check if handle_guest_sea() is called?
> Does
> it find work to do? If so kvm_handle_guest_abort() should exit.
> Do your CPER records cause memory_failure() to be run?
> Does try_to_unmap() in hwpoison_user_mappings() run and succeed? If so the
> faulty page should be unmapped from stage2, from now on the guest should
> only
> trigger normal stage2 faults for this address.

in my local test, if guest OS happen Poison error. the fault_ipa is
zero, becuase the fault IPA does not recorded by the  HPFAR_EL2. this
is huawei's armv8.2 CPU behavior, I do not know whether other arm64
CPU have the same  behavior

>
> How does your firmware choose to route injected-External-Aborts for APEI
> notifications? Do we need to enable HCR_EL2.TEA to make this work properly?
>
>
>> so may be you need to double confirm that whether armv8.0/armv8.2
>> standard
>> CPU can always update the  HPFAR_EL2 registers.
>
> I don't know what the CPUs do, but the ARM-ARM allows the FAR to be
> not-valid
> for some external aborts. This is indicated by the Far-not-Valid bit in the
> ESR.
>
> With firmware-first RAS the only external aborts that Linux should see are
> SEA
> APEI notifications. We shouldn't expect firmware to set much beyond the
> minimum
> for these. The KVM code touched by this patch shouldn't run for an APEI
> notification.
>
>
> Thanks,
>
> James
>
> _______________________________________________
> kvmarm mailing list
> kvmarm at lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-22  6:47           ` gengdongjiu
@ 2017-06-22 16:39             ` James Morse
  -1 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-22 16:39 UTC (permalink / raw)
  To: gengdongjiu
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	Tyler Baicar, Huangshaoyu, linux-arm-kernel, kvmarm

Hi gengdongjiu,

On 22/06/17 07:47, gengdongjiu wrote:
> On 2017/6/21 20:44, James Morse wrote:
>> On 21/06/17 11:59, gengdongjiu wrote:
>>> On 2017/6/21 17:53, James Morse wrote:
>>>> I think we discussed this before[0], your CPU has a feature called 'hwpoison'
>>>> that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
>>>> which handles the offline-ing of memory pages when it receives a notification
>>>> through APEI. I've tried to call this memory_failure() to avoid this confusion.
>>>>
>>>> This patch is to handle stage2 faults when the page was removed from the stage2
>>>> mapping by the memory_failure() code. v3 of this patch[3] does a much better job
>>>> of describing this.
>>>>
>>>> (... I don't think your question is related to this patch ...)
>>>
>>> I know your meaning about the Linux 'hwpoison' feature.

>> Okay, I assume we are also talking about firmware-first RAS events and your APEI
>> notifications use SEA.

It doesn't look like you are using APEI firmware-first or SEA notifications here.

>> I think we are looking at different parts of the code, here is what I see should
>> happen:
>>
>> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
>> indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
>> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
>> handle the fault.

> The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
> The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the kvm_vcpu_dabt_isextabt() depend on the CPU design.
> In my platform the EA bit is zero for the RAS Synchronous External Abort, so will not matches the kvm_vcpu_dabt_isextabt(),
> so the code will continue handle the fault.

This is the problem, you aren't notifying RAS errors to the OS with an
SEA/synchronous-external-abort.

For APEI firmware-first RAS you must use one of the notification methods listed
in ACPI 6.2s '18.3.2.9 Hardware Error Notification, Table 18-383'.

Notifications using 'SEA' hang on an existing part of the ARM architecture,
firmware has to emulate what the CPU does to avoid breaking existing code.

> EA, bit [9]
> 	External Abort. As described in the Architecture Reference Manual. Provides an IMPLEMENTATION
> 	DEFINED classification of the external abort.

This is from 'D7.2.28 - ESR_ELx, Exception Syndrome Register (ELx)' of
DDI0487B.a. If you don't set that bit, its not an external abort.
Implementation-defined refers to the classification: your CPU implementation is
allowed to choose what it classifies as 'external'.
For RAS, your firmware must set this bit if it wants to use SEA as the APEI
notification type.

What is the ESR_EL2 value when this happens?
Do you trap this to firmware first? (its not firmware-first RAS if you don't).

>> This goes on to call ghes_notify_sea() which will handle the error and cause KVM
>> to exit this function. KVM makes no further attempt to handle the fault as APEI
>> should have done everything necessary. KVM will re-enter the guest, unless there
>> are signals pending.

> In my code, there is not Tyler's such modification.

Tyler's series added support to KVM for handling RAS notifications via SEA.

> firstly, the poison error is related with RAS, and poison error match the is_abort_sea(fault_status)

(So it is an external abort? Now I'm confused)

> so the KVM will exit,  and SIGBUS signal have no chance to be delivered.

kvm_handle_guest_abort() will return 1 through handle_exit() and into
kvm_arch_vcpu_ioctl_run()'s while (ret > 0) { } loop.

KVM should re-enter the guest because the APEI GHES RAS code said it had handled
the error.

> I suggest it does not return in Tyler's patch

This is required because the notification about a RAS fault may not be a
sensible guest fault. In your example its not a sensible guest exit because you
haven't set HPFAR_EL2. KVM can't handle a guest fault without this register.

> or move kvm_send_hwpoison_signal function before return.

Here we are talking about RAS notifications from firmware. (1)

kvm_send_hwpoison_signal() is for handling normal stage2 faults where the page
to map has been poisoned (2). By this point the CPU's RAS features and firmware
are not involved.

These are two different things, a well behaved machine-monitor and guest will
never touch a poisoned page once its received a notification due to (1). This
patch addresses a corner case that covers Qemu/kvmtool not handling (1)
properly, or the guest ignoring the notification, (and all the races in between).

>> (You're right that here the fault_ipa is the wrong thing to pass, but
>> handle_guest_sea() doesn't use it...)
>>
>> We do need to enable HCR_EL2.TEA which was added with v8.2s RAS extensions, but
>> the cpufeature patch is a pre-requisite.

> HCR_EL2.TEA will Route synchronous External Abort exceptions from Non-secure EL0 and EL1 to EL2, if not routed to EL3.
> 
> in the firmware-first RAS solution, this bit can not control to route to El2 because the exception is route to EL3.
> 
> enabling HCR_EL2.TEA can be better for the non firmware-first RAS solution.
> but can not solve this issue.

Your firmware receives an external abort from the CPU, trapped to EL3 by
SCR_EL3.EA. It chooses to notify the OS via APEI's SEA notification type. How
does it choose whether to target EL2 or EL1 for the fake software-generated
SEA-notification? To correctly emulate what the CPU would do, it should inspect
HCR_EL2.TEA, and route the notification accordingly.

> my firmware will route the injected-External-Aborts to the hypervisor if the
> error come from guest OS.

How do you know there is a hypervisor at EL2 and guest OS at EL1? Disable both
KVM and VHE in Linux and boot. Now you have host Linux at EL1 and only the
hyp-stub at EL2.

These are the complications with hanging APEI notification on an existing part
of the architecture.

>> The path to the below code that uses fault_ipa->gfn->pfn->hva is:
>> kvm_handle_guest_abort()
>> user_mem_abort()
>> kvm_send_hwpoison_signal().
>>
>> But for an external abort due to RAS we never get past kvm_handle_guest_abort()
>> as we call out to the APEI ghes to handle the RAS error instead.

> so this patch can not work.

To notify Qemu/kvmtool about a RAS fault? That's not what this patch does.

v3 of this patch has a much better commit message:
----%<----
memory_failure() has two modes, early and late. Early is used by
machine-managers like Qemu to receive a notification when a memory error is
notified to the host. These can then be relayed to the guest before the
affected page is accessed. To enable this, the process must set
PR_MCE_KILL_EARLY in PR_MCE_KILL_SET using the prctl() syscall.

Once the early notification has been handled, nothing stops the
machine-manager or guest from accessing the affected page. If the
machine-manager does this the page will fail to be mapped and SIGBUS will
be sent. This patch adds the equivalent path for when the guest accesses
the page, sending SIGBUS to the machine-manager.

These two signals can be distinguished by the machine-manager using their
si_code: BUS_MCEERR_AO for 'action optional' early notifications, and
BUS_MCEERR_AR for 'action required' synchronous/late notifications.
----%<----

On receipt of a RAS notification, KVM calls out to the APEI code which handles
the error. KVM can now go back to running the guest.

Handling the error may have caused a signal to be sent to Qemu, in which case we
won't get back into the guest, as the signal_pending() test in
kvm_arch_vcpu_ioctl_run() will cause us to return to Qemu instead. This is the
early path. It happens when we get the RAS notification from firmware.

This patch is concerned with the late path. When either Qemu/kvmtool or the
guest failed to handle the early notification and went and touched the affected
memory again. In this case we don't take a fault via firmware as
memory_failure() has unmapped the page from stage2, we take a normal stage2
fault instead. KVM won't re-map the poisoned page, it sends a late notification
instead.

Thanks,

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-22 16:39             ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-22 16:39 UTC (permalink / raw)
  To: linux-arm-kernel

Hi gengdongjiu,

On 22/06/17 07:47, gengdongjiu wrote:
> On 2017/6/21 20:44, James Morse wrote:
>> On 21/06/17 11:59, gengdongjiu wrote:
>>> On 2017/6/21 17:53, James Morse wrote:
>>>> I think we discussed this before[0], your CPU has a feature called 'hwpoison'
>>>> that is uses to support RAS. Linux also has a feature called 'hwpoison' [1][2],
>>>> which handles the offline-ing of memory pages when it receives a notification
>>>> through APEI. I've tried to call this memory_failure() to avoid this confusion.
>>>>
>>>> This patch is to handle stage2 faults when the page was removed from the stage2
>>>> mapping by the memory_failure() code. v3 of this patch[3] does a much better job
>>>> of describing this.
>>>>
>>>> (... I don't think your question is related to this patch ...)
>>>
>>> I know your meaning about the Linux 'hwpoison' feature.

>> Okay, I assume we are also talking about firmware-first RAS events and your APEI
>> notifications use SEA.

It doesn't look like you are using APEI firmware-first or SEA notifications here.

>> I think we are looking at different parts of the code, here is what I see should
>> happen:
>>
>> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
>> indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
>> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
>> handle the fault.

> The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
> The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the kvm_vcpu_dabt_isextabt() depend on the CPU design.
> In my platform the EA bit is zero for the RAS Synchronous External Abort, so will not matches the kvm_vcpu_dabt_isextabt(),
> so the code will continue handle the fault.

This is the problem, you aren't notifying RAS errors to the OS with an
SEA/synchronous-external-abort.

For APEI firmware-first RAS you must use one of the notification methods listed
in ACPI 6.2s '18.3.2.9 Hardware Error Notification, Table 18-383'.

Notifications using 'SEA' hang on an existing part of the ARM architecture,
firmware has to emulate what the CPU does to avoid breaking existing code.

> EA, bit [9]
> 	External Abort. As described in the Architecture Reference Manual. Provides an IMPLEMENTATION
> 	DEFINED classification of the external abort.

This is from 'D7.2.28 - ESR_ELx, Exception Syndrome Register (ELx)' of
DDI0487B.a. If you don't set that bit, its not an external abort.
Implementation-defined refers to the classification: your CPU implementation is
allowed to choose what it classifies as 'external'.
For RAS, your firmware must set this bit if it wants to use SEA as the APEI
notification type.

What is the ESR_EL2 value when this happens?
Do you trap this to firmware first? (its not firmware-first RAS if you don't).

>> This goes on to call ghes_notify_sea() which will handle the error and cause KVM
>> to exit this function. KVM makes no further attempt to handle the fault as APEI
>> should have done everything necessary. KVM will re-enter the guest, unless there
>> are signals pending.

> In my code, there is not Tyler's such modification.

Tyler's series added support to KVM for handling RAS notifications via SEA.

> firstly, the poison error is related with RAS, and poison error match the is_abort_sea(fault_status)

(So it is an external abort? Now I'm confused)

> so the KVM will exit,  and SIGBUS signal have no chance to be delivered.

kvm_handle_guest_abort() will return 1 through handle_exit() and into
kvm_arch_vcpu_ioctl_run()'s while (ret > 0) { } loop.

KVM should re-enter the guest because the APEI GHES RAS code said it had handled
the error.

> I suggest it does not return in Tyler's patch

This is required because the notification about a RAS fault may not be a
sensible guest fault. In your example its not a sensible guest exit because you
haven't set HPFAR_EL2. KVM can't handle a guest fault without this register.

> or move kvm_send_hwpoison_signal function before return.

Here we are talking about RAS notifications from firmware. (1)

kvm_send_hwpoison_signal() is for handling normal stage2 faults where the page
to map has been poisoned (2). By this point the CPU's RAS features and firmware
are not involved.

These are two different things, a well behaved machine-monitor and guest will
never touch a poisoned page once its received a notification due to (1). This
patch addresses a corner case that covers Qemu/kvmtool not handling (1)
properly, or the guest ignoring the notification, (and all the races in between).

>> (You're right that here the fault_ipa is the wrong thing to pass, but
>> handle_guest_sea() doesn't use it...)
>>
>> We do need to enable HCR_EL2.TEA which was added with v8.2s RAS extensions, but
>> the cpufeature patch is a pre-requisite.

> HCR_EL2.TEA will Route synchronous External Abort exceptions from Non-secure EL0 and EL1 to EL2, if not routed to EL3.
> 
> in the firmware-first RAS solution, this bit can not control to route to El2 because the exception is route to EL3.
> 
> enabling HCR_EL2.TEA can be better for the non firmware-first RAS solution.
> but can not solve this issue.

Your firmware receives an external abort from the CPU, trapped to EL3 by
SCR_EL3.EA. It chooses to notify the OS via APEI's SEA notification type. How
does it choose whether to target EL2 or EL1 for the fake software-generated
SEA-notification? To correctly emulate what the CPU would do, it should inspect
HCR_EL2.TEA, and route the notification accordingly.

> my firmware will route the injected-External-Aborts to the hypervisor if the
> error come from guest OS.

How do you know there is a hypervisor at EL2 and guest OS at EL1? Disable both
KVM and VHE in Linux and boot. Now you have host Linux at EL1 and only the
hyp-stub at EL2.

These are the complications with hanging APEI notification on an existing part
of the architecture.

>> The path to the below code that uses fault_ipa->gfn->pfn->hva is:
>> kvm_handle_guest_abort()
>> user_mem_abort()
>> kvm_send_hwpoison_signal().
>>
>> But for an external abort due to RAS we never get past kvm_handle_guest_abort()
>> as we call out to the APEI ghes to handle the RAS error instead.

> so this patch can not work.

To notify Qemu/kvmtool about a RAS fault? That's not what this patch does.

v3 of this patch has a much better commit message:
----%<----
memory_failure() has two modes, early and late. Early is used by
machine-managers like Qemu to receive a notification when a memory error is
notified to the host. These can then be relayed to the guest before the
affected page is accessed. To enable this, the process must set
PR_MCE_KILL_EARLY in PR_MCE_KILL_SET using the prctl() syscall.

Once the early notification has been handled, nothing stops the
machine-manager or guest from accessing the affected page. If the
machine-manager does this the page will fail to be mapped and SIGBUS will
be sent. This patch adds the equivalent path for when the guest accesses
the page, sending SIGBUS to the machine-manager.

These two signals can be distinguished by the machine-manager using their
si_code: BUS_MCEERR_AO for 'action optional' early notifications, and
BUS_MCEERR_AR for 'action required' synchronous/late notifications.
----%<----

On receipt of a RAS notification, KVM calls out to the APEI code which handles
the error. KVM can now go back to running the guest.

Handling the error may have caused a signal to be sent to Qemu, in which case we
won't get back into the guest, as the signal_pending() test in
kvm_arch_vcpu_ioctl_run() will cause us to return to Qemu instead. This is the
early path. It happens when we get the RAS notification from firmware.

This patch is concerned with the late path. When either Qemu/kvmtool or the
guest failed to handle the early notification and went and touched the affected
memory again. In this case we don't take a fault via firmware as
memory_failure() has unmapped the page from stage2, we take a normal stage2
fault instead. KVM won't re-map the poisoned page, it sends a late notification
instead.

Thanks,

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-22  6:47           ` gengdongjiu
@ 2017-06-23  9:38             ` James Morse
  -1 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-23  9:38 UTC (permalink / raw)
  To: gengdongjiu
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	Tyler Baicar, Huangshaoyu, linux-arm-kernel, kvmarm

Hi gengdongjiu,

I've spotted where we are disagreeing: there are two bits in the ESR relevant to
external aborts. I've been treating them the same.

It looks like KVM is testing the wrong bit.


On 22/06/17 07:47, gengdongjiu wrote:
> On 2017/6/21 20:44, James Morse wrote:
>> I think we are looking at different parts of the code, here is what I see should
>> happen:
>>
>> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
>> indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
>> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
>> handle the fault.
> The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
> The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the kvm_vcpu_dabt_isextabt() depend on the CPU design.
> In my platform the EA bit is zero for the RAS Synchronous External Abort, so will not matches the kvm_vcpu_dabt_isextabt(), so the code will continue handle the fault.
> 
> 
> static inline bool kvm_vcpu_dabt_isextabt(struct kvm_vcpu *vcpu)
> {
> 	return kvm_vcpu_get_hsr(vcpu) & HSR_DABT_EA;
> }
> 
> EA, bit [9]
> 	External Abort. As described in the Architecture Reference Manual. Provides an IMPLEMENTATION
> 	DEFINED classification of the external abort.

(and here you pointed it out, sorry I missed it).

This bit 9 isn't the same as the {I,D}FSC bit 4, which is also always set for an
external abort.


> DFSC, bits [5:0], when synchronous external Data Abort
> 	Data Fault Status Code. The possible values of this field are:
> 	0b010000 Synchronous External Abort on memory access.
> 	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
> 	table. DFSC[1:0] encode the level.
> 	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
> 	used and reserved in the RAS extension.
> IFSC, bits [5:0], when synchronous external Instruction Abort
> 	Instruction Fault Status Code. The possible values of this field are:
> 	0b010000 Synchronous External Abort on memory access.
> 	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
> 	table. IFSC[1:0] encode the level.
> 	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
> 	used and reserved in the RAS extension.


So the v8 ARM-ARM has two ways of indicating an external abort, KVM tests bit 9.
The v7 ARM-ARm has a '{D,I}FSR.ExT' bit, which I think is equivalent. Its
described as:
> An implementation can use the DFSR.ExT and IFSR.ExT bits to provide more
> information about external aborts.

This is what the v8 version must mean with its
> External abort type. This bit can provide an IMPLEMENTATION DEFINED
> classification of External aborts.

Which I read as IMP-DEF classification 'as external', as opposed to your reading
as an extra IMP-DEF classification for external aborts.

It looks like an implementation could use this to mean 'how external'. For RAS
an implementation could use it to separate external aborts handled first by
firmware, from those generated directly by the CPU.

So which bits should Linux test? Bit 9 has some extra IMP-DEF meaning, so Linux
should never look at it, which means KVM has a bug handling these.
I will send a patch, are you able to review and test it?


Thanks!

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-23  9:38             ` James Morse
  0 siblings, 0 replies; 38+ messages in thread
From: James Morse @ 2017-06-23  9:38 UTC (permalink / raw)
  To: linux-arm-kernel

Hi gengdongjiu,

I've spotted where we are disagreeing: there are two bits in the ESR relevant to
external aborts. I've been treating them the same.

It looks like KVM is testing the wrong bit.


On 22/06/17 07:47, gengdongjiu wrote:
> On 2017/6/21 20:44, James Morse wrote:
>> I think we are looking at different parts of the code, here is what I see should
>> happen:
>>
>> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
>> indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
>> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
>> handle the fault.
> The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
> The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the kvm_vcpu_dabt_isextabt() depend on the CPU design.
> In my platform the EA bit is zero for the RAS Synchronous External Abort, so will not matches the kvm_vcpu_dabt_isextabt(), so the code will continue handle the fault.
> 
> 
> static inline bool kvm_vcpu_dabt_isextabt(struct kvm_vcpu *vcpu)
> {
> 	return kvm_vcpu_get_hsr(vcpu) & HSR_DABT_EA;
> }
> 
> EA, bit [9]
> 	External Abort. As described in the Architecture Reference Manual. Provides an IMPLEMENTATION
> 	DEFINED classification of the external abort.

(and here you pointed it out, sorry I missed it).

This bit 9 isn't the same as the {I,D}FSC bit 4, which is also always set for an
external abort.


> DFSC, bits [5:0], when synchronous external Data Abort
> 	Data Fault Status Code. The possible values of this field are:
> 	0b010000 Synchronous External Abort on memory access.
> 	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
> 	table. DFSC[1:0] encode the level.
> 	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
> 	used and reserved in the RAS extension.
> IFSC, bits [5:0], when synchronous external Instruction Abort
> 	Instruction Fault Status Code. The possible values of this field are:
> 	0b010000 Synchronous External Abort on memory access.
> 	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
> 	table. IFSC[1:0] encode the level.
> 	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
> 	used and reserved in the RAS extension.


So the v8 ARM-ARM has two ways of indicating an external abort, KVM tests bit 9.
The v7 ARM-ARm has a '{D,I}FSR.ExT' bit, which I think is equivalent. Its
described as:
> An implementation can use the DFSR.ExT and IFSR.ExT bits to provide more
> information about external aborts.

This is what the v8 version must mean with its
> External abort type. This bit can provide an IMPLEMENTATION DEFINED
> classification of External aborts.

Which I read as IMP-DEF classification 'as external', as opposed to your reading
as an extra IMP-DEF classification for external aborts.

It looks like an implementation could use this to mean 'how external'. For RAS
an implementation could use it to separate external aborts handled first by
firmware, from those generated directly by the CPU.

So which bits should Linux test? Bit 9 has some extra IMP-DEF meaning, so Linux
should never look at it, which means KVM has a bug handling these.
I will send a patch, are you able to review and test it?


Thanks!

James

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-23  9:38             ` James Morse
@ 2017-06-23 10:18               ` Peter Maydell
  -1 siblings, 0 replies; 38+ messages in thread
From: Peter Maydell @ 2017-06-23 10:18 UTC (permalink / raw)
  To: James Morse
  Cc: wuquanming, Punit Agrawal, Marc Zyngier, Achin Gupta,
	Tyler Baicar, gengdongjiu, Huangshaoyu, kvmarm, arm-mail-list

On 23 June 2017 at 10:38, James Morse <james.morse@arm.com> wrote:
> So the v8 ARM-ARM has two ways of indicating an external abort, KVM tests bit 9
> The v7 ARM-ARm has a '{D,I}FSR.ExT' bit, which I think is equivalent. Its
> described as:
>> An implementation can use the DFSR.ExT and IFSR.ExT bits to provide more
>> information about external aborts.

For instance, the Cortex-A15 TRM documents that it uses the ExT bit
to distinguish AXI DECERR (decode error, ExT==0) and SLVERR (slave
error, ExT==1) abort responses.

>> This is what the v8 version must mean with its
>> External abort type. This bit can provide an IMPLEMENTATION DEFINED
>> classification of External aborts.
>
> Which I read as IMP-DEF classification 'as external', as opposed to your reading
> as an extra IMP-DEF classification for external aborts.

I think gengdongjiu's reading is correct here: if this
is an external abort then you can find out more info about
exactly what kind of EA it is by looking at this bit, but
you have to look at the FSC bits to be sure it is an EA first.

The Cortex-A53 TRM documents that it uses the same SLVERR-vs-DECERR
semantics for DFSR.ExT as the A15. It doesn't say so but it wouldn't
be very surprising if its usage of ESR_ELx.EA was the same.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-23 10:18               ` Peter Maydell
  0 siblings, 0 replies; 38+ messages in thread
From: Peter Maydell @ 2017-06-23 10:18 UTC (permalink / raw)
  To: linux-arm-kernel

On 23 June 2017 at 10:38, James Morse <james.morse@arm.com> wrote:
> So the v8 ARM-ARM has two ways of indicating an external abort, KVM tests bit 9
> The v7 ARM-ARm has a '{D,I}FSR.ExT' bit, which I think is equivalent. Its
> described as:
>> An implementation can use the DFSR.ExT and IFSR.ExT bits to provide more
>> information about external aborts.

For instance, the Cortex-A15 TRM documents that it uses the ExT bit
to distinguish AXI DECERR (decode error, ExT==0) and SLVERR (slave
error, ExT==1) abort responses.

>> This is what the v8 version must mean with its
>> External abort type. This bit can provide an IMPLEMENTATION DEFINED
>> classification of External aborts.
>
> Which I read as IMP-DEF classification 'as external', as opposed to your reading
> as an extra IMP-DEF classification for external aborts.

I think gengdongjiu's reading is correct here: if this
is an external abort then you can find out more info about
exactly what kind of EA it is by looking at this bit, but
you have to look at the FSC bits to be sure it is an EA first.

The Cortex-A53 TRM documents that it uses the same SLVERR-vs-DECERR
semantics for DFSR.ExT as the A15. It doesn't say so but it wouldn't
be very surprising if its usage of ESR_ELx.EA was the same.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-23  9:38             ` James Morse
@ 2017-06-24  8:23               ` gengdongjiu
  -1 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-24  8:23 UTC (permalink / raw)
  To: James Morse
  Cc: wuquanming, Marc Zyngier, Achin Gupta, Punit Agrawal,
	Tyler Baicar, Huangshaoyu, linux-arm-kernel, huhuajun, kvmarm,
	Christoffer Dall

Hi James,
  sorry for the late response.

On 2017/6/23 17:38, James Morse wrote:
> Hi gengdongjiu,
> 
> I've spotted where we are disagreeing: there are two bits in the ESR relevant to
> external aborts. I've been treating them the same.
Ok.

> 
> It looks like KVM is testing the wrong bit.
I agree with you the KVM should not test this bit, because it is IMPLEMENTATION DEFINED classification.
from this patch history, it is made by Marc. do know whether it has some special reason to make this patch.

commit 4055710baca8cb067b059a635cf851eb86410a83
Author: Marc Zyngier <marc.zyngier@arm.com>
Date:   Tue Sep 6 14:02:15 2016 +0100

    arm/arm64: KVM: Inject virtual abort when guest exits on external abort

    If we spot a data abort bearing the ESR_EL2.EA bit set, we know that
    this is an external abort, and that should be punished by the injection
    of an abort.

    Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 344755d..60e0c1a 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1432,6 +1432,11 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
        int ret, idx;

        is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
+       if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu))) {
+               kvm_inject_vabt(vcpu);
+               return 1;
+       }
+
        fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);

        trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_hsr(vcpu)

> 
> 
> On 22/06/17 07:47, gengdongjiu wrote:
>> On 2017/6/21 20:44, James Morse wrote:
>>> I think we are looking at different parts of the code, here is what I see should
>>> happen:
>>>
>>> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
>>> indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
>>> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
>>> handle the fault.
>> The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
>> The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the kvm_vcpu_dabt_isextabt() depend on the CPU design.
>> In my platform the EA bit is zero for the RAS Synchronous External Abort, so will not matches the kvm_vcpu_dabt_isextabt(), so the code will continue handle the fault.
>>
>>
>> static inline bool kvm_vcpu_dabt_isextabt(struct kvm_vcpu *vcpu)
>> {
>> 	return kvm_vcpu_get_hsr(vcpu) & HSR_DABT_EA;
>> }
>>
>> EA, bit [9]
>> 	External Abort. As described in the Architecture Reference Manual. Provides an IMPLEMENTATION
>> 	DEFINED classification of the external abort.
> 
> (and here you pointed it out, sorry I missed it).
> 
> This bit 9 isn't the same as the {I,D}FSC bit 4, which is also always set for an
> external abort.
 yes, they are two different bits

> 
> 
>> DFSC, bits [5:0], when synchronous external Data Abort
>> 	Data Fault Status Code. The possible values of this field are:
>> 	0b010000 Synchronous External Abort on memory access.
>> 	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
>> 	table. DFSC[1:0] encode the level.
>> 	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
>> 	used and reserved in the RAS extension.
>> IFSC, bits [5:0], when synchronous external Instruction Abort
>> 	Instruction Fault Status Code. The possible values of this field are:
>> 	0b010000 Synchronous External Abort on memory access.
>> 	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
>> 	table. IFSC[1:0] encode the level.
>> 	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
>> 	used and reserved in the RAS extension.
> 
> 
> So the v8 ARM-ARM has two ways of indicating an external abort, KVM tests bit 9.
> The v7 ARM-ARm has a '{D,I}FSR.ExT' bit, which I think is equivalent. Its
> described as:
>> An implementation can use the DFSR.ExT and IFSR.ExT bits to provide more
>> information about external aborts.
> 
> This is what the v8 version must mean with its
>> External abort type. This bit can provide an IMPLEMENTATION DEFINED
>> classification of External aborts.
> 
> Which I read as IMP-DEF classification 'as external', as opposed to your reading
> as an extra IMP-DEF classification for external aborts.
OK.
> 
> It looks like an implementation could use this to mean 'how external'. For RAS
> an implementation could use it to separate external aborts handled first by
> firmware, from those generated directly by the CPU.
  for the armv8.0 CPU which does not have RAS feature, also use the EA bit, so may not directly use it separate whether external aborts should be handled first by
  firmware

> 
> So which bits should Linux test? Bit 9 has some extra IMP-DEF meaning, so Linux
> should never look at it, which means KVM has a bug handling these.
  what do you want Linux to test? to test the type of error? check the {I,D}FSC should be OK.

> I will send a patch, are you able to review and test it?
  sure, it is no problem.

> 
> 
> Thanks!
> 
> James
> 
> .
> 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-24  8:23               ` gengdongjiu
  0 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-24  8:23 UTC (permalink / raw)
  To: linux-arm-kernel

Hi James,
  sorry for the late response.

On 2017/6/23 17:38, James Morse wrote:
> Hi gengdongjiu,
> 
> I've spotted where we are disagreeing: there are two bits in the ESR relevant to
> external aborts. I've been treating them the same.
Ok.

> 
> It looks like KVM is testing the wrong bit.
I agree with you the KVM should not test this bit, because it is IMPLEMENTATION DEFINED classification.
from this patch history, it is made by Marc. do know whether it has some special reason to make this patch.

commit 4055710baca8cb067b059a635cf851eb86410a83
Author: Marc Zyngier <marc.zyngier@arm.com>
Date:   Tue Sep 6 14:02:15 2016 +0100

    arm/arm64: KVM: Inject virtual abort when guest exits on external abort

    If we spot a data abort bearing the ESR_EL2.EA bit set, we know that
    this is an external abort, and that should be punished by the injection
    of an abort.

    Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 344755d..60e0c1a 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1432,6 +1432,11 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
        int ret, idx;

        is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
+       if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu))) {
+               kvm_inject_vabt(vcpu);
+               return 1;
+       }
+
        fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);

        trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_hsr(vcpu)

> 
> 
> On 22/06/17 07:47, gengdongjiu wrote:
>> On 2017/6/21 20:44, James Morse wrote:
>>> I think we are looking at different parts of the code, here is what I see should
>>> happen:
>>>
>>> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the range that
>>> indicates an external abort. For a data:external-abort kvm_handle_guest_abort()
>>> will matches this with kvm_vcpu_dabt_isextabt() and makes no further attempt to
>>> handle the fault.
>> The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
>> The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the kvm_vcpu_dabt_isextabt() depend on the CPU design.
>> In my platform the EA bit is zero for the RAS Synchronous External Abort, so will not matches the kvm_vcpu_dabt_isextabt(), so the code will continue handle the fault.
>>
>>
>> static inline bool kvm_vcpu_dabt_isextabt(struct kvm_vcpu *vcpu)
>> {
>> 	return kvm_vcpu_get_hsr(vcpu) & HSR_DABT_EA;
>> }
>>
>> EA, bit [9]
>> 	External Abort. As described in the Architecture Reference Manual. Provides an IMPLEMENTATION
>> 	DEFINED classification of the external abort.
> 
> (and here you pointed it out, sorry I missed it).
> 
> This bit 9 isn't the same as the {I,D}FSC bit 4, which is also always set for an
> external abort.
 yes, they are two different bits

> 
> 
>> DFSC, bits [5:0], when synchronous external Data Abort
>> 	Data Fault Status Code. The possible values of this field are:
>> 	0b010000 Synchronous External Abort on memory access.
>> 	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
>> 	table. DFSC[1:0] encode the level.
>> 	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
>> 	used and reserved in the RAS extension.
>> IFSC, bits [5:0], when synchronous external Instruction Abort
>> 	Instruction Fault Status Code. The possible values of this field are:
>> 	0b010000 Synchronous External Abort on memory access.
>> 	0b0101xx Synchronous External Abort on translation table walk or hardware update of translation
>> 	table. IFSC[1:0] encode the level.
>> 	Further values are described in the Architecture Reference Manual. The Parity Error codes are not
>> 	used and reserved in the RAS extension.
> 
> 
> So the v8 ARM-ARM has two ways of indicating an external abort, KVM tests bit 9.
> The v7 ARM-ARm has a '{D,I}FSR.ExT' bit, which I think is equivalent. Its
> described as:
>> An implementation can use the DFSR.ExT and IFSR.ExT bits to provide more
>> information about external aborts.
> 
> This is what the v8 version must mean with its
>> External abort type. This bit can provide an IMPLEMENTATION DEFINED
>> classification of External aborts.
> 
> Which I read as IMP-DEF classification 'as external', as opposed to your reading
> as an extra IMP-DEF classification for external aborts.
OK.
> 
> It looks like an implementation could use this to mean 'how external'. For RAS
> an implementation could use it to separate external aborts handled first by
> firmware, from those generated directly by the CPU.
  for the armv8.0 CPU which does not have RAS feature, also use the EA bit, so may not directly use it separate whether external aborts should be handled first by
  firmware

> 
> So which bits should Linux test? Bit 9 has some extra IMP-DEF meaning, so Linux
> should never look at it, which means KVM has a bug handling these.
  what do you want Linux to test? to test the type of error? check the {I,D}FSC should be OK.

> I will send a patch, are you able to review and test it?
  sure, it is no problem.

> 
> 
> Thanks!
> 
> James
> 
> .
> 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-22 16:39             ` James Morse
@ 2017-06-24 14:56               ` gengdongjiu
  -1 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-24 14:56 UTC (permalink / raw)
  To: James Morse
  Cc: wuquanming, Punit Agrawal, Marc Zyngier, Achin Gupta,
	Tyler Baicar, gengdongjiu, Huangshaoyu, kvmarm, linux-arm-kernel

Hi James,

2017-06-23 0:39 GMT+08:00, James Morse <james.morse@arm.com>:
> Hi gengdongjiu,
>
> On 22/06/17 07:47, gengdongjiu wrote:
>> On 2017/6/21 20:44, James Morse wrote:
>>> On 21/06/17 11:59, gengdongjiu wrote:
>>>> On 2017/6/21 17:53, James Morse wrote:
>>>>> I think we discussed this before[0], your CPU has a feature called
>>>>> 'hwpoison'
>>>>> that is uses to support RAS. Linux also has a feature called 'hwpoison'
>>>>> [1][2],
>>>>> which handles the offline-ing of memory pages when it receives a
>>>>> notification
>>>>> through APEI. I've tried to call this memory_failure() to avoid this
>>>>> confusion.
>>>>>
>>>>> This patch is to handle stage2 faults when the page was removed from
>>>>> the stage2
>>>>> mapping by the memory_failure() code. v3 of this patch[3] does a much
>>>>> better job
>>>>> of describing this.
>>>>>
>>>>> (... I don't think your question is related to this patch ...)
>>>>
>>>> I know your meaning about the Linux 'hwpoison' feature.
>
>>> Okay, I assume we are also talking about firmware-first RAS events and
>>> your APEI
>>> notifications use SEA.
>
> It doesn't look like you are using APEI firmware-first or SEA notifications
> here.
>
>
>>> I think we are looking at different parts of the code, here is what I see
>>> should
>>> happen:
>>>
>>> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the
>>> range that
>>> indicates an external abort. For a data:external-abort
>>> kvm_handle_guest_abort()
>>> will matches this with kvm_vcpu_dabt_isextabt() and makes no further
>>> attempt to
>>> handle the fault.
>
>> The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
>> The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the
>> kvm_vcpu_dabt_isextabt() depend on the CPU design.
>> In my platform the EA bit is zero for the RAS Synchronous External Abort,
>> so will not matches the kvm_vcpu_dabt_isextabt(),
>> so the code will continue handle the fault.
>
> This is the problem, you aren't notifying RAS errors to the OS with an
> SEA/synchronous-external-abort.
>
> For APEI firmware-first RAS you must use one of the notification methods
> listed
> in ACPI 6.2s '18.3.2.9 Hardware Error Notification, Table 18-383'.
>
> Notifications using 'SEA' hang on an existing part of the ARM architecture,
> firmware has to emulate what the CPU does to avoid breaking existing code.
>
>
>> EA, bit [9]
>> 	External Abort. As described in the Architecture Reference Manual.
>> Provides an IMPLEMENTATION
>> 	DEFINED classification of the external abort.
>
> This is from 'D7.2.28 - ESR_ELx, Exception Syndrome Register (ELx)' of
> DDI0487B.a. If you don't set that bit, its not an external abort.
> Implementation-defined refers to the classification: your CPU implementation
> is
> allowed to choose what it classifies as 'external'.
> For RAS, your firmware must set this bit if it wants to use SEA as the APEI
> notification type.
  I think this bit are not must set by firmware if the esr_el3 does
not set this bit.

>
> What is the ESR_EL2 value when this happens?
> Do you trap this to firmware first? (its not firmware-first RAS if you
> don't).
>
>
>>> This goes on to call ghes_notify_sea() which will handle the error and
>>> cause KVM
>>> to exit this function. KVM makes no further attempt to handle the fault
>>> as APEI
>>> should have done everything necessary. KVM will re-enter the guest,
>>> unless there
>>> are signals pending.
>
>> In my code, there is not Tyler's such modification.
>
> Tyler's series added support to KVM for handling RAS notifications via SEA.
>
>
>> firstly, the poison error is related with RAS, and poison error match the
>> is_abort_sea(fault_status)
>
> (So it is an external abort? Now I'm confused)
    you may misunderstand that as your another mail mentioned

>
>
>> so the KVM will exit,  and SIGBUS signal have no chance to be delivered.
>
> kvm_handle_guest_abort() will return 1 through handle_exit() and into
> kvm_arch_vcpu_ioctl_run()'s while (ret > 0) { } loop.
>
> KVM should re-enter the guest because the APEI GHES RAS code said it had
> handled
> the error.
yes, it is!  sorry, I  I didn't say it clearly, it should
kvm_handle_guest_abort() exit, not kvm exit.  thank you for your
correction.

>
>
>> I suggest it does not return in Tyler's patch
>
> This is required because the notification about a RAS fault may not be a
> sensible guest fault. In your example its not a sensible guest exit because
> you
> haven't set HPFAR_EL2. KVM can't handle a guest fault without this
> register.
In the firmware-first RAS solution, the HPFAR_EL2 does not set in any
case, becuase it traps to EL3, not EL2.
I think we should use "AT" instruction to translation GVA to IPA
address  and write the translation result to the hpfar_el2 in the
firmware.

>
>
>> or move kvm_send_hwpoison_signal function before return.
>
> Here we are talking about RAS notifications from firmware. (1)
>
> kvm_send_hwpoison_signal() is for handling normal stage2 faults where the
> page
> to map has been poisoned (2). By this point the CPU's RAS features and
> firmware
> are not involved.
>
> These are two different things, a well behaved machine-monitor and guest
> will
> never touch a poisoned page once its received a notification due to (1).
> This
> patch addresses a corner case that covers Qemu/kvmtool not handling (1)
> properly, or the guest ignoring the notification, (and all the races in
> between).
>
>
>>> (You're right that here the fault_ipa is the wrong thing to pass, but
>>> handle_guest_sea() doesn't use it...)
>>>
>>> We do need to enable HCR_EL2.TEA which was added with v8.2s RAS
>>> extensions, but
>>> the cpufeature patch is a pre-requisite.
>
>> HCR_EL2.TEA will Route synchronous External Abort exceptions from
>> Non-secure EL0 and EL1 to EL2, if not routed to EL3.
>>
>> in the firmware-first RAS solution, this bit can not control to route to
>> El2 because the exception is route to EL3.
>>
>> enabling HCR_EL2.TEA can be better for the non firmware-first RAS
>> solution.
>> but can not solve this issue.
>
> Your firmware receives an external abort from the CPU, trapped to EL3 by
> SCR_EL3.EA. It chooses to notify the OS via APEI's SEA notification type.
> How
> does it choose whether to target EL2 or EL1 for the fake software-generated
> SEA-notification? To correctly emulate what the CPU would do, it should
> inspect
> HCR_EL2.TEA, and route the notification accordingly.
  James, you are right, we should set the HCR_EL2.TEA, thanks for your
suggestion. I have submitted a patch for your review.

>
>
>> my firmware will route the injected-External-Aborts to the hypervisor if
>> the
>> error come from guest OS.
>
> How do you know there is a hypervisor at EL2 and guest OS at EL1? Disable
In my previous solution, firmware will check the vttbr_el2, if the
vttbr_el2 is not zero, then firmware jump to El1. otherwise, if the
vttbr_el2 is zero, jump to EL2.  this way may not good,  set the
hcr_el2.TEA is reasonable.

> both
> KVM and VHE in Linux and boot. Now you have host Linux at EL1 and only the
> hyp-stub at EL2.
>
> These are the complications with hanging APEI notification on an existing
> part
> of the architecture.
>
>
>>> The path to the below code that uses fault_ipa->gfn->pfn->hva is:
>>> kvm_handle_guest_abort()
>>> user_mem_abort()
>>> kvm_send_hwpoison_signal().
>>>
>>> But for an external abort due to RAS we never get past
>>> kvm_handle_guest_abort()
>>> as we call out to the APEI ghes to handle the RAS error instead.
>
>> so this patch can not work.
>
> To notify Qemu/kvmtool about a RAS fault? That's not what this patch does.
>
> v3 of this patch has a much better commit message:
> ----%<----
> memory_failure() has two modes, early and late. Early is used by
> machine-managers like Qemu to receive a notification when a memory error is
> notified to the host. These can then be relayed to the guest before the
> affected page is accessed. To enable this, the process must set
> PR_MCE_KILL_EARLY in PR_MCE_KILL_SET using the prctl() syscall.
>
> Once the early notification has been handled, nothing stops the
> machine-manager or guest from accessing the affected page. If the
> machine-manager does this the page will fail to be mapped and SIGBUS will
> be sent. This patch adds the equivalent path for when the guest accesses
> the page, sending SIGBUS to the machine-manager.
>
> These two signals can be distinguished by the machine-manager using their
> si_code: BUS_MCEERR_AO for 'action optional' early notifications, and
> BUS_MCEERR_AR for 'action required' synchronous/late notifications.
> ----%<----
>
> On receipt of a RAS notification, KVM calls out to the APEI code which
> handles
> the error. KVM can now go back to running the guest.
>
> Handling the error may have caused a signal to be sent to Qemu, in which
> case we
> won't get back into the guest, as the signal_pending() test in
> kvm_arch_vcpu_ioctl_run() will cause us to return to Qemu instead. This is
> the
> early path. It happens when we get the RAS notification from firmware.
>
>
> This patch is concerned with the late path. When either Qemu/kvmtool or the
> guest failed to handle the early notification and went and touched the
> affected
> memory again. In this case we don't take a fault via firmware as
> memory_failure() has unmapped the page from stage2, we take a normal stage2
> fault instead. KVM won't re-map the poisoned page, it sends a late
> notification
> instead.
  than you very much for your detailed explaination, got it.

>
>
> Thanks,
>
> James
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-24 14:56               ` gengdongjiu
  0 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-24 14:56 UTC (permalink / raw)
  To: linux-arm-kernel

Hi James,

2017-06-23 0:39 GMT+08:00, James Morse <james.morse@arm.com>:
> Hi gengdongjiu,
>
> On 22/06/17 07:47, gengdongjiu wrote:
>> On 2017/6/21 20:44, James Morse wrote:
>>> On 21/06/17 11:59, gengdongjiu wrote:
>>>> On 2017/6/21 17:53, James Morse wrote:
>>>>> I think we discussed this before[0], your CPU has a feature called
>>>>> 'hwpoison'
>>>>> that is uses to support RAS. Linux also has a feature called 'hwpoison'
>>>>> [1][2],
>>>>> which handles the offline-ing of memory pages when it receives a
>>>>> notification
>>>>> through APEI. I've tried to call this memory_failure() to avoid this
>>>>> confusion.
>>>>>
>>>>> This patch is to handle stage2 faults when the page was removed from
>>>>> the stage2
>>>>> mapping by the memory_failure() code. v3 of this patch[3] does a much
>>>>> better job
>>>>> of describing this.
>>>>>
>>>>> (... I don't think your question is related to this patch ...)
>>>>
>>>> I know your meaning about the Linux 'hwpoison' feature.
>
>>> Okay, I assume we are also talking about firmware-first RAS events and
>>> your APEI
>>> notifications use SEA.
>
> It doesn't look like you are using APEI firmware-first or SEA notifications
> here.
>
>
>>> I think we are looking at different parts of the code, here is what I see
>>> should
>>> happen:
>>>
>>> For a Synchronous External Abort the ESR {D,I}FSC bits will be in the
>>> range that
>>> indicates an external abort. For a data:external-abort
>>> kvm_handle_guest_abort()
>>> will matches this with kvm_vcpu_dabt_isextabt() and makes no further
>>> attempt to
>>> handle the fault.
>
>> The kvm_vcpu_dabt_isextabt() will check the ESR EA bit[9]
>> The EA bit [9] is "IMPLEMENTATION DEFINED", so whether match with the
>> kvm_vcpu_dabt_isextabt() depend on the CPU design.
>> In my platform the EA bit is zero for the RAS Synchronous External Abort,
>> so will not matches the kvm_vcpu_dabt_isextabt(),
>> so the code will continue handle the fault.
>
> This is the problem, you aren't notifying RAS errors to the OS with an
> SEA/synchronous-external-abort.
>
> For APEI firmware-first RAS you must use one of the notification methods
> listed
> in ACPI 6.2s '18.3.2.9 Hardware Error Notification, Table 18-383'.
>
> Notifications using 'SEA' hang on an existing part of the ARM architecture,
> firmware has to emulate what the CPU does to avoid breaking existing code.
>
>
>> EA, bit [9]
>> 	External Abort. As described in the Architecture Reference Manual.
>> Provides an IMPLEMENTATION
>> 	DEFINED classification of the external abort.
>
> This is from 'D7.2.28 - ESR_ELx, Exception Syndrome Register (ELx)' of
> DDI0487B.a. If you don't set that bit, its not an external abort.
> Implementation-defined refers to the classification: your CPU implementation
> is
> allowed to choose what it classifies as 'external'.
> For RAS, your firmware must set this bit if it wants to use SEA as the APEI
> notification type.
  I think this bit are not must set by firmware if the esr_el3 does
not set this bit.

>
> What is the ESR_EL2 value when this happens?
> Do you trap this to firmware first? (its not firmware-first RAS if you
> don't).
>
>
>>> This goes on to call ghes_notify_sea() which will handle the error and
>>> cause KVM
>>> to exit this function. KVM makes no further attempt to handle the fault
>>> as APEI
>>> should have done everything necessary. KVM will re-enter the guest,
>>> unless there
>>> are signals pending.
>
>> In my code, there is not Tyler's such modification.
>
> Tyler's series added support to KVM for handling RAS notifications via SEA.
>
>
>> firstly, the poison error is related with RAS, and poison error match the
>> is_abort_sea(fault_status)
>
> (So it is an external abort? Now I'm confused)
    you may misunderstand that as your another mail mentioned

>
>
>> so the KVM will exit,  and SIGBUS signal have no chance to be delivered.
>
> kvm_handle_guest_abort() will return 1 through handle_exit() and into
> kvm_arch_vcpu_ioctl_run()'s while (ret > 0) { } loop.
>
> KVM should re-enter the guest because the APEI GHES RAS code said it had
> handled
> the error.
yes, it is!  sorry, I  I didn't say it clearly, it should
kvm_handle_guest_abort() exit, not kvm exit.  thank you for your
correction.

>
>
>> I suggest it does not return in Tyler's patch
>
> This is required because the notification about a RAS fault may not be a
> sensible guest fault. In your example its not a sensible guest exit because
> you
> haven't set HPFAR_EL2. KVM can't handle a guest fault without this
> register.
In the firmware-first RAS solution, the HPFAR_EL2 does not set in any
case, becuase it traps to EL3, not EL2.
I think we should use "AT" instruction to translation GVA to IPA
address  and write the translation result to the hpfar_el2 in the
firmware.

>
>
>> or move kvm_send_hwpoison_signal function before return.
>
> Here we are talking about RAS notifications from firmware. (1)
>
> kvm_send_hwpoison_signal() is for handling normal stage2 faults where the
> page
> to map has been poisoned (2). By this point the CPU's RAS features and
> firmware
> are not involved.
>
> These are two different things, a well behaved machine-monitor and guest
> will
> never touch a poisoned page once its received a notification due to (1).
> This
> patch addresses a corner case that covers Qemu/kvmtool not handling (1)
> properly, or the guest ignoring the notification, (and all the races in
> between).
>
>
>>> (You're right that here the fault_ipa is the wrong thing to pass, but
>>> handle_guest_sea() doesn't use it...)
>>>
>>> We do need to enable HCR_EL2.TEA which was added with v8.2s RAS
>>> extensions, but
>>> the cpufeature patch is a pre-requisite.
>
>> HCR_EL2.TEA will Route synchronous External Abort exceptions from
>> Non-secure EL0 and EL1 to EL2, if not routed to EL3.
>>
>> in the firmware-first RAS solution, this bit can not control to route to
>> El2 because the exception is route to EL3.
>>
>> enabling HCR_EL2.TEA can be better for the non firmware-first RAS
>> solution.
>> but can not solve this issue.
>
> Your firmware receives an external abort from the CPU, trapped to EL3 by
> SCR_EL3.EA. It chooses to notify the OS via APEI's SEA notification type.
> How
> does it choose whether to target EL2 or EL1 for the fake software-generated
> SEA-notification? To correctly emulate what the CPU would do, it should
> inspect
> HCR_EL2.TEA, and route the notification accordingly.
  James, you are right, we should set the HCR_EL2.TEA, thanks for your
suggestion. I have submitted a patch for your review.

>
>
>> my firmware will route the injected-External-Aborts to the hypervisor if
>> the
>> error come from guest OS.
>
> How do you know there is a hypervisor at EL2 and guest OS at EL1? Disable
In my previous solution, firmware will check the vttbr_el2, if the
vttbr_el2 is not zero, then firmware jump to El1. otherwise, if the
vttbr_el2 is zero, jump to EL2.  this way may not good,  set the
hcr_el2.TEA is reasonable.

> both
> KVM and VHE in Linux and boot. Now you have host Linux at EL1 and only the
> hyp-stub at EL2.
>
> These are the complications with hanging APEI notification on an existing
> part
> of the architecture.
>
>
>>> The path to the below code that uses fault_ipa->gfn->pfn->hva is:
>>> kvm_handle_guest_abort()
>>> user_mem_abort()
>>> kvm_send_hwpoison_signal().
>>>
>>> But for an external abort due to RAS we never get past
>>> kvm_handle_guest_abort()
>>> as we call out to the APEI ghes to handle the RAS error instead.
>
>> so this patch can not work.
>
> To notify Qemu/kvmtool about a RAS fault? That's not what this patch does.
>
> v3 of this patch has a much better commit message:
> ----%<----
> memory_failure() has two modes, early and late. Early is used by
> machine-managers like Qemu to receive a notification when a memory error is
> notified to the host. These can then be relayed to the guest before the
> affected page is accessed. To enable this, the process must set
> PR_MCE_KILL_EARLY in PR_MCE_KILL_SET using the prctl() syscall.
>
> Once the early notification has been handled, nothing stops the
> machine-manager or guest from accessing the affected page. If the
> machine-manager does this the page will fail to be mapped and SIGBUS will
> be sent. This patch adds the equivalent path for when the guest accesses
> the page, sending SIGBUS to the machine-manager.
>
> These two signals can be distinguished by the machine-manager using their
> si_code: BUS_MCEERR_AO for 'action optional' early notifications, and
> BUS_MCEERR_AR for 'action required' synchronous/late notifications.
> ----%<----
>
> On receipt of a RAS notification, KVM calls out to the APEI code which
> handles
> the error. KVM can now go back to running the guest.
>
> Handling the error may have caused a signal to be sent to Qemu, in which
> case we
> won't get back into the guest, as the signal_pending() test in
> kvm_arch_vcpu_ioctl_run() will cause us to return to Qemu instead. This is
> the
> early path. It happens when we get the RAS notification from firmware.
>
>
> This patch is concerned with the late path. When either Qemu/kvmtool or the
> guest failed to handle the early notification and went and touched the
> affected
> memory again. In this case we don't take a fault via firmware as
> memory_failure() has unmapped the page from stage2, we take a normal stage2
> fault instead. KVM won't re-map the poisoned page, it sends a late
> notification
> instead.
  than you very much for your detailed explaination, got it.

>
>
> Thanks,
>
> James
> _______________________________________________
> kvmarm mailing list
> kvmarm at lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
  2017-06-23 10:18               ` Peter Maydell
@ 2017-06-24 15:07                 ` gengdongjiu
  -1 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-24 15:07 UTC (permalink / raw)
  To: Peter Maydell
  Cc: wuquanming, Punit Agrawal, Marc Zyngier, lishuo1, Achin Gupta,
	Tyler Baicar, gengdongjiu, James Morse, huhuajun, Huangshaoyu,
	wuweiqi, kvmarm, arm-mail-list

Hi Peter,

2017-06-23 18:18 GMT+08:00, Peter Maydell <peter.maydell@linaro.org>:
> On 23 June 2017 at 10:38, James Morse <james.morse@arm.com> wrote:
>> So the v8 ARM-ARM has two ways of indicating an external abort, KVM tests
>> bit 9
>> The v7 ARM-ARm has a '{D,I}FSR.ExT' bit, which I think is equivalent. Its
>> described as:
>>> An implementation can use the DFSR.ExT and IFSR.ExT bits to provide more
>>> information about external aborts.
>
> For instance, the Cortex-A15 TRM documents that it uses the ExT bit
> to distinguish AXI DECERR (decode error, ExT==0) and SLVERR (slave
> error, ExT==1) abort responses.
>
>>> This is what the v8 version must mean with its
>>> External abort type. This bit can provide an IMPLEMENTATION DEFINED
>>> classification of External aborts.
>>
>> Which I read as IMP-DEF classification 'as external', as opposed to your
>> reading
>> as an extra IMP-DEF classification for external aborts.
>
> I think gengdongjiu's reading is correct here: if this
> is an external abort then you can find out more info about
> exactly what kind of EA it is by looking at this bit, but
> you have to look at the FSC bits to be sure it is an EA first.
>
> The Cortex-A53 TRM documents that it uses the same SLVERR-vs-DECERR
> semantics for DFSR.ExT as the A15. It doesn't say so but it wouldn't
> be very surprising if its usage of ESR_ELx.EA was the same.
Arm Cortex-A53 may use this EA bit for AXI DECERR (decode error,
ExT==0) and SLVERR abort.  but not all CPUs have the same semantics.
For example,  Huawei's designed armv8.2 CPU is not that semantics

>
> thanks
> -- PMM
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory
@ 2017-06-24 15:07                 ` gengdongjiu
  0 siblings, 0 replies; 38+ messages in thread
From: gengdongjiu @ 2017-06-24 15:07 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Peter,

2017-06-23 18:18 GMT+08:00, Peter Maydell <peter.maydell@linaro.org>:
> On 23 June 2017 at 10:38, James Morse <james.morse@arm.com> wrote:
>> So the v8 ARM-ARM has two ways of indicating an external abort, KVM tests
>> bit 9
>> The v7 ARM-ARm has a '{D,I}FSR.ExT' bit, which I think is equivalent. Its
>> described as:
>>> An implementation can use the DFSR.ExT and IFSR.ExT bits to provide more
>>> information about external aborts.
>
> For instance, the Cortex-A15 TRM documents that it uses the ExT bit
> to distinguish AXI DECERR (decode error, ExT==0) and SLVERR (slave
> error, ExT==1) abort responses.
>
>>> This is what the v8 version must mean with its
>>> External abort type. This bit can provide an IMPLEMENTATION DEFINED
>>> classification of External aborts.
>>
>> Which I read as IMP-DEF classification 'as external', as opposed to your
>> reading
>> as an extra IMP-DEF classification for external aborts.
>
> I think gengdongjiu's reading is correct here: if this
> is an external abort then you can find out more info about
> exactly what kind of EA it is by looking at this bit, but
> you have to look at the FSC bits to be sure it is an EA first.
>
> The Cortex-A53 TRM documents that it uses the same SLVERR-vs-DECERR
> semantics for DFSR.ExT as the A15. It doesn't say so but it wouldn't
> be very surprising if its usage of ESR_ELx.EA was the same.
Arm Cortex-A53 may use this EA bit for AXI DECERR (decode error,
ExT==0) and SLVERR abort.  but not all CPUs have the same semantics.
For example,  Huawei's designed armv8.2 CPU is not that semantics

>
> thanks
> -- PMM
> _______________________________________________
> kvmarm mailing list
> kvmarm at lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2017-06-24 15:07 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-24 16:32 [PATCH v2] KVM: arm/arm64: Signal SIGBUS when stage2 discovers hwpoison memory James Morse
2017-05-24 16:32 ` James Morse
2017-06-01 22:22 ` Christoffer Dall
2017-06-01 22:22   ` Christoffer Dall
2017-06-02 10:16   ` James Morse
2017-06-02 10:16     ` James Morse
2017-06-02 10:43     ` Christoffer Dall
2017-06-02 10:43       ` Christoffer Dall
2017-06-07  9:41       ` James Morse
2017-06-07  9:41         ` James Morse
2017-06-16 11:32         ` James Morse
2017-06-16 11:32           ` James Morse
2017-06-21  7:42 ` gengdongjiu
2017-06-21  7:42   ` gengdongjiu
2017-06-21  9:53   ` James Morse
2017-06-21  9:53     ` James Morse
2017-06-21 10:59     ` gengdongjiu
2017-06-21 10:59       ` gengdongjiu
2017-06-21 12:44       ` James Morse
2017-06-21 12:44         ` James Morse
2017-06-22  6:47         ` gengdongjiu
2017-06-22  6:47           ` gengdongjiu
2017-06-22 16:39           ` James Morse
2017-06-22 16:39             ` James Morse
2017-06-24 14:56             ` gengdongjiu
2017-06-24 14:56               ` gengdongjiu
2017-06-23  9:38           ` James Morse
2017-06-23  9:38             ` James Morse
2017-06-23 10:18             ` Peter Maydell
2017-06-23 10:18               ` Peter Maydell
2017-06-24 15:07               ` gengdongjiu
2017-06-24 15:07                 ` gengdongjiu
2017-06-24  8:23             ` gengdongjiu
2017-06-24  8:23               ` gengdongjiu
2017-06-22 14:49         ` gengdongjiu
2017-06-22 14:49           ` gengdongjiu
2017-06-21 11:12     ` gengdongjiu
2017-06-21 11:12       ` gengdongjiu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.