kvmarm.lists.cs.columbia.edu archive mirror
* Re: FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts)
       [not found] <7dd77cea-d673-269a-044f-4df269db7e5e@jonmasters.org>
@ 2019-07-08 11:47 ` Mark Rutland
  2019-07-08 12:16   ` Jon Masters
       [not found] ` <20190708093714.57t55inainky2zcq@shell.armlinux.org.uk>
  1 sibling, 1 reply; 7+ messages in thread
From: Mark Rutland @ 2019-07-08 11:47 UTC (permalink / raw)
  To: Jon Masters; +Cc: kvmarm, linux-arm-kernel

On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
> Hi all,

Hi Jon,

[adding Marc and the kvm-arm list]

> TLDR: We think $subject may be a hardware erratum and we are
> investigating. I was asked to drop a note to share my initial analysis
> in case others have been experiencing similar problems with 32-bit VMs.
> 
> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
> on AArch64 hosts. Under certain conditions, those builders will "pause"
> with the following obscure-looking error message:
> 
> kvm [10652]: load/store instruction decoding not implemented
> 
> (which is caused by a fall-through in io_mem_abort: the code assumes
> that if it can't find a guest memslot for the fault, we must be taking
> an MMIO abort)
> 
> This has been happening on and off for more than a year, tickled further
> by various 32-bit Fedora guest updates, leading to some speculation that
> there was actually a problem with guest toolchains generating
> hard-to-emulate complex load/store instruction sequences not handled in KVM.
> 
> After extensive analysis, I now believe that, on the platform we are
> using in Fedora, a stage 2 fault (e.g. a v8.0 software access bit
> update in the host) taken during a stage 1 guest page table walk
> results in HPFAR_EL2 being truncated to a 32-bit address instead of
> reporting the full 48-bit IPA in use with aarch32 LPAE. I believe this
> is a hardware erratum and have asked the vendor to investigate.
> 
> Meanwhile, I have a /very/ nasty patch that checks the fault conditions
> in kvm_handle_guest_abort and, if they match (S1 PTW, etc.), does a
> software walk through the guest page tables looking for a PTE that
> matches the lower part of the faulting address bits that were reported
> to the host, then re-injects the correct fault. With this patch, the
> test builder stays up, albeit while correcting various faults:
> 
> [  143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
> [  143.748447] JCM: Hyper faulting far: 0x3deb0000
> [  143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
> [  143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
> [  143.938649] JCM: Guest PGD address: 0x5b06cc50
> [  143.991962] JCM: Guest PGD: 0x5b150003
> [  144.036925] JCM: Guest PMD address: 0x5b150db0
> [  144.090238] JCM: Guest PMD: 0x43deb0003
> [  144.136241] JCM: Guest PTE address: 0x43deb0e70
> [  144.190604] JCM: Guest PTE: 0x42000043bb72fdf
> [  144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
> [  144.314972] JCM: Faulting IPA page: 0x3deb0000
> [  144.368286] JCM: Faulting PTE page: 0x43deb0000
> [  144.422641] JCM: Fault occurred while performing S1 PTW -fixing
> [  144.493684] JCM: corrected fault_ipa: 0x43deb0000
> [  144.550133] JCM: Corrected gfn: 0x43deb
> [  144.596145] JCM: handle user_mem_abort
> [  144.641155] JCM: ret: 0x1

When the conditions are met, does the issue continue to trigger
reliably?

e.g. if you return to the guest without fixing the fault, do you always
see the truncation when taking the fault again?

If you try the translation with an AT, does that work as expected? We've
had to use that elsewhere; see __populate_fault_info() in
arch/arm64/kvm/hyp/switch.c.

Thanks,
Mark.

> 
> Eventually, we might be looking to upstream something once we can figure
> out a nice way to generically walk guest page tables and fix up the
> fault (there's no alternatives framework in virt/kvm/arm/mmu.c yet,
> etc.). I'll ask that the vendor take a look at this if we confirm a
> problem exists.
> 
> We'll follow up. I wanted to let folks know we are working on it, since
> some of you had feared there was something worse going on with
> load/store instruction generation in recent 32-bit guests. That does
> not appear to be the case.
> 
> Jon.
> 
> P.S. A full writeup of what we are seeing and linked bugzilla with debug
> only quick hack patch is here:
> https://medium.com/@jonmasters_84473/debugging-a-32-bit-fedora-arm-builder-issue-73295d7d673d
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts)
  2019-07-08 11:47 ` FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts) Mark Rutland
@ 2019-07-08 12:16   ` Jon Masters
  2019-07-08 13:04     ` Marc Zyngier
  2019-07-08 13:09     ` Mark Rutland
  0 siblings, 2 replies; 7+ messages in thread
From: Jon Masters @ 2019-07-08 12:16 UTC (permalink / raw)
  To: Mark Rutland; +Cc: kvmarm, linux-arm-kernel

Hi Mark,

Thanks for adding the CCs. See below for more.

On 7/8/19 7:47 AM, Mark Rutland wrote:
> On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
>> Hi all,
> 
> Hi Jon,
> 
> [adding Marc and the kvm-arm list]
> 
>> TLDR: We think $subject may be a hardware erratum and we are
>> investigating. I was asked to drop a note to share my initial analysis
>> in case others have been experiencing similar problems with 32-bit VMs.
>>
>> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
>> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
>> on AArch64 hosts. Under certain conditions, those builders will "pause"
>> with the following obscure looking error message:
>>
>> kvm [10652]: load/store instruction decoding not implemented
>>
>> (which is caused by a fall-through in io_mem_abort, the code assumes
>> that if we couldn't find the guest memslot we're taking an IO abort)
>>
>> This has been happening on and off for more than a year, tickled further
>> by various 32-bit Fedora guest updates, leading to some speculation that
>> there was actually a problem with guest toolchains generating
>> hard-to-emulate complex load/store instruction sequences not handled in KVM.
>>
>> After extensive analysis, I believe instead that it appears on the
>> platform we are using in Fedora that a stage 2 fault (e.g. v8.0 software
>> access bit update in the host) taken during stage 1 guest page table
>> walk will result in an HPFAR_EL2 truncation to a 32-bit address instead
>> of the full 48-bit IPA in use due to aarch32 LPAE. I believe that this
>> is a hardware errata and have requested that the vendor investigate.
>>
>> Meanwhile, I have a /very/ nasty patch that checks the fault conditions
>> in kvm_handle_guest_abort and if they match (S1 PTW, etc.), does a
>> software walk through the guest page tables looking for a PTE that
>> matches with the lower part of the faulting address bits we did get
>> reported to the host, then re-injects the correct fault. With this
>> patch, the test builder stays up, albeit correcting various faults:
>>
>> [  143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
>> [  143.748447] JCM: Hyper faulting far: 0x3deb0000
>> [  143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
>> [  143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
>> [  143.938649] JCM: Guest PGD address: 0x5b06cc50
>> [  143.991962] JCM: Guest PGD: 0x5b150003
>> [  144.036925] JCM: Guest PMD address: 0x5b150db0
>> [  144.090238] JCM: Guest PMD: 0x43deb0003
>> [  144.136241] JCM: Guest PTE address: 0x43deb0e70
>> [  144.190604] JCM: Guest PTE: 0x42000043bb72fdf
>> [  144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
>> [  144.314972] JCM: Faulting IPA page: 0x3deb0000
>> [  144.368286] JCM: Faulting PTE page: 0x43deb0000
>> [  144.422641] JCM: Fault occurred while performing S1 PTW -fixing
>> [  144.493684] JCM: corrected fault_ipa: 0x43deb0000
>> [  144.550133] JCM: Corrected gfn: 0x43deb
>> [  144.596145] JCM: handle user_mem_abort
>> [  144.641155] JCM: ret: 0x1
> 
> When the conditions are met, does the issue continue to trigger
> reliably?

Yeah, but only for certain faults - it seems to be specific to stage 1
page table walks that cause a trap to stage 2.

> e.g. if you return to the guest without fixing the fault, do you always
> see the truncation when taking the fault again?

I believe so, but I need to specifically check that.

> If you try the translation with an AT, does that work as expected? We've
> had to use that elsewhere; see __populate_fault_info() in
> arch/arm64/kvm/hyp/switch.c.

Yea, I've seen that code for the other errata :) The problem is that
the virtual address in the FAR is different from the one we ultimately
have a PA translation for. We take the fault when the hardware walker
tries to load (e.g.) the leaf PTE during the translation of the VA. So
the PTE itself is what we are trying to load, not the PA of the VA that
the guest userspace/kernel tried to access. Hence an AT won't work,
unless I'm missing something. My first thought had been to do that.

Many thanks,

Jon.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts)
       [not found] ` <20190708093714.57t55inainky2zcq@shell.armlinux.org.uk>
@ 2019-07-08 12:40   ` Jon Masters
  0 siblings, 0 replies; 7+ messages in thread
From: Jon Masters @ 2019-07-08 12:40 UTC (permalink / raw)
  To: Russell King - ARM Linux admin; +Cc: kvmarm, linux-arm-kernel

On 7/8/19 5:37 AM, Russell King - ARM Linux admin wrote:
> On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
>> Hi all,
>>
>> TLDR: We think $subject may be a hardware erratum and we are
>> investigating. I was asked to drop a note to share my initial analysis
>> in case others have been experiencing similar problems with 32-bit VMs.
>>
>> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
>> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
>> on AArch64 hosts. Under certain conditions, those builders will "pause"
>> with the following obscure looking error message:
>>
>> kvm [10652]: load/store instruction decoding not implemented
> 
> Out of interest, because I'm running a number of 32-bit VMs on the
> Macchiatobin board, using a different 64-bit distro...
> 
> How often do these errors occur?  Have you been able to pinpoint any
> particular CPU core?  Does the workload in the VMs have any effect?
> What about the workload in the host?

It's a specific CPU core (not a Cortex design), running a 32-bit LPAE
kernel (it needs to be LPAE to have an IPA wider than 32 bits). In the
course of a weekend running stress tests, my test kernel fixed up
hundreds of faults that would otherwise have taken the guest system
down.

Specifically, PGDs are allocated from a cache located in low memory (so
we never hit this condition for those), but PTEs are allocated using:

	alloc_pages(PGALLOC_GFP | __GFP_HIGHMEM, 0);

So at some point, we'll allocate a PTE page from above the 32-bit
boundary. When we later take a fault on one of those during a stage 1
walk, we hit the problem.

My guess is that the host runs its clock (page aging) algorithm,
clearing access bits in the stage 2 tables to look for recent accesses,
and since ARMv8.0 handles access bit updates with a software trap, we'll
trap to stage 2 during the guest's stage 1 walk the next time around. So
simply pinning the guest memory isn't going to be sufficient to prevent
this if that memory is allocated normally, with the host doing software
LRU.

But the above is just my best guess at the likely cause.

Jon.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts)
  2019-07-08 12:16   ` Jon Masters
@ 2019-07-08 13:04     ` Marc Zyngier
  2019-07-08 13:14       ` Jon Masters
  2019-07-08 13:09     ` Mark Rutland
  1 sibling, 1 reply; 7+ messages in thread
From: Marc Zyngier @ 2019-07-08 13:04 UTC (permalink / raw)
  To: Jon Masters, Mark Rutland; +Cc: kvmarm, linux-arm-kernel

[Adding myself to the cc-list, for real this time! ;-)]

On 08/07/2019 13:16, Jon Masters wrote:
> Hi Mark,
> 
> Thanks for adding the CCs. See below for more.
> 
> On 7/8/19 7:47 AM, Mark Rutland wrote:
>> On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
>>> Hi all,
>>
>> Hi Jon,
>>
>> [adding Marc and the kvm-arm list]
>>
>>> TLDR: We think $subject may be a hardware erratum and we are
>>> investigating. I was asked to drop a note to share my initial analysis
>>> in case others have been experiencing similar problems with 32-bit VMs.
>>>
>>> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
>>> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
>>> on AArch64 hosts. Under certain conditions, those builders will "pause"
>>> with the following obscure looking error message:
>>>
>>> kvm [10652]: load/store instruction decoding not implemented
>>>
>>> (which is caused by a fall-through in io_mem_abort, the code assumes
>>> that if we couldn't find the guest memslot we're taking an IO abort)
>>>
>>> This has been happening on and off for more than a year, tickled further
>>> by various 32-bit Fedora guest updates, leading to some speculation that
>>> there was actually a problem with guest toolchains generating
>>> hard-to-emulate complex load/store instruction sequences not handled in KVM.
>>>
>>> After extensive analysis, I believe instead that it appears on the
>>> platform we are using in Fedora that a stage 2 fault (e.g. v8.0 software
>>> access bit update in the host) taken during stage 1 guest page table
>>> walk will result in an HPFAR_EL2 truncation to a 32-bit address instead
>>> of the full 48-bit IPA in use due to aarch32 LPAE. I believe that this
>>> is a hardware errata and have requested that the vendor investigate.
>>>
>>> Meanwhile, I have a /very/ nasty patch that checks the fault conditions
>>> in kvm_handle_guest_abort and if they match (S1 PTW, etc.), does a
>>> software walk through the guest page tables looking for a PTE that
>>> matches with the lower part of the faulting address bits we did get
>>> reported to the host, then re-injects the correct fault. With this
>>> patch, the test builder stays up, albeit correcting various faults:
>>>
>>> [  143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
>>> [  143.748447] JCM: Hyper faulting far: 0x3deb0000
>>> [  143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
>>> [  143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
>>> [  143.938649] JCM: Guest PGD address: 0x5b06cc50
>>> [  143.991962] JCM: Guest PGD: 0x5b150003
>>> [  144.036925] JCM: Guest PMD address: 0x5b150db0
>>> [  144.090238] JCM: Guest PMD: 0x43deb0003
>>> [  144.136241] JCM: Guest PTE address: 0x43deb0e70
>>> [  144.190604] JCM: Guest PTE: 0x42000043bb72fdf
>>> [  144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
>>> [  144.314972] JCM: Faulting IPA page: 0x3deb0000
>>> [  144.368286] JCM: Faulting PTE page: 0x43deb0000
>>> [  144.422641] JCM: Fault occurred while performing S1 PTW -fixing
>>> [  144.493684] JCM: corrected fault_ipa: 0x43deb0000
>>> [  144.550133] JCM: Corrected gfn: 0x43deb
>>> [  144.596145] JCM: handle user_mem_abort
>>> [  144.641155] JCM: ret: 0x1
>>
>> When the conditions are met, does the issue continue to trigger
>> reliably?
> 
> Yeah. But only for certain faults - seems to be specifically for stage 1
> page table walks that cause a trap to stage 2.

Do we know for sure this is limited to the guest using LPAE? I
appreciate that this is the configuration you're running in, but it
would be an interesting data point to work out what is happening with
short descriptors.

> 
>> e.g. if you return to the guest without fixing the fault, do you always
>> see the truncation when taking the fault again?
> 
> I believe so, but I need to specifically check that.
> 
>> If you try the translation with an AT, does that work as expected? We've
>> had to use that elsewhere; see __populate_fault_info() in
>> arch/arm64/kvm/hyp/switch.c.
> 
> Yea, I've seen that code for the other errata :) The problem is the
> virtual address in the FAR is different from the one we ultimately have
> a PA translation for. We take a fault when the hardware walker tries to
> perform a load to (e.g.) the PTE leaf during the translation of the VA.
> So the PTE itself is what we are trying to load, not the PA of the VA
> that the guest userspace/kernel tried to load. Hence an AT won't work,
> unless I'm missing something. My first thought had been to do that.

Ah, that's the bit I was missing: S1PTW not completing in S2 because of
the access flag. Duh.

Random idea: an option (although not a desirable one) would be to change
the way we handle page aging on the host by forcing an unmap at S2
instead of twiddling the access flag. Does this change anything on your
system?

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts)
  2019-07-08 12:16   ` Jon Masters
  2019-07-08 13:04     ` Marc Zyngier
@ 2019-07-08 13:09     ` Mark Rutland
  2019-07-08 13:17       ` Jon Masters
  1 sibling, 1 reply; 7+ messages in thread
From: Mark Rutland @ 2019-07-08 13:09 UTC (permalink / raw)
  To: Jon Masters; +Cc: marc.zyngier, kvmarm, linux-arm-kernel

[Adding Marc for real this time]

On Mon, Jul 08, 2019 at 08:16:25AM -0400, Jon Masters wrote:
> On 7/8/19 7:47 AM, Mark Rutland wrote:
> > On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
> >> TLDR: We think $subject may be a hardware erratum and we are
> >> investigating. I was asked to drop a note to share my initial analysis
> >> in case others have been experiencing similar problems with 32-bit VMs.
> >>
> >> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
> >> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
> >> on AArch64 hosts. Under certain conditions, those builders will "pause"
> >> with the following obscure looking error message:
> >>
> >> kvm [10652]: load/store instruction decoding not implemented
> >>
> >> (which is caused by a fall-through in io_mem_abort, the code assumes
> >> that if we couldn't find the guest memslot we're taking an IO abort)
> >>
> >> This has been happening on and off for more than a year, tickled further
> >> by various 32-bit Fedora guest updates, leading to some speculation that
> >> there was actually a problem with guest toolchains generating
> >> hard-to-emulate complex load/store instruction sequences not handled in KVM.
> >>
> >> After extensive analysis, I believe instead that it appears on the
> >> platform we are using in Fedora that a stage 2 fault (e.g. v8.0 software
> >> access bit update in the host) taken during stage 1 guest page table
> >> walk will result in an HPFAR_EL2 truncation to a 32-bit address instead
> >> of the full 48-bit IPA in use due to aarch32 LPAE. I believe that this
> >> is a hardware errata and have requested that the vendor investigate.
> >>
> >> Meanwhile, I have a /very/ nasty patch that checks the fault conditions
> >> in kvm_handle_guest_abort and if they match (S1 PTW, etc.), does a
> >> software walk through the guest page tables looking for a PTE that
> >> matches with the lower part of the faulting address bits we did get
> >> reported to the host, then re-injects the correct fault. With this
> >> patch, the test builder stays up, albeit correcting various faults:
> >>
> >> [  143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
> >> [  143.748447] JCM: Hyper faulting far: 0x3deb0000
> >> [  143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
> >> [  143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
> >> [  143.938649] JCM: Guest PGD address: 0x5b06cc50
> >> [  143.991962] JCM: Guest PGD: 0x5b150003
> >> [  144.036925] JCM: Guest PMD address: 0x5b150db0
> >> [  144.090238] JCM: Guest PMD: 0x43deb0003
> >> [  144.136241] JCM: Guest PTE address: 0x43deb0e70
> >> [  144.190604] JCM: Guest PTE: 0x42000043bb72fdf
> >> [  144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
> >> [  144.314972] JCM: Faulting IPA page: 0x3deb0000
> >> [  144.368286] JCM: Faulting PTE page: 0x43deb0000
> >> [  144.422641] JCM: Fault occurred while performing S1 PTW -fixing
> >> [  144.493684] JCM: corrected fault_ipa: 0x43deb0000
> >> [  144.550133] JCM: Corrected gfn: 0x43deb
> >> [  144.596145] JCM: handle user_mem_abort
> >> [  144.641155] JCM: ret: 0x1
> > 
> > When the conditions are met, does the issue continue to trigger
> > reliably?
> 
> Yeah. But only for certain faults - seems to be specifically for stage 1
> page table walks that cause a trap to stage 2.

Ok. It sounds like we could write a small guest to trigger that
deliberately with some pre-allocated page tables placed above a 4GiB
IPA.

> > e.g. if you return to the guest without fixing the fault, do you always
> > see the truncation when taking the fault again?
> 
> I believe so, but I need to specifically check that.
> 
> > If you try the translation with an AT, does that work as expected? We've
> > had to use that elsewhere; see __populate_fault_info() in
> > arch/arm64/kvm/hyp/switch.c.
> 
> Yea, I've seen that code for the other errata :) The problem is the
> virtual address in the FAR is different from the one we ultimately have
> a PA translation for. We take a fault when the hardware walker tries to
> perform a load to (e.g.) the PTE leaf during the translation of the VA.
> So the PTE itself is what we are trying to load, not the PA of the VA
> that the guest userspace/kernel tried to load. Hence an AT won't work,
> unless I'm missing something. My first thought had been to do that.

My bad; I thought a failed AT reported the relevant IPA when it failed
as a result of a stage-2 fault, but I see now that it does not.

I don't think that we can reliably walk the guest's Stage-1 tables
without trapping TLB invalidations (and/or stopping all vCPUs), so
that's rather unfortunate.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts)
  2019-07-08 13:04     ` Marc Zyngier
@ 2019-07-08 13:14       ` Jon Masters
  0 siblings, 0 replies; 7+ messages in thread
From: Jon Masters @ 2019-07-08 13:14 UTC (permalink / raw)
  To: Marc Zyngier, Mark Rutland; +Cc: kvmarm, linux-arm-kernel

On 7/8/19 9:04 AM, Marc Zyngier wrote:

> [Adding myself to the cc-list, for real this time! ;-)]

Hehe

> On 08/07/2019 13:16, Jon Masters wrote:
>> Hi Mark,
>>
>> Thanks for adding the CCs. See below for more.
>>
>> On 7/8/19 7:47 AM, Mark Rutland wrote:
>>> On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
>>>> Hi all,
>>>
>>> Hi Jon,
>>>
>>> [adding Marc and the kvm-arm list]
>>>
>>>> TLDR: We think $subject may be a hardware erratum and we are
>>>> investigating. I was asked to drop a note to share my initial analysis
>>>> in case others have been experiencing similar problems with 32-bit VMs.
>>>>
>>>> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
>>>> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
>>>> on AArch64 hosts. Under certain conditions, those builders will "pause"
>>>> with the following obscure looking error message:
>>>>
>>>> kvm [10652]: load/store instruction decoding not implemented
>>>>
>>>> (which is caused by a fall-through in io_mem_abort, the code assumes
>>>> that if we couldn't find the guest memslot we're taking an IO abort)
>>>>
>>>> This has been happening on and off for more than a year, tickled further
>>>> by various 32-bit Fedora guest updates, leading to some speculation that
>>>> there was actually a problem with guest toolchains generating
>>>> hard-to-emulate complex load/store instruction sequences not handled in KVM.
>>>>
>>>> After extensive analysis, I believe instead that it appears on the
>>>> platform we are using in Fedora that a stage 2 fault (e.g. v8.0 software
>>>> access bit update in the host) taken during stage 1 guest page table
>>>> walk will result in an HPFAR_EL2 truncation to a 32-bit address instead
>>>> of the full 48-bit IPA in use due to aarch32 LPAE. I believe that this
>>>> is a hardware errata and have requested that the vendor investigate.
>>>>
>>>> Meanwhile, I have a /very/ nasty patch that checks the fault conditions
>>>> in kvm_handle_guest_abort and if they match (S1 PTW, etc.), does a
>>>> software walk through the guest page tables looking for a PTE that
>>>> matches with the lower part of the faulting address bits we did get
>>>> reported to the host, then re-injects the correct fault. With this
>>>> patch, the test builder stays up, albeit correcting various faults:
>>>>
>>>> [  143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
>>>> [  143.748447] JCM: Hyper faulting far: 0x3deb0000
>>>> [  143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
>>>> [  143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
>>>> [  143.938649] JCM: Guest PGD address: 0x5b06cc50
>>>> [  143.991962] JCM: Guest PGD: 0x5b150003
>>>> [  144.036925] JCM: Guest PMD address: 0x5b150db0
>>>> [  144.090238] JCM: Guest PMD: 0x43deb0003
>>>> [  144.136241] JCM: Guest PTE address: 0x43deb0e70
>>>> [  144.190604] JCM: Guest PTE: 0x42000043bb72fdf
>>>> [  144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
>>>> [  144.314972] JCM: Faulting IPA page: 0x3deb0000
>>>> [  144.368286] JCM: Faulting PTE page: 0x43deb0000
>>>> [  144.422641] JCM: Fault occurred while performing S1 PTW -fixing
>>>> [  144.493684] JCM: corrected fault_ipa: 0x43deb0000
>>>> [  144.550133] JCM: Corrected gfn: 0x43deb
>>>> [  144.596145] JCM: handle user_mem_abort
>>>> [  144.641155] JCM: ret: 0x1
>>>
>>> When the conditions are met, does the issue continue to trigger
>>> reliably?
>>
>> Yeah. But only for certain faults - seems to be specifically for stage 1
>> page table walks that cause a trap to stage 2.
> 
> Do we know for sure this is limited to the guest using LPAE? I
> appreciate that this is the configuration you're running in, but it
> would be an interesting data point to work out what is happening with
> short descriptors.

It appears to be so, yea. I believe it's a truncation erratum on this
platform (X-Gene1), and the vendor is currently investigating for us.

>>
>>> e.g. if you return to the guest without fixing the fault, do you always
>>> see the truncation when taking the fault again?
>>
>> I believe so, but I need to specifically check that.
>>
>>> If you try the translation with an AT, does that work as expected? We've
>>> had to use that elsewhere; see __populate_fault_info() in
>>> arch/arm64/kvm/hyp/switch.c.
>>
>> Yea, I've seen that code for the other errata :) The problem is the
>> virtual address in the FAR is different from the one we ultimately have
>> a PA translation for. We take a fault when the hardware walker tries to
>> perform a load to (e.g.) the PTE leaf during the translation of the VA.
>> So the PTE itself is what we are trying to load, not the PA of the VA
>> that the guest userspace/kernel tried to load. Hence an AT won't work,
>> unless I'm missing something. My first thought had been to do that.
> 
> Ah, that's the bit I was missing: S1PTW not completing in S2 because of
> the access flag. Duh.

:)

> Random idea: an option (although not a desirable one) would be to change
> the way we handle page aging on the host by forcing an unmap at S2
> instead of twiddling the access flag. Does this change anything on your
> system?

Well now, that's an interesting idea. If I get chance today, I'll try.

(I'm kinda getting close to punting this to the vendor...it resulted in
a couple of all-nighters last week trying to figure out what the heck
was going on. BUT I do want to work out what an actually upstreamable
quirk could look like, since Fedora is relying on this hardware)

Jon.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts)
  2019-07-08 13:09     ` Mark Rutland
@ 2019-07-08 13:17       ` Jon Masters
  0 siblings, 0 replies; 7+ messages in thread
From: Jon Masters @ 2019-07-08 13:17 UTC (permalink / raw)
  To: Mark Rutland; +Cc: marc.zyngier, kvmarm, linux-arm-kernel

On 7/8/19 9:09 AM, Mark Rutland wrote:
> [Adding Marc for real this time]
> 
> On Mon, Jul 08, 2019 at 08:16:25AM -0400, Jon Masters wrote:
>> On 7/8/19 7:47 AM, Mark Rutland wrote:
>>> On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
>>>> TLDR: We think $subject may be a hardware erratum and we are
>>>> investigating. I was asked to drop a note to share my initial analysis
>>>> in case others have been experiencing similar problems with 32-bit VMs.
>>>>
>>>> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
>>>> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
>>>> on AArch64 hosts. Under certain conditions, those builders will "pause"
>>>> with the following obscure looking error message:
>>>>
>>>> kvm [10652]: load/store instruction decoding not implemented
>>>>
>>>> (which is caused by a fall-through in io_mem_abort, the code assumes
>>>> that if we couldn't find the guest memslot we're taking an IO abort)
>>>>
>>>> This has been happening on and off for more than a year, tickled further
>>>> by various 32-bit Fedora guest updates, leading to some speculation that
>>>> there was actually a problem with guest toolchains generating
>>>> hard-to-emulate complex load/store instruction sequences not handled in KVM.
>>>>
>>>> After extensive analysis, I believe instead that it appears on the
>>>> platform we are using in Fedora that a stage 2 fault (e.g. v8.0 software
>>>> access bit update in the host) taken during stage 1 guest page table
>>>> walk will result in an HPFAR_EL2 truncation to a 32-bit address instead
>>>> of the full 48-bit IPA in use due to aarch32 LPAE. I believe that this
>>>> is a hardware errata and have requested that the vendor investigate.
>>>>
>>>> Meanwhile, I have a /very/ nasty patch that checks the fault conditions
>>>> in kvm_handle_guest_abort and if they match (S1 PTW, etc.), does a
>>>> software walk through the guest page tables looking for a PTE that
>>>> matches with the lower part of the faulting address bits we did get
>>>> reported to the host, then re-injects the correct fault. With this
>>>> patch, the test builder stays up, albeit correcting various faults:
>>>>
>>>> [  143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
>>>> [  143.748447] JCM: Hyper faulting far: 0x3deb0000
>>>> [  143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
>>>> [  143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
>>>> [  143.938649] JCM: Guest PGD address: 0x5b06cc50
>>>> [  143.991962] JCM: Guest PGD: 0x5b150003
>>>> [  144.036925] JCM: Guest PMD address: 0x5b150db0
>>>> [  144.090238] JCM: Guest PMD: 0x43deb0003
>>>> [  144.136241] JCM: Guest PTE address: 0x43deb0e70
>>>> [  144.190604] JCM: Guest PTE: 0x42000043bb72fdf
>>>> [  144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
>>>> [  144.314972] JCM: Faulting IPA page: 0x3deb0000
>>>> [  144.368286] JCM: Faulting PTE page: 0x43deb0000
>>>> [  144.422641] JCM: Fault occurred while performing S1 PTW -fixing
>>>> [  144.493684] JCM: corrected fault_ipa: 0x43deb0000
>>>> [  144.550133] JCM: Corrected gfn: 0x43deb
>>>> [  144.596145] JCM: handle user_mem_abort
>>>> [  144.641155] JCM: ret: 0x1
>>>
>>> When the conditions are met, does the issue continue to trigger
>>> reliably?
>>
>> Yeah. But only for certain faults - seems to be specifically for stage 1
>> page table walks that cause a trap to stage 2.
> 
> Ok. It sounds like we could write a small guest to trigger that
> deliberately with some pre-allocated page tables placed above a 4GiB
> IPA.

Yea, indeed. It's funny what you realize as you're writing emails about
it - I was thinking that earlier :) OK, that sounds like fun.

>>> e.g. if you return to the guest without fixing the fault, do you always
>>> see the truncation when taking the fault again?
>>
>> I believe so, but I need to specifically check that.
>>
>>> If you try the translation with an AT, does that work as expected? We've
>>> had to use that elsewhere; see __populate_fault_info() in
>>> arch/arm64/kvm/hyp/switch.c.
>>
>> Yea, I've seen that code for the other errata :) The problem is the
>> virtual address in the FAR is different from the one we ultimately have
>> a PA translation for. We take a fault when the hardware walker tries to
>> perform a load to (e.g.) the PTE leaf during the translation of the VA.
>> So the PTE itself is what we are trying to load, not the PA of the VA
>> that the guest userspace/kernel tried to load. Hence an AT won't work,
>> unless I'm missing something. My first thought had been to do that.
> 
> My bad; I thought a failed AT reported the relevant IPA when it failed
> as a result of a stage-2 fault, but I see now that it does not.

Random aside - it would be great if there were an AT variant that did :)

> I don't think that we can reliably walk the guest's Stage-1 tables
> without trapping TLB invalidations (and/or stopping all vCPUs), so
> that's rather unfortunate.

Indeed. In the Fedora case, it's only a single vCPU in each guest, so
they effectively already do that (hence my test hack "works"), but
that's another thing that would need to be handled for a real fix.

Jon.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-07-08 13:18 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <7dd77cea-d673-269a-044f-4df269db7e5e@jonmasters.org>
2019-07-08 11:47 ` FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts) Mark Rutland
2019-07-08 12:16   ` Jon Masters
2019-07-08 13:04     ` Marc Zyngier
2019-07-08 13:14       ` Jon Masters
2019-07-08 13:09     ` Mark Rutland
2019-07-08 13:17       ` Jon Masters
     [not found] ` <20190708093714.57t55inainky2zcq@shell.armlinux.org.uk>
2019-07-08 12:40   ` Jon Masters

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).