From: Santosh Shukla <sashukla@nvidia.com>
To: Marc Zyngier <maz@kernel.org>
Cc: mcrossley@nvidia.com, cjia@nvidia.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, kwankhede@nvidia.com,
	linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.cs.columbia.edu
Subject: Re: [PATCH] KVM: arm64: Correctly handle the mmio faulting
Date: Mon, 26 Oct 2020 10:26:41 +0530
Message-ID: <f56e0d71-ceb5-8ecf-e865-4ee857e333e1@nvidia.com>
In-Reply-To: <0a239ac4481fa01c8d09cf2e56dfdabe@kernel.org>


Hi Marc,

Thanks for the review comment.

On 10/23/2020 4:59 PM, Marc Zyngier wrote:
>
> Hi Santosh,
>
> Thanks for this.
>
> On 2020-10-21 17:16, Santosh Shukla wrote:
>> Commit 6d674e28 introduced a mechanism to detect and handle device
>> mappings: it checks whether the VM_PFNMAP flag is set in vma->flags,
>> and if so marks force_pte as true, so that the THP adjustment
>> (transparent_hugepage_adjust()) is skipped.
>>
>> There is a problem with how the VM_PFNMAP flag is set and checked.
>> Consider a case where the mdev vendor driver registers a vma fault
>> handler named vma_mmio_fault(), which maps the host MMIO region by
>> calling remap_pfn_range() on the vma. remap_pfn_range() implicitly
>> sets the VM_PFNMAP flag in vma->flags.
>>
>> Now assume an MMIO fault handling flow where the guest first accesses
>> an MMIO region whose stage-2 translation is not yet present. This
>> results in the arm64 KVM hypervisor executing the guest abort handler,
>> like below:
>>
>> kvm_handle_guest_abort() -->
>>  user_mem_abort() --> {
>>
>>     ...
>>     0. Checks vma->flags for VM_PFNMAP.
>>     1. Since the VM_PFNMAP flag is not yet set, force_pte is false.
>>     2. gfn_to_pfn_prot() -->
>>         __gfn_to_pfn_memslot() -->
>>             fixup_user_fault() -->
>>                 handle_mm_fault() -->
>>                     __do_fault() -->
>>                        vma_mmio_fault() -->   // vendor's mdev fault handler
>>                         remap_pfn_range() --> // only here is VM_PFNMAP
>>                                               // set in vma->flags
>>     3. Since force_pte was left false in step 1, we go on to execute
>>        transparent_hugepage_adjust(), which leads to the Oops [4].
>>  }
>
> Hmmm. Nice. Any chance you could provide us with an actual reproducer?
>
I tried to create the reproducer scenario with the vfio-pci driver,
using an NVIDIA GPU in passthrough mode. Since the vfio-pci driver
now supports vma faulting (vfio_pci_mmap_fault()), I could build a
crude reproducer with it.

To create the repro, I applied an ugly hack to arm64/kvm/mmu.c. The
hack makes sure that the stage-2 mapping is not created at VM init
time, by unsetting the VM_PFNMAP flag. Unsetting the flag is needed
because vfio-pci's mmap function (vfio_pci_mmap()) sets VM_PFNMAP for
the MMIO region by default, whereas I want remap_pfn_range() to set
the flag via vfio's fault handler, vfio_pci_mmap_fault().
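For reference, the fault handler looks roughly like this (a condensed
sketch of the v5.8-era vfio-pci code, not the literal upstream source;
the vma tracking and locking are omitted here):

/*
 * Condensed sketch of vfio_pci_mmap_fault(): the BAR is only mapped
 * on first access, and only then does remap_pfn_range() set
 * VM_PFNMAP in vma->vm_flags.
 */
static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	/* vm_pgoff was set to the BAR's pfn by vfio_pci_mmap() */
	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			    vma->vm_end - vma->vm_start,
			    vma->vm_page_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}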

So with the above, when the guest accesses the MMIO region, it
triggers the MMIO fault path at the arm64 KVM hypervisor layer,
like below:

user_mem_abort() {
     ...
     --> checks the VM_PFNMAP flag; since it is not set, marks force_pte=false
     ...
     __gfn_to_pfn_memslot() -->
     ...
     handle_mm_fault() -->
     __do_fault() -->
     vfio_pci_mmap_fault() -->
     remap_pfn_range() --> // only now sets the VM_PFNMAP flag
}
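In other words, the problematic ordering inside user_mem_abort() is
(a paraphrased sketch, not the literal upstream code):

	/* 1. vma->vm_flags is sampled before the pfn is resolved; on the
	 *    first fault VM_PFNMAP is not set yet, so force_pte stays false.
	 */
	force_pte = !!(vma->vm_flags & VM_PFNMAP);

	/* 2. Resolving the pfn runs the vma fault handler, which calls
	 *    remap_pfn_range() and only now sets VM_PFNMAP on the vma.
	 */
	pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);

	/* 3. The stale force_pte lets a device pfn reach the THP
	 *    adjustment, which Oopses in kvm_is_transparent_hugepage().
	 */
	if (vma_pagesize == PAGE_SIZE && !force_pte)
		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
							   &pfn, &fault_ipa);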

I have also set force_pte = true in the device path, just to avoid the
THP Oops mentioned in the previous thread.

Hackish change to reproduce the scenario:

--->
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 19aacc7d64de..9ef70dc624cf 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -836,9 +836,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
         }
         if (is_error_noslot_pfn(pfn))
                 return -EFAULT;
-
         if (kvm_is_device_pfn(pfn)) {
                 device = true;
+               force_pte = true;
         } else if (logging_active && !write_fault) {
                 /*
                  * Only actually map the page as writable if this was a write
@@ -1317,6 +1317,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
                 vm_start = max(hva, vma->vm_start);
                 vm_end = min(reg_end, vma->vm_end);

+               /* Hack to make sure the stage-2 mapping is not present,
+                * thus triggering user_mem_abort() for it. */
+               if (vma->vm_flags & VM_PFNMAP) {
+                       vma->vm_flags &= ~VM_PFNMAP;
+               }
                 if (vma->vm_flags & VM_PFNMAP) {
                         gpa_t gpa = mem->guest_phys_addr +
                                     (vm_start - mem->userspace_addr);

>>
>> The proposal is to check the is_iomap flag before executing the THP
>> adjustment function, transparent_hugepage_adjust().
>>
>> [4] THP Oops:
>>> pc: kvm_is_transparent_hugepage+0x18/0xb0
>>> ...
>>> ...
>>> user_mem_abort+0x340/0x9b8
>>> kvm_handle_guest_abort+0x248/0x468
>>> handle_exit+0x150/0x1b0
>>> kvm_arch_vcpu_ioctl_run+0x4d4/0x778
>>> kvm_vcpu_ioctl+0x3c0/0x858
>>> ksys_ioctl+0x84/0xb8
>>> __arm64_sys_ioctl+0x28/0x38
>>
>> Tested on a Huawei Kunpeng Taishan-200 arm64 server, using a VFIO-mdev
>> device.
>> Linux tip: 583090b1
>>
>> Fixes: 6d674e28 ("KVM: arm/arm64: Properly handle faulting of device mappings")
>> Signed-off-by: Santosh Shukla <sashukla@nvidia.com>
>> ---
>>  arch/arm64/kvm/mmu.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 3d26b47..ff15357 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>        * If we are not forced to use page mapping, check if we are
>>        * backed by a THP and thus use block mapping if possible.
>>        */
>> -     if (vma_pagesize == PAGE_SIZE && !force_pte)
>> +     if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
>>               vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>                                                          &pfn, &fault_ipa);
>>       if (writable)
>>       if (writable)
>
> Why don't you directly set force_pte to true at the point where we
> update the flags? It certainly would be a bit more readable:
>
Yes.
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3d26b47a1343..7a4ad984d54e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1920,6 +1920,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>        if (kvm_is_device_pfn(pfn)) {
>                mem_type = PAGE_S2_DEVICE;
>                flags |= KVM_S2PTE_FLAG_IS_IOMAP;
> +               force_pte = true;
>        } else if (logging_active) {
>                /*
>                 * Faults on pages in a memslot with logging enabled
>
> and almost directly applies to what we have queued for 5.10.
>
Right. I believe the above code has changed slightly in linux-next,
as of commit 9695c4ff.

The modified version looks like below:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 19aacc7..d4cd253 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -839,6 +839,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,

         if (kvm_is_device_pfn(pfn)) {
                 device = true;
+               force_pte = true;
         } else if (logging_active && !write_fault) {
                 /*
                  * Only actually map the page as writable if this was a write

Please let me know if the above is okay, and I will send out a v2.

Thanks.

Santosh


> Thanks,
>
>         M.
> -- 
> Jazz is not dead. It just smells funny...


Thread overview:
2020-10-21 16:16 [PATCH] KVM: arm64: Correctly handle the mmio faulting Santosh Shukla
2020-10-23 11:29 ` Marc Zyngier
2020-10-26  4:56   ` Santosh Shukla [this message]
2020-10-26  6:50   ` Santosh Shukla
2021-04-21  2:59 ` Keqian Zhu
2021-04-21  6:20   ` Gavin Shan
2021-04-21  6:17     ` Keqian Zhu
2021-04-21 11:59       ` Marc Zyngier
2021-04-22  2:02         ` Gavin Shan
2021-04-22  6:50           ` Marc Zyngier
2021-04-22  7:36             ` Tarun Gupta (SW-GPU)
2021-04-22  8:00               ` Santosh Shukla
2021-04-23  1:06                 ` Keqian Zhu
2021-04-23  1:38             ` Gavin Shan
