From: Santosh Shukla <sashukla@nvidia.com>
To: Marc Zyngier <maz@kernel.org>
Cc: mcrossley@nvidia.com, cjia@nvidia.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, kwankhede@nvidia.com,
	linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.cs.columbia.edu
Subject: Re: [PATCH] KVM: arm64: Correctly handle the mmio faulting
Date: Mon, 26 Oct 2020 10:26:41 +0530
Message-ID: <f56e0d71-ceb5-8ecf-e865-4ee857e333e1@nvidia.com>
In-Reply-To: <0a239ac4481fa01c8d09cf2e56dfdabe@kernel.org>


Hi Marc,

Thanks for the review comment.

On 10/23/2020 4:59 PM, Marc Zyngier wrote:
>
> Hi Santosh,
>
> Thanks for this.
>
> On 2020-10-21 17:16, Santosh Shukla wrote:
>> Commit 6d674e28 introduced a mechanism to detect and handle device
>> mappings: it checks whether the VM_PFNMAP flag is set in vma->flags,
>> and if so marks force_pte as true, so that the THP adjustment
>> (transparent_hugepage_adjust()) is skipped.
>>
>> There is a problem with how the VM_PFNMAP flag is set and checked.
>> Consider a case where the mdev vendor driver registers a vma fault
>> handler named vma_mmio_fault(), which maps the host MMIO region by
>> calling remap_pfn_range() on the vma. remap_pfn_range() implicitly
>> sets the VM_PFNMAP flag in vma->flags.
>>
>> Now assume an MMIO fault handling flow where the guest first accesses
>> an MMIO region whose stage-2 translation is not yet present. This
>> results in the arm64 KVM hypervisor executing the guest abort handler,
>> like below:
>>
>> kvm_handle_guest_abort() -->
>>  user_mem_abort() --> {
>>
>>     ...
>>     0. Checks vma->flags for VM_PFNMAP.
>>     1. Since the VM_PFNMAP flag is not yet set, force_pte is false.
>>     2. gfn_to_pfn_prot() -->
>>         __gfn_to_pfn_memslot() -->
>>             fixup_user_fault() -->
>>                 handle_mm_fault() -->
>>                     __do_fault() -->
>>                        vma_mmio_fault() -->   // vendor's mdev fault handler
>>                         remap_pfn_range() --> // only here is VM_PFNMAP
>>                                               // set in vma->flags
>>     3. Since force_pte was left false in step 1, we go on to execute
>>        transparent_hugepage_adjust(), which leads to the Oops [4].
>>  }
>
> Hmmm. Nice. Any chance you could provide us with an actual reproducer?
>
I tried to create the reproducer scenario with the vfio-pci driver,
using an NVIDIA GPU in passthrough mode. Since the vfio-pci driver
now supports vma faulting (vfio_pci_mmap_fault()), I could build a
crude reproducer with it.

To create the repro, I applied an ugly hack to arm64/kvm/mmu.c. The
hack makes sure that the stage-2 mapping is not created at VM init
time, by unsetting the VM_PFNMAP flag. Unsetting the flag is needed
because vfio-pci's mmap function (vfio_pci_mmap()) sets VM_PFNMAP for
the MMIO region by default, whereas I want remap_pfn_range() to set
the flag via vfio's fault handler, vfio_pci_mmap_fault().
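For reference, the fault handler looks roughly like this (a condensed
sketch of the v5.8-era vfio-pci code, not the literal upstream source;
the vma tracking and locking are omitted here):

/*
 * Condensed sketch of vfio_pci_mmap_fault(): the BAR is only mapped
 * on first access, and only then does remap_pfn_range() set
 * VM_PFNMAP in vma->vm_flags.
 */
static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;

	/* vm_pgoff was set to the BAR's pfn by vfio_pci_mmap() */
	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
			    vma->vm_end - vma->vm_start,
			    vma->vm_page_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}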

So with the above, when the guest accesses the MMIO region, it
triggers the MMIO fault path at the arm64 KVM hypervisor layer,
like below:

user_mem_abort() {
     ...
     --> checks the VM_PFNMAP flag; since it is not set, marks force_pte=false
     ...
     __gfn_to_pfn_memslot() -->
     ...
     handle_mm_fault() -->
     __do_fault() -->
     vfio_pci_mmap_fault() -->
     remap_pfn_range() --> // only now sets the VM_PFNMAP flag
}
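In other words, the problematic ordering inside user_mem_abort() is
(a paraphrased sketch, not the literal upstream code):

	/* 1. vma->vm_flags is sampled before the pfn is resolved; on the
	 *    first fault VM_PFNMAP is not set yet, so force_pte stays false.
	 */
	force_pte = !!(vma->vm_flags & VM_PFNMAP);

	/* 2. Resolving the pfn runs the vma fault handler, which calls
	 *    remap_pfn_range() and only now sets VM_PFNMAP on the vma.
	 */
	pfn = gfn_to_pfn_prot(vcpu->kvm, gfn, write_fault, &writable);

	/* 3. The stale force_pte lets a device pfn reach the THP
	 *    adjustment, which Oopses in kvm_is_transparent_hugepage().
	 */
	if (vma_pagesize == PAGE_SIZE && !force_pte)
		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
							   &pfn, &fault_ipa);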

I have also set force_pte = true in the device path, just to avoid the
THP Oops mentioned in the previous thread.

Hackish change to reproduce the scenario:

--->
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 19aacc7d64de..9ef70dc624cf 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -836,9 +836,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
         }
         if (is_error_noslot_pfn(pfn))
                 return -EFAULT;
-
         if (kvm_is_device_pfn(pfn)) {
                 device = true;
+               force_pte = true;
         } else if (logging_active && !write_fault) {
                 /*
                  * Only actually map the page as writable if this was a write
@@ -1317,6 +1317,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
                 vm_start = max(hva, vma->vm_start);
                 vm_end = min(reg_end, vma->vm_end);

+               /* Hack to make sure the stage-2 mapping is not present,
+                * thus triggering user_mem_abort() for it. */
+               if (vma->vm_flags & VM_PFNMAP) {
+                       vma->vm_flags &= ~VM_PFNMAP;
+               }
                 if (vma->vm_flags & VM_PFNMAP) {
                         gpa_t gpa = mem->guest_phys_addr +
                                     (vm_start - mem->userspace_addr);

>>
>> The proposal is to check the is_iomap flag before executing the THP
>> adjustment function, transparent_hugepage_adjust().
>>
>> [4] THP Oops:
>>> pc: kvm_is_transparent_hugepage+0x18/0xb0
>>> ...
>>> ...
>>> user_mem_abort+0x340/0x9b8
>>> kvm_handle_guest_abort+0x248/0x468
>>> handle_exit+0x150/0x1b0
>>> kvm_arch_vcpu_ioctl_run+0x4d4/0x778
>>> kvm_vcpu_ioctl+0x3c0/0x858
>>> ksys_ioctl+0x84/0xb8
>>> __arm64_sys_ioctl+0x28/0x38
>>
>> Tested on a Huawei Kunpeng Taishan-200 arm64 server, using a VFIO-mdev
>> device.
>> Linux tip: 583090b1
>>
>> Fixes: 6d674e28 ("KVM: arm/arm64: Properly handle faulting of device mappings")
>> Signed-off-by: Santosh Shukla <sashukla@nvidia.com>
>> ---
>>  arch/arm64/kvm/mmu.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 3d26b47..ff15357 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>        * If we are not forced to use page mapping, check if we are
>>        * backed by a THP and thus use block mapping if possible.
>>        */
>> -     if (vma_pagesize == PAGE_SIZE && !force_pte)
>> +     if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
>>               vma_pagesize = transparent_hugepage_adjust(memslot, hva,
>>                                                          &pfn, &fault_ipa);
>>       if (writable)
>>       if (writable)
>
> Why don't you directly set force_pte to true at the point where we
> update the flags? It certainly would be a bit more readable:
>
Yes.
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3d26b47a1343..7a4ad984d54e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1920,6 +1920,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>        if (kvm_is_device_pfn(pfn)) {
>                mem_type = PAGE_S2_DEVICE;
>                flags |= KVM_S2PTE_FLAG_IS_IOMAP;
> +               force_pte = true;
>        } else if (logging_active) {
>                /*
>                 * Faults on pages in a memslot with logging enabled
>
> and almost directly applies to what we have queued for 5.10.
>
Right. I believe the above code has changed slightly in linux-next,
as of commit 9695c4ff.

The modified version looks like below:

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 19aacc7..d4cd253 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -839,6 +839,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,

         if (kvm_is_device_pfn(pfn)) {
                 device = true;
+               force_pte = true;
         } else if (logging_active && !write_fault) {
                 /*
                  * Only actually map the page as writable if this was a write

Please let me know if the above is okay, and I will send out a v2.

Thanks.

Santosh


> Thanks,
>
>         M.
> -- 
> Jazz is not dead. It just smells funny...


Thread overview:
2020-10-21 16:16 [PATCH] KVM: arm64: Correctly handle the mmio faulting Santosh Shukla
2020-10-23 11:29 ` Marc Zyngier
2020-10-26  4:56   ` Santosh Shukla [this message]
2020-10-26  6:50   ` Santosh Shukla
2021-04-21  2:59 ` Keqian Zhu
2021-04-21  6:20   ` Gavin Shan
2021-04-21  6:17     ` Keqian Zhu
2021-04-21 11:59       ` Marc Zyngier
2021-04-22  2:02         ` Gavin Shan
2021-04-22  6:50           ` Marc Zyngier
2021-04-22  7:36             ` Tarun Gupta (SW-GPU)
2021-04-22  8:00               ` Santosh Shukla
2021-04-23  1:06                 ` Keqian Zhu
2021-04-23  1:38             ` Gavin Shan
