* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-15 14:03 ` [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO Keqian Zhu
@ 2021-04-15 14:08 ` Keqian Zhu
2021-04-16 14:44 ` Marc Zyngier
2021-04-21 7:52 ` Gavin Shan
2021-04-23 0:37 ` Gavin Shan
2 siblings, 1 reply; 19+ messages in thread
From: Keqian Zhu @ 2021-04-15 14:08 UTC (permalink / raw)
To: linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier; +Cc: wanghaibin.wang
Hi Marc,
On 2021/4/15 22:03, Keqian Zhu wrote:
> The MMIO region of a device may be huge (GB level), so try to use
> block mapping in stage2 to speed up both map and unmap.
>
> Compared to normal memory mapping, we should consider two more
> points when trying block mapping for an MMIO region:
>
> 1. For normal memory mapping, the PA (host physical address) and
> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request a hugepage, so we don't need to consider PA
> alignment when verifying block mapping. But for device memory
> mapping, the PA and HVA may have different alignment.
>
> 2. For normal memory mapping, we are sure the hugepage size properly
> fits into the vma, so we don't check whether the mapping size exceeds
> the boundary of the vma. But for device memory mapping, we should pay
> attention to this.
>
> This adds get_vma_page_shift() to get the page shift for both normal
> memory and device MMIO regions, and checks these two points when
> selecting the block mapping size for an MMIO region.
>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
> arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 51 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..5a1cc7751e6d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> return PAGE_SIZE;
> }
>
> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
> +{
> +        unsigned long pa;
> +
> +        if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
> +                return huge_page_shift(hstate_vma(vma));
> +
> +        if (!(vma->vm_flags & VM_PFNMAP))
> +                return PAGE_SHIFT;
> +
> +        VM_BUG_ON(is_vm_hugetlb_page(vma));
> +
> +        pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> +        if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> +            ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> +            ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> +                return PUD_SHIFT;
> +#endif
> +
> +        if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> +            ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> +            ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> +                return PMD_SHIFT;
> +
> +        return PAGE_SHIFT;
> +}
> +
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
> }
>
> - /* Let's check if we will get back a huge page backed by hugetlbfs */
> + /*
> + * Let's check if we will get back a huge page backed by hugetlbfs, or
> + * get block mapping for device MMIO region.
> + */
> mmap_read_lock(current->mm);
> vma = find_vma_intersection(current->mm, hva, hva + 1);
> if (unlikely(!vma)) {
> @@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
> }
>
> - if (is_vm_hugetlb_page(vma))
> - vma_shift = huge_page_shift(hstate_vma(vma));
> - else
> - vma_shift = PAGE_SHIFT;
> -
> - if (logging_active ||
> - (vma->vm_flags & VM_PFNMAP)) {
> + /*
> + * logging_active is guaranteed to never be true for VM_PFNMAP
> + * memslots.
> + */
> + if (logging_active) {
> force_pte = true;
> vma_shift = PAGE_SHIFT;
> + } else {
> + vma_shift = get_vma_page_shift(vma, hva);
> }
I use an if/else manner in v4; please check it. Thanks very much!
BRs,
Keqian
>
> switch (vma_shift) {
> @@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
>
> if (kvm_is_device_pfn(pfn)) {
> + /*
> + * If the page was identified as device early by looking at
> + * the VMA flags, vma_pagesize is already representing the
> + * largest quantity we can map. If instead it was mapped
> + * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
> + * and must not be upgraded.
> + *
> + * In both cases, we don't let transparent_hugepage_adjust()
> + * change things at the last minute.
> + */
> device = true;
> - force_pte = true;
> } else if (logging_active && !write_fault) {
> /*
> * Only actually map the page as writable if this was a write
> @@ -876,7 +917,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> * If we are not forced to use page mapping, check if we are
> * backed by a THP and thus use block mapping if possible.
> */
> - if (vma_pagesize == PAGE_SIZE && !force_pte)
> + if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
> vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> &pfn, &fault_ipa);
> if (writable)
>
^ permalink raw reply [flat|nested] 19+ messages in thread
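
For illustration, here is a minimal userspace sketch of the two extra
checks get_vma_page_shift() applies to a VM_PFNMAP vma. This is not kernel
code; the 2MiB PMD_SIZE (which assumes 4K pages) and the names are made up
for the demo. For a vma set up via remap_pfn_range(), vm_pgoff typically
holds the base PFN, which is how the patch derives the PA. A block mapping
is only usable when the HVA and PA share the same offset within the block
and the whole block lies inside the vma:

#include <stdio.h>

#define PMD_SIZE (1UL << 21)    /* 2MiB, assuming 4K pages */

static int pmd_block_ok(unsigned long hva, unsigned long pa,
                        unsigned long vm_start, unsigned long vm_end)
{
        unsigned long block = hva & ~(PMD_SIZE - 1);

        /* 1. HVA and PA must share the same offset within the block. */
        if ((hva & (PMD_SIZE - 1)) != (pa & (PMD_SIZE - 1)))
                return 0;

        /* 2. The whole block must lie inside the vma. */
        return block >= vm_start && block + PMD_SIZE <= vm_end;
}

int main(void)
{
        /* Same 0x1000 offset within a 2MiB block: block mapping works. */
        printf("%d\n", pmd_block_ok(0x40001000UL, 0x80001000UL,
                                    0x40000000UL, 0x40400000UL));

        /* Offsets differ: must fall back to 4K page mappings. */
        printf("%d\n", pmd_block_ok(0x40001000UL, 0x80002000UL,
                                    0x40000000UL, 0x40400000UL));
        return 0;
}
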
* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-15 14:08 ` Keqian Zhu
@ 2021-04-16 14:44 ` Marc Zyngier
2021-04-17 1:05 ` Keqian Zhu
0 siblings, 1 reply; 19+ messages in thread
From: Marc Zyngier @ 2021-04-16 14:44 UTC (permalink / raw)
To: Keqian Zhu; +Cc: linux-kernel, linux-arm-kernel, kvm, kvmarm, wanghaibin.wang
On Thu, 15 Apr 2021 15:08:09 +0100,
Keqian Zhu <zhukeqian1@huawei.com> wrote:
>
> Hi Marc,
>
> On 2021/4/15 22:03, Keqian Zhu wrote:
> > The MMIO region of a device may be huge (GB level), so try to use
> > block mapping in stage2 to speed up both map and unmap.
> >
> > Compared to normal memory mapping, we should consider two more
> > points when trying block mapping for an MMIO region:
> >
> > 1. For normal memory mapping, the PA (host physical address) and
> > HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
> > the HVA to request a hugepage, so we don't need to consider PA
> > alignment when verifying block mapping. But for device memory
> > mapping, the PA and HVA may have different alignment.
> >
> > 2. For normal memory mapping, we are sure the hugepage size properly
> > fits into the vma, so we don't check whether the mapping size exceeds
> > the boundary of the vma. But for device memory mapping, we should pay
> > attention to this.
> >
> > This adds get_vma_page_shift() to get the page shift for both normal
> > memory and device MMIO regions, and checks these two points when
> > selecting the block mapping size for an MMIO region.
> >
> > Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> > ---
> > arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
> > 1 file changed, 51 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index c59af5ca01b0..5a1cc7751e6d 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> > return PAGE_SIZE;
> > }
> >
> > +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
> > +{
> > + unsigned long pa;
> > +
> > + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
> > + return huge_page_shift(hstate_vma(vma));
> > +
> > + if (!(vma->vm_flags & VM_PFNMAP))
> > + return PAGE_SHIFT;
> > +
> > + VM_BUG_ON(is_vm_hugetlb_page(vma));
> > +
> > + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> > +
> > +#ifndef __PAGETABLE_PMD_FOLDED
> > + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> > + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> > + ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> > + return PUD_SHIFT;
> > +#endif
> > +
> > + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> > + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> > + ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> > + return PMD_SHIFT;
> > +
> > + return PAGE_SHIFT;
> > +}
> > +
> > static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > struct kvm_memory_slot *memslot, unsigned long hva,
> > unsigned long fault_status)
> > @@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > return -EFAULT;
> > }
> >
> > - /* Let's check if we will get back a huge page backed by hugetlbfs */
> > + /*
> > + * Let's check if we will get back a huge page backed by hugetlbfs, or
> > + * get block mapping for device MMIO region.
> > + */
> > mmap_read_lock(current->mm);
> > vma = find_vma_intersection(current->mm, hva, hva + 1);
> > if (unlikely(!vma)) {
> > @@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > return -EFAULT;
> > }
> >
> > - if (is_vm_hugetlb_page(vma))
> > - vma_shift = huge_page_shift(hstate_vma(vma));
> > - else
> > - vma_shift = PAGE_SHIFT;
> > -
> > - if (logging_active ||
> > - (vma->vm_flags & VM_PFNMAP)) {
> > + /*
> > + * logging_active is guaranteed to never be true for VM_PFNMAP
> > + * memslots.
> > + */
> > + if (logging_active) {
> > force_pte = true;
> > vma_shift = PAGE_SHIFT;
> > + } else {
> > + vma_shift = get_vma_page_shift(vma, hva);
> > }
> I use an if/else manner in v4; please check it. Thanks very much!
That's fine. However, it is getting a bit late for 5.13, and we don't
have much time to let it simmer in -next. I'll probably wait until
after the merge window to pick it up.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-16 14:44 ` Marc Zyngier
@ 2021-04-17 1:05 ` Keqian Zhu
0 siblings, 0 replies; 19+ messages in thread
From: Keqian Zhu @ 2021-04-17 1:05 UTC (permalink / raw)
To: Marc Zyngier; +Cc: linux-kernel, linux-arm-kernel, kvm, kvmarm, wanghaibin.wang
Hi Marc,
On 2021/4/16 22:44, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 15:08:09 +0100,
> Keqian Zhu <zhukeqian1@huawei.com> wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/15 22:03, Keqian Zhu wrote:
>>> The MMIO region of a device may be huge (GB level), so try to use
>>> block mapping in stage2 to speed up both map and unmap.
>>>
>>> Compared to normal memory mapping, we should consider two more
>>> points when trying block mapping for an MMIO region:
>>>
>>> 1. For normal memory mapping, the PA (host physical address) and
>>> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
>>> the HVA to request a hugepage, so we don't need to consider PA
>>> alignment when verifying block mapping. But for device memory
>>> mapping, the PA and HVA may have different alignment.
>>>
>>> 2. For normal memory mapping, we are sure the hugepage size properly
>>> fits into the vma, so we don't check whether the mapping size exceeds
>>> the boundary of the vma. But for device memory mapping, we should pay
>>> attention to this.
>>>
>>> This adds get_vma_page_shift() to get the page shift for both normal
>>> memory and device MMIO regions, and checks these two points when
>>> selecting the block mapping size for an MMIO region.
>>>
>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>> ---
>>> arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
>>> 1 file changed, 51 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index c59af5ca01b0..5a1cc7751e6d 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>>> return PAGE_SIZE;
>>> }
>>>
[...]
>>> + /*
>>> + * logging_active is guaranteed to never be true for VM_PFNMAP
>>> + * memslots.
>>> + */
>>> + if (logging_active) {
>>> force_pte = true;
>>> vma_shift = PAGE_SHIFT;
>>> + } else {
>>> + vma_shift = get_vma_page_shift(vma, hva);
>>> }
>> I use an if/else manner in v4; please check it. Thanks very much!
>
> That's fine. However, it is getting a bit late for 5.13, and we don't
> have much time to let it simmer in -next. I'll probably wait until
> after the merge window to pick it up.
OK, no problem. Thanks! :)
BRs,
Keqian
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-15 14:03 ` [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO Keqian Zhu
2021-04-15 14:08 ` Keqian Zhu
@ 2021-04-21 7:52 ` Gavin Shan
2021-04-21 6:36 ` Keqian Zhu
2021-04-23 0:37 ` Gavin Shan
2 siblings, 1 reply; 19+ messages in thread
From: Gavin Shan @ 2021-04-21 7:52 UTC (permalink / raw)
To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier
Cc: wanghaibin.wang
Hi Keqian,
On 4/16/21 12:03 AM, Keqian Zhu wrote:
> The MMIO region of a device may be huge (GB level), so try to use
> block mapping in stage2 to speed up both map and unmap.
>
> Compared to normal memory mapping, we should consider two more
> points when trying block mapping for an MMIO region:
>
> 1. For normal memory mapping, the PA (host physical address) and
> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request a hugepage, so we don't need to consider PA
> alignment when verifying block mapping. But for device memory
> mapping, the PA and HVA may have different alignment.
>
> 2. For normal memory mapping, we are sure the hugepage size properly
> fits into the vma, so we don't check whether the mapping size exceeds
> the boundary of the vma. But for device memory mapping, we should pay
> attention to this.
>
> This adds get_vma_page_shift() to get the page shift for both normal
> memory and device MMIO regions, and checks these two points when
> selecting the block mapping size for an MMIO region.
>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
> arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 51 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..5a1cc7751e6d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> return PAGE_SIZE;
> }
>
> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
> +{
> + unsigned long pa;
> +
> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
> + return huge_page_shift(hstate_vma(vma));
> +
> + if (!(vma->vm_flags & VM_PFNMAP))
> + return PAGE_SHIFT;
> +
> + VM_BUG_ON(is_vm_hugetlb_page(vma));
> +
I don't understand how VM_PFNMAP could be set for a hugetlbfs-related vma.
I think they are exclusive, meaning the flag is never set for a
hugetlbfs vma. If that's true, VM_PFNMAP needn't be checked on a hugetlbfs
vma and the VM_BUG_ON() becomes unnecessary.
> + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> + return PUD_SHIFT;
> +#endif
> +
> + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> + return PMD_SHIFT;
> +
> + return PAGE_SHIFT;
> +}
> +
There is "switch(...)" fallback mechanism in user_mem_abort(). PUD_SIZE/PMD_SIZE
can be downgraded accordingly if the addresses fails in the alignment check
by fault_supports_stage2_huge_mapping(). I think it would make user_mem_abort()
simplified if the logic can be moved to get_vma_page_shift().
Another question if we need the check from fault_supports_stage2_huge_mapping()
if VM_PFNMAP area is going to be covered by block mapping. If so, the "switch(...)"
fallback mechanism needs to be part of get_vma_page_shift().
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
> }
>
> - /* Let's check if we will get back a huge page backed by hugetlbfs */
> + /*
> + * Let's check if we will get back a huge page backed by hugetlbfs, or
> + * get block mapping for device MMIO region.
> + */
> mmap_read_lock(current->mm);
> vma = find_vma_intersection(current->mm, hva, hva + 1);
> if (unlikely(!vma)) {
> @@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
> }
>
> - if (is_vm_hugetlb_page(vma))
> - vma_shift = huge_page_shift(hstate_vma(vma));
> - else
> - vma_shift = PAGE_SHIFT;
> -
> - if (logging_active ||
> - (vma->vm_flags & VM_PFNMAP)) {
> + /*
> + * logging_active is guaranteed to never be true for VM_PFNMAP
> + * memslots.
> + */
> + if (logging_active) {
> force_pte = true;
> vma_shift = PAGE_SHIFT;
> + } else {
> + vma_shift = get_vma_page_shift(vma, hva);
> }
>
> switch (vma_shift) {
> @@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
>
> if (kvm_is_device_pfn(pfn)) {
> + /*
> + * If the page was identified as device early by looking at
> + * the VMA flags, vma_pagesize is already representing the
> + * largest quantity we can map. If instead it was mapped
> + * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
> + * and must not be upgraded.
> + *
> + * In both cases, we don't let transparent_hugepage_adjust()
> + * change things at the last minute.
> + */
> device = true;
> - force_pte = true;
> } else if (logging_active && !write_fault) {
> /*
> * Only actually map the page as writable if this was a write
> @@ -876,7 +917,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> * If we are not forced to use page mapping, check if we are
> * backed by a THP and thus use block mapping if possible.
> */
> - if (vma_pagesize == PAGE_SIZE && !force_pte)
> + if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
> vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> &pfn, &fault_ipa);
> if (writable)
>
Thanks,
Gavin
^ permalink raw reply [flat|nested] 19+ messages in thread
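
A rough, runnable userspace model of the "switch(...)" downgrade Gavin
refers to above: start from the shift derived from the vma and fall back
PUD -> PMD -> PAGE whenever the memslot-side check fails. Here
supports_huge() is a hypothetical stand-in for
fault_supports_stage2_huge_mapping(), which verifies HVA/IPA congruence
and memslot bounds; all names and addresses are illustrative, not the
kernel's:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PUD_SHIFT  30

/*
 * Stand-in for fault_supports_stage2_huge_mapping(): the block must have
 * matching HVA/IPA offsets and fit entirely inside the memslot.
 */
static bool supports_huge(unsigned long hva, unsigned long ipa,
                          unsigned long slot_start, unsigned long slot_end,
                          int shift)
{
        unsigned long size = 1UL << shift;
        unsigned long block = hva & ~(size - 1);

        if ((hva & (size - 1)) != (ipa & (size - 1)))
                return false;

        return block >= slot_start && block + size <= slot_end;
}

static int downgrade_shift(int vma_shift, unsigned long hva,
                           unsigned long ipa, unsigned long slot_start,
                           unsigned long slot_end)
{
        switch (vma_shift) {
        case PUD_SHIFT:
                if (supports_huge(hva, ipa, slot_start, slot_end, PUD_SHIFT))
                        break;
                vma_shift = PMD_SHIFT;
                /* fallthrough */
        case PMD_SHIFT:
                if (supports_huge(hva, ipa, slot_start, slot_end, PMD_SHIFT))
                        break;
                vma_shift = PAGE_SHIFT;
                /* fallthrough */
        default:
                break;
        }

        return vma_shift;
}

int main(void)
{
        /* 1GiB-aligned on both sides, 1GiB memslot: PUD_SHIFT (30) kept. */
        printf("%d\n", downgrade_shift(PUD_SHIFT, 0x80000000UL, 0x40000000UL,
                                       0x80000000UL, 0xc0000000UL));

        /* Memslot too small for a 1GiB block: downgraded to PMD_SHIFT (21). */
        printf("%d\n", downgrade_shift(PUD_SHIFT, 0x80000000UL, 0x40000000UL,
                                       0x80000000UL, 0x80400000UL));
        return 0;
}
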
* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-21 7:52 ` Gavin Shan
@ 2021-04-21 6:36 ` Keqian Zhu
2021-04-22 2:25 ` Gavin Shan
0 siblings, 1 reply; 19+ messages in thread
From: Keqian Zhu @ 2021-04-21 6:36 UTC (permalink / raw)
To: Gavin Shan, linux-kernel, linux-arm-kernel, kvm, kvmarm
Cc: Marc Zyngier, wanghaibin.wang
On 2021/4/21 15:52, Gavin Shan wrote:
> Hi Keqian,
>
> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>> The MMIO region of a device may be huge (GB level), so try to use
>> block mapping in stage2 to speed up both map and unmap.
>>
>> Compared to normal memory mapping, we should consider two more
>> points when trying block mapping for an MMIO region:
>>
>> 1. For normal memory mapping, the PA (host physical address) and
>> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
>> the HVA to request a hugepage, so we don't need to consider PA
>> alignment when verifying block mapping. But for device memory
>> mapping, the PA and HVA may have different alignment.
>>
>> 2. For normal memory mapping, we are sure the hugepage size properly
>> fits into the vma, so we don't check whether the mapping size exceeds
>> the boundary of the vma. But for device memory mapping, we should pay
>> attention to this.
>>
>> This adds get_vma_page_shift() to get the page shift for both normal
>> memory and device MMIO regions, and checks these two points when
>> selecting the block mapping size for an MMIO region.
>>
>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>> ---
>> arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 51 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index c59af5ca01b0..5a1cc7751e6d 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>> return PAGE_SIZE;
>> }
>> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
>> +{
>> + unsigned long pa;
>> +
>> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
>> + return huge_page_shift(hstate_vma(vma));
>> +
>> + if (!(vma->vm_flags & VM_PFNMAP))
>> + return PAGE_SHIFT;
>> +
>> + VM_BUG_ON(is_vm_hugetlb_page(vma));
>> +
>
> I don't understand how VM_PFNMAP could be set for a hugetlbfs-related vma.
> I think they are exclusive, meaning the flag is never set for a
> hugetlbfs vma. If that's true, VM_PFNMAP needn't be checked on a hugetlbfs
> vma and the VM_BUG_ON() becomes unnecessary.
Yes, but we're not sure all drivers follow this rule. Adding a BUG_ON() is
a way to catch such issues.
>
>> + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>> +
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>> + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
>> + ALIGN(hva, PUD_SIZE) <= vma->vm_end)
>> + return PUD_SHIFT;
>> +#endif
>> +
>> + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>> + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
>> + ALIGN(hva, PMD_SIZE) <= vma->vm_end)
>> + return PMD_SHIFT;
>> +
>> + return PAGE_SHIFT;
>> +}
>> +
>
> There is "switch(...)" fallback mechanism in user_mem_abort(). PUD_SIZE/PMD_SIZE
> can be downgraded accordingly if the addresses fails in the alignment check
> by fault_supports_stage2_huge_mapping(). I think it would make user_mem_abort()
> simplified if the logic can be moved to get_vma_page_shift().
>
> Another question if we need the check from fault_supports_stage2_huge_mapping()
> if VM_PFNMAP area is going to be covered by block mapping. If so, the "switch(...)"
> fallback mechanism needs to be part of get_vma_page_shift().
Yes, good suggestion. My idea is that we can keep this series simpler and do further
optimization in another patch series. Do you mind sending a patch?
Thanks,
Keqian
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-21 6:36 ` Keqian Zhu
@ 2021-04-22 2:25 ` Gavin Shan
2021-04-22 6:51 ` Marc Zyngier
0 siblings, 1 reply; 19+ messages in thread
From: Gavin Shan @ 2021-04-22 2:25 UTC (permalink / raw)
To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm
Cc: Marc Zyngier, wanghaibin.wang
Hi Keqian,
On 4/21/21 4:36 PM, Keqian Zhu wrote:
> On 2021/4/21 15:52, Gavin Shan wrote:
>> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>>> The MMIO region of a device may be huge (GB level), so try to use
>>> block mapping in stage2 to speed up both map and unmap.
>>>
>>> Compared to normal memory mapping, we should consider two more
>>> points when trying block mapping for an MMIO region:
>>>
>>> 1. For normal memory mapping, the PA (host physical address) and
>>> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
>>> the HVA to request a hugepage, so we don't need to consider PA
>>> alignment when verifying block mapping. But for device memory
>>> mapping, the PA and HVA may have different alignment.
>>>
>>> 2. For normal memory mapping, we are sure the hugepage size properly
>>> fits into the vma, so we don't check whether the mapping size exceeds
>>> the boundary of the vma. But for device memory mapping, we should pay
>>> attention to this.
>>>
>>> This adds get_vma_page_shift() to get the page shift for both normal
>>> memory and device MMIO regions, and checks these two points when
>>> selecting the block mapping size for an MMIO region.
>>>
>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>> ---
>>> arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
>>> 1 file changed, 51 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index c59af5ca01b0..5a1cc7751e6d 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>>> return PAGE_SIZE;
>>> }
>>> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
>>> +{
>>> + unsigned long pa;
>>> +
>>> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
>>> + return huge_page_shift(hstate_vma(vma));
>>> +
>>> + if (!(vma->vm_flags & VM_PFNMAP))
>>> + return PAGE_SHIFT;
>>> +
>>> + VM_BUG_ON(is_vm_hugetlb_page(vma));
>>> +
>>
>> I don't understand how VM_PFNMAP could be set for a hugetlbfs-related vma.
>> I think they are exclusive, meaning the flag is never set for a
>> hugetlbfs vma. If that's true, VM_PFNMAP needn't be checked on a hugetlbfs
>> vma and the VM_BUG_ON() becomes unnecessary.
> Yes, but we're not sure all drivers follow this rule. Adding a BUG_ON() is
> a way to catch such issues.
>
I think I didn't make things clear. What I meant is VM_PFNMAP can't
be set for hugetlbfs VMAs. So the checks here can be simplified as
below if you agree:
if (is_vm_hugetlb_page(vma))
        return huge_page_shift(hstate_vma(vma));

if (!(vma->vm_flags & VM_PFNMAP))
        return PAGE_SHIFT;

VM_BUG_ON(is_vm_hugetlb_page(vma));     /* Can be dropped */
>>
>>> + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
>>> +
>>> +#ifndef __PAGETABLE_PMD_FOLDED
>>> + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
>>> + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
>>> + ALIGN(hva, PUD_SIZE) <= vma->vm_end)
>>> + return PUD_SHIFT;
>>> +#endif
>>> +
>>> + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
>>> + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
>>> + ALIGN(hva, PMD_SIZE) <= vma->vm_end)
>>> + return PMD_SHIFT;
>>> +
>>> + return PAGE_SHIFT;
>>> +}
>>> +
>>
>> There is "switch(...)" fallback mechanism in user_mem_abort(). PUD_SIZE/PMD_SIZE
>> can be downgraded accordingly if the addresses fails in the alignment check
>> by fault_supports_stage2_huge_mapping(). I think it would make user_mem_abort()
>> simplified if the logic can be moved to get_vma_page_shift().
>>
>> Another question if we need the check from fault_supports_stage2_huge_mapping()
>> if VM_PFNMAP area is going to be covered by block mapping. If so, the "switch(...)"
>> fallback mechanism needs to be part of get_vma_page_shift().
> Yes, good suggestion. My idea is that we can keep this series simpler and do further
> optimization in another patch series. Do you mind sending a patch?
>
Yeah, it's fine to keep this series as it is. There are 3 checks applied
here for the MMIO region (hva/hpa/ipa), and they're distributed across two
functions, making the code a bit hard to follow. I can post a patch to
simplify it after your series gets merged :)
Thanks,
Gavin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-22 2:25 ` Gavin Shan
@ 2021-04-22 6:51 ` Marc Zyngier
2021-04-23 0:42 ` Gavin Shan
0 siblings, 1 reply; 19+ messages in thread
From: Marc Zyngier @ 2021-04-22 6:51 UTC (permalink / raw)
To: Gavin Shan
Cc: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, wanghaibin.wang
On Thu, 22 Apr 2021 03:25:23 +0100,
Gavin Shan <gshan@redhat.com> wrote:
>
> Hi Keqian,
>
> On 4/21/21 4:36 PM, Keqian Zhu wrote:
> > On 2021/4/21 15:52, Gavin Shan wrote:
> >> On 4/16/21 12:03 AM, Keqian Zhu wrote:
> >>> The MMIO region of a device may be huge (GB level), so try to use
> >>> block mapping in stage2 to speed up both map and unmap.
> >>>
> >>> Compared to normal memory mapping, we should consider two more
> >>> points when trying block mapping for an MMIO region:
> >>>
> >>> 1. For normal memory mapping, the PA (host physical address) and
> >>> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
> >>> the HVA to request a hugepage, so we don't need to consider PA
> >>> alignment when verifying block mapping. But for device memory
> >>> mapping, the PA and HVA may have different alignment.
> >>>
> >>> 2. For normal memory mapping, we are sure the hugepage size properly
> >>> fits into the vma, so we don't check whether the mapping size exceeds
> >>> the boundary of the vma. But for device memory mapping, we should pay
> >>> attention to this.
> >>>
> >>> This adds get_vma_page_shift() to get the page shift for both normal
> >>> memory and device MMIO regions, and checks these two points when
> >>> selecting the block mapping size for an MMIO region.
> >>>
> >>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> >>> ---
> >>> arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
> >>> 1 file changed, 51 insertions(+), 10 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >>> index c59af5ca01b0..5a1cc7751e6d 100644
> >>> --- a/arch/arm64/kvm/mmu.c
> >>> +++ b/arch/arm64/kvm/mmu.c
> >>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> >>> return PAGE_SIZE;
> >>> }
> >>> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
> >>> +{
> >>> + unsigned long pa;
> >>> +
> >>> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
> >>> + return huge_page_shift(hstate_vma(vma));
> >>> +
> >>> + if (!(vma->vm_flags & VM_PFNMAP))
> >>> + return PAGE_SHIFT;
> >>> +
> >>> + VM_BUG_ON(is_vm_hugetlb_page(vma));
> >>> +
> >>
> >> I don't understand how VM_PFNMAP could be set for a hugetlbfs-related vma.
> >> I think they are exclusive, meaning the flag is never set for a
> >> hugetlbfs vma. If that's true, VM_PFNMAP needn't be checked on a hugetlbfs
> >> vma and the VM_BUG_ON() becomes unnecessary.
> > Yes, but we're not sure all drivers follow this rule. Adding a BUG_ON() is
> > a way to catch such issues.
> >
>
> I think I didn't make things clear. What I meant is VM_PFNMAP can't
> be set for hugetlbfs VMAs. So the checks here can be simplified as
> below if you agree:
>
> if (is_vm_hugetlb_page(vma))
>         return huge_page_shift(hstate_vma(vma));
>
> if (!(vma->vm_flags & VM_PFNMAP))
>         return PAGE_SHIFT;
>
> VM_BUG_ON(is_vm_hugetlb_page(vma)); /* Can be dropped */
No. If this case happens, I want to see it. I have explicitly asked
for it, and this check stays.
M.
--
Without deviation from the norm, progress is not possible.
^ permalink raw reply [flat|nested] 19+ messages in thread
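
Some context for the exchange above: VM_BUG_ON() is only an active check
on kernels built with CONFIG_DEBUG_VM, so production builds pay nothing
for keeping it. A runnable userspace model of that behaviour follows,
paraphrasing the real macro in include/linux/mmdebug.h (whose exact form
varies by kernel version):

#include <stdio.h>
#include <stdlib.h>

#ifdef CONFIG_DEBUG_VM
/* Debug builds: the condition is evaluated and trips a BUG. */
#define VM_BUG_ON(cond) \
        do { if (cond) { fprintf(stderr, "BUG!\n"); abort(); } } while (0)
#else
/* Non-debug builds: the check compiles away to nothing at runtime. */
#define VM_BUG_ON(cond) do { (void)sizeof(!!(cond)); } while (0)
#endif

int main(void)
{
        VM_BUG_ON(1);   /* aborts only when compiled with -DCONFIG_DEBUG_VM */
        printf("built without CONFIG_DEBUG_VM: check compiled away\n");
        return 0;
}
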
* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-22 6:51 ` Marc Zyngier
@ 2021-04-23 0:42 ` Gavin Shan
0 siblings, 0 replies; 19+ messages in thread
From: Gavin Shan @ 2021-04-23 0:42 UTC (permalink / raw)
To: Marc Zyngier
Cc: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, wanghaibin.wang
Hi Marc,
On 4/22/21 4:51 PM, Marc Zyngier wrote:
> On Thu, 22 Apr 2021 03:25:23 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>> On 4/21/21 4:36 PM, Keqian Zhu wrote:
>>> On 2021/4/21 15:52, Gavin Shan wrote:
>>>> On 4/16/21 12:03 AM, Keqian Zhu wrote:
>>>>> The MMIO region of a device may be huge (GB level), so try to use
>>>>> block mapping in stage2 to speed up both map and unmap.
>>>>>
>>>>> Compared to normal memory mapping, we should consider two more
>>>>> points when trying block mapping for an MMIO region:
>>>>>
>>>>> 1. For normal memory mapping, the PA (host physical address) and
>>>>> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
>>>>> the HVA to request a hugepage, so we don't need to consider PA
>>>>> alignment when verifying block mapping. But for device memory
>>>>> mapping, the PA and HVA may have different alignment.
>>>>>
>>>>> 2. For normal memory mapping, we are sure the hugepage size properly
>>>>> fits into the vma, so we don't check whether the mapping size exceeds
>>>>> the boundary of the vma. But for device memory mapping, we should pay
>>>>> attention to this.
>>>>>
>>>>> This adds get_vma_page_shift() to get the page shift for both normal
>>>>> memory and device MMIO regions, and checks these two points when
>>>>> selecting the block mapping size for an MMIO region.
>>>>>
>>>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
>>>>> ---
>>>>> arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
>>>>> 1 file changed, 51 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>> index c59af5ca01b0..5a1cc7751e6d 100644
>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
>>>>> return PAGE_SIZE;
>>>>> }
>>>>> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
>>>>> +{
>>>>> + unsigned long pa;
>>>>> +
>>>>> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
>>>>> + return huge_page_shift(hstate_vma(vma));
>>>>> +
>>>>> + if (!(vma->vm_flags & VM_PFNMAP))
>>>>> + return PAGE_SHIFT;
>>>>> +
>>>>> + VM_BUG_ON(is_vm_hugetlb_page(vma));
>>>>> +
>>>>
>>>> I don't understand how VM_PFNMAP could be set for a hugetlbfs-related vma.
>>>> I think they are exclusive, meaning the flag is never set for a
>>>> hugetlbfs vma. If that's true, VM_PFNMAP needn't be checked on a hugetlbfs
>>>> vma and the VM_BUG_ON() becomes unnecessary.
>>> Yes, but we're not sure all drivers follow this rule. Adding a BUG_ON() is
>>> a way to catch such issues.
>>>
>>
>> I think I didn't make things clear. What I meant is VM_PFNMAP can't
>> be set for hugetlbfs VMAs. So the checks here can be simplified as
>> below if you agree:
>>
>> if (is_vm_hugetlb_page(vma))
>>         return huge_page_shift(hstate_vma(vma));
>>
>> if (!(vma->vm_flags & VM_PFNMAP))
>>         return PAGE_SHIFT;
>>
>> VM_BUG_ON(is_vm_hugetlb_page(vma)); /* Can be dropped */
>
> No. If this case happens, I want to see it. I have explicitly asked
> for it, and this check stays.
>
Thanks for the explanation. Keeping the VM_BUG_ON() sounds good to me too :)
Thanks,
Gavin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
2021-04-15 14:03 ` [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO Keqian Zhu
2021-04-15 14:08 ` Keqian Zhu
2021-04-21 7:52 ` Gavin Shan
@ 2021-04-23 0:37 ` Gavin Shan
2 siblings, 0 replies; 19+ messages in thread
From: Gavin Shan @ 2021-04-23 0:37 UTC (permalink / raw)
To: Keqian Zhu, linux-kernel, linux-arm-kernel, kvm, kvmarm, Marc Zyngier
Cc: wanghaibin.wang
On 4/16/21 12:03 AM, Keqian Zhu wrote:
> The MMIO region of a device may be huge (GB level), so try to use
> block mapping in stage2 to speed up both map and unmap.
>
> Compared to normal memory mapping, we should consider two more
> points when trying block mapping for an MMIO region:
>
> 1. For normal memory mapping, the PA (host physical address) and
> HVA have the same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request a hugepage, so we don't need to consider PA
> alignment when verifying block mapping. But for device memory
> mapping, the PA and HVA may have different alignment.
>
> 2. For normal memory mapping, we are sure the hugepage size properly
> fits into the vma, so we don't check whether the mapping size exceeds
> the boundary of the vma. But for device memory mapping, we should pay
> attention to this.
>
> This adds get_vma_page_shift() to get the page shift for both normal
> memory and device MMIO regions, and checks these two points when
> selecting the block mapping size for an MMIO region.
>
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
> arch/arm64/kvm/mmu.c | 61 ++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 51 insertions(+), 10 deletions(-)
>
Reviewed-by: Gavin Shan <gshan@redhat.com>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..5a1cc7751e6d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> return PAGE_SIZE;
> }
>
> +static int get_vma_page_shift(struct vm_area_struct *vma, unsigned long hva)
> +{
> + unsigned long pa;
> +
> + if (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_PFNMAP))
> + return huge_page_shift(hstate_vma(vma));
> +
> + if (!(vma->vm_flags & VM_PFNMAP))
> + return PAGE_SHIFT;
> +
> + VM_BUG_ON(is_vm_hugetlb_page(vma));
> +
> + pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> + if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> + return PUD_SHIFT;
> +#endif
> +
> + if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> + ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> + ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> + return PMD_SHIFT;
> +
> + return PAGE_SHIFT;
> +}
> +
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -769,7 +798,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
> }
>
> - /* Let's check if we will get back a huge page backed by hugetlbfs */
> + /*
> + * Let's check if we will get back a huge page backed by hugetlbfs, or
> + * get block mapping for device MMIO region.
> + */
> mmap_read_lock(current->mm);
> vma = find_vma_intersection(current->mm, hva, hva + 1);
> if (unlikely(!vma)) {
> @@ -778,15 +810,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
> }
>
> - if (is_vm_hugetlb_page(vma))
> - vma_shift = huge_page_shift(hstate_vma(vma));
> - else
> - vma_shift = PAGE_SHIFT;
> -
> - if (logging_active ||
> - (vma->vm_flags & VM_PFNMAP)) {
> + /*
> + * logging_active is guaranteed to never be true for VM_PFNMAP
> + * memslots.
> + */
> + if (logging_active) {
> force_pte = true;
> vma_shift = PAGE_SHIFT;
> + } else {
> + vma_shift = get_vma_page_shift(vma, hva);
> }
>
> switch (vma_shift) {
> @@ -854,8 +886,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> return -EFAULT;
>
> if (kvm_is_device_pfn(pfn)) {
> + /*
> + * If the page was identified as device early by looking at
> + * the VMA flags, vma_pagesize is already representing the
> + * largest quantity we can map. If instead it was mapped
> + * via gfn_to_pfn_prot(), vma_pagesize is set to PAGE_SIZE
> + * and must not be upgraded.
> + *
> + * In both cases, we don't let transparent_hugepage_adjust()
> + * change things at the last minute.
> + */
> device = true;
> - force_pte = true;
> } else if (logging_active && !write_fault) {
> /*
> * Only actually map the page as writable if this was a write
> @@ -876,7 +917,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> * If we are not forced to use page mapping, check if we are
> * backed by a THP and thus use block mapping if possible.
> */
> - if (vma_pagesize == PAGE_SIZE && !force_pte)
> + if (vma_pagesize == PAGE_SIZE && !(force_pte || device))
> vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> &pfn, &fault_ipa);
> if (writable)
>
^ permalink raw reply [flat|nested] 19+ messages in thread