All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marc Zyngier <maz@kernel.org>
To: Keqian Zhu <zhukeqian1@huawei.com>
Cc: <linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>, <kvm@vger.kernel.org>,
	<kvmarm@lists.cs.columbia.edu>, <wanghaibin.wang@huawei.com>,
	Santosh Shukla <sashukla@nvidia.com>
Subject: Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
Date: Wed, 14 Apr 2021 10:05:05 +0100	[thread overview]
Message-ID: <87pmyxme2m.wl-maz@kernel.org> (raw)
In-Reply-To: <20210414065109.8616-3-zhukeqian1@huawei.com>

+ Santosh, who found some interesting bugs in that area before.

On Wed, 14 Apr 2021 07:51:09 +0100,
Keqian Zhu <zhukeqian1@huawei.com> wrote:
> 
> The MMIO region of a device maybe huge (GB level), try to use
> block mapping in stage2 to speedup both map and unmap.
> 
> Compared to normal memory mapping, we should consider two more
> points when try block mapping for MMIO region:
> 
> 1. For normal memory mapping, the PA(host physical address) and
> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request hugepage, so we don't need to consider PA
> alignment when verifing block mapping. But for device memory
> mapping, the PA and HVA may have different alignment.
> 
> 2. For normal memory mapping, we are sure hugepage size properly
> fit into vma, so we don't check whether the mapping size exceeds
> the boundary of vma. But for device memory mapping, we should pay
> attention to this.
> 
> This adds device_rough_page_shift() to check these two points when
> selecting block mapping size.
> 
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
>  arch/arm64/kvm/mmu.c | 37 +++++++++++++++++++++++++++++++++----
>  1 file changed, 33 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..1a6d96169d60 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>  	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>  }
>  
> +/*
> + * Find a max mapping size that properly insides the vma. And hva and pa must
> + * have the same alignment to this mapping size. It's rough as there are still
> + * other restrictions, will be checked by fault_supports_stage2_huge_mapping().
> + */
> +static short device_rough_page_shift(struct vm_area_struct *vma,
> +				     unsigned long hva)

My earlier question still stands. Under which circumstances would this
function return something that is *not* the final mapping size? I
really don't see a reason why this would not return the final mapping
size.

> +{
> +	phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> +	if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> +	    ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> +	    ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> +		return PUD_SHIFT;
> +#endif
> +
> +	if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> +	    ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> +	    ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> +		return PMD_SHIFT;
> +
> +	return PAGE_SHIFT;
> +}
> +
>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>  					       unsigned long hva,
>  					       unsigned long map_size)
> @@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		return -EFAULT;
>  	}
>  
> -	/* Let's check if we will get back a huge page backed by hugetlbfs */
> +	/*
> +	 * Let's check if we will get back a huge page backed by hugetlbfs, or
> +	 * get block mapping for device MMIO region.
> +	 */
>  	mmap_read_lock(current->mm);
>  	vma = find_vma_intersection(current->mm, hva, hva + 1);
>  	if (unlikely(!vma)) {
> @@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	if (is_vm_hugetlb_page(vma))
>  		vma_shift = huge_page_shift(hstate_vma(vma));
> +	else if (vma->vm_flags & VM_PFNMAP)
> +		vma_shift = device_rough_page_shift(vma, hva);

What prevents a VMA from having both VM_HUGETLB and VM_PFNMAP? This is
pretty unlikely, but I'd like to see this case catered for.

>  	else
>  		vma_shift = PAGE_SHIFT;
>  
> -	if (logging_active ||
> -	    (vma->vm_flags & VM_PFNMAP)) {
> +	if (logging_active) {
>  		force_pte = true;
>  		vma_shift = PAGE_SHIFT;
>  	}
> @@ -855,7 +884,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	if (kvm_is_device_pfn(pfn)) {
>  		device = true;
> -		force_pte = true;
> +		force_pte = (vma_pagesize == PAGE_SIZE);

Why do we need to set force_pte if we are already dealing with
PAGE_SIZE? I guess you are doing this for the sake of avoiding the
call to transparent_hugepage_adjust(), right?

I'd rather you simply don't try to upgrade a device mapping by
explicitly checking for this and keep force_pte for *memory*
exclusively.

Santosh, can you please take a look at this series and try to see if
the problem you fixed in [1] (which ended up as commit 91a2c34b7d6f)
is still OK with this series?

>  	} else if (logging_active && !write_fault) {
>  		/*
>  		 * Only actually map the page as writable if this was a write

Thanks,

	M.

[1] https://lore.kernel.org/kvmarm/1603711447-11998-1-git-send-email-sashukla@nvidia.com/

-- 
Without deviation from the norm, progress is not possible.

WARNING: multiple messages have this Message-ID (diff)
From: Marc Zyngier <maz@kernel.org>
To: Keqian Zhu <zhukeqian1@huawei.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Santosh Shukla <sashukla@nvidia.com>,
	kvmarm@lists.cs.columbia.edu,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
Date: Wed, 14 Apr 2021 10:05:05 +0100	[thread overview]
Message-ID: <87pmyxme2m.wl-maz@kernel.org> (raw)
In-Reply-To: <20210414065109.8616-3-zhukeqian1@huawei.com>

+ Santosh, who found some interesting bugs in that area before.

On Wed, 14 Apr 2021 07:51:09 +0100,
Keqian Zhu <zhukeqian1@huawei.com> wrote:
> 
> The MMIO region of a device maybe huge (GB level), try to use
> block mapping in stage2 to speedup both map and unmap.
> 
> Compared to normal memory mapping, we should consider two more
> points when try block mapping for MMIO region:
> 
> 1. For normal memory mapping, the PA(host physical address) and
> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request hugepage, so we don't need to consider PA
> alignment when verifing block mapping. But for device memory
> mapping, the PA and HVA may have different alignment.
> 
> 2. For normal memory mapping, we are sure hugepage size properly
> fit into vma, so we don't check whether the mapping size exceeds
> the boundary of vma. But for device memory mapping, we should pay
> attention to this.
> 
> This adds device_rough_page_shift() to check these two points when
> selecting block mapping size.
> 
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
>  arch/arm64/kvm/mmu.c | 37 +++++++++++++++++++++++++++++++++----
>  1 file changed, 33 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..1a6d96169d60 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>  	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>  }
>  
> +/*
> + * Find a max mapping size that properly insides the vma. And hva and pa must
> + * have the same alignment to this mapping size. It's rough as there are still
> + * other restrictions, will be checked by fault_supports_stage2_huge_mapping().
> + */
> +static short device_rough_page_shift(struct vm_area_struct *vma,
> +				     unsigned long hva)

My earlier question still stands. Under which circumstances would this
function return something that is *not* the final mapping size? I
really don't see a reason why this would not return the final mapping
size.

> +{
> +	phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> +	if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> +	    ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> +	    ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> +		return PUD_SHIFT;
> +#endif
> +
> +	if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> +	    ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> +	    ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> +		return PMD_SHIFT;
> +
> +	return PAGE_SHIFT;
> +}
> +
>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>  					       unsigned long hva,
>  					       unsigned long map_size)
> @@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		return -EFAULT;
>  	}
>  
> -	/* Let's check if we will get back a huge page backed by hugetlbfs */
> +	/*
> +	 * Let's check if we will get back a huge page backed by hugetlbfs, or
> +	 * get block mapping for device MMIO region.
> +	 */
>  	mmap_read_lock(current->mm);
>  	vma = find_vma_intersection(current->mm, hva, hva + 1);
>  	if (unlikely(!vma)) {
> @@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	if (is_vm_hugetlb_page(vma))
>  		vma_shift = huge_page_shift(hstate_vma(vma));
> +	else if (vma->vm_flags & VM_PFNMAP)
> +		vma_shift = device_rough_page_shift(vma, hva);

What prevents a VMA from having both VM_HUGETLB and VM_PFNMAP? This is
pretty unlikely, but I'd like to see this case catered for.

>  	else
>  		vma_shift = PAGE_SHIFT;
>  
> -	if (logging_active ||
> -	    (vma->vm_flags & VM_PFNMAP)) {
> +	if (logging_active) {
>  		force_pte = true;
>  		vma_shift = PAGE_SHIFT;
>  	}
> @@ -855,7 +884,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	if (kvm_is_device_pfn(pfn)) {
>  		device = true;
> -		force_pte = true;
> +		force_pte = (vma_pagesize == PAGE_SIZE);

Why do we need to set force_pte if we are already dealing with
PAGE_SIZE? I guess you are doing this for the sake of avoiding the
call to transparent_hugepage_adjust(), right?

I'd rather you simply don't try to upgrade a device mapping by
explicitly checking for this and keep force_pte for *memory*
exclusively.

Santosh, can you please take a look at this series and try to see if
the problem you fixed in [1] (which ended up as commit 91a2c34b7d6f)
is still OK with this series?

>  	} else if (logging_active && !write_fault) {
>  		/*
>  		 * Only actually map the page as writable if this was a write

Thanks,

	M.

[1] https://lore.kernel.org/kvmarm/1603711447-11998-1-git-send-email-sashukla@nvidia.com/

-- 
Without deviation from the norm, progress is not possible.
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

WARNING: multiple messages have this Message-ID (diff)
From: Marc Zyngier <maz@kernel.org>
To: Keqian Zhu <zhukeqian1@huawei.com>
Cc: <linux-kernel@vger.kernel.org>,
	<linux-arm-kernel@lists.infradead.org>, <kvm@vger.kernel.org>,
	<kvmarm@lists.cs.columbia.edu>, <wanghaibin.wang@huawei.com>,
	Santosh Shukla <sashukla@nvidia.com>
Subject: Re: [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO
Date: Wed, 14 Apr 2021 10:05:05 +0100	[thread overview]
Message-ID: <87pmyxme2m.wl-maz@kernel.org> (raw)
In-Reply-To: <20210414065109.8616-3-zhukeqian1@huawei.com>

+ Santosh, who found some interesting bugs in that area before.

On Wed, 14 Apr 2021 07:51:09 +0100,
Keqian Zhu <zhukeqian1@huawei.com> wrote:
> 
> The MMIO region of a device maybe huge (GB level), try to use
> block mapping in stage2 to speedup both map and unmap.
> 
> Compared to normal memory mapping, we should consider two more
> points when try block mapping for MMIO region:
> 
> 1. For normal memory mapping, the PA(host physical address) and
> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
> the HVA to request hugepage, so we don't need to consider PA
> alignment when verifing block mapping. But for device memory
> mapping, the PA and HVA may have different alignment.
> 
> 2. For normal memory mapping, we are sure hugepage size properly
> fit into vma, so we don't check whether the mapping size exceeds
> the boundary of vma. But for device memory mapping, we should pay
> attention to this.
> 
> This adds device_rough_page_shift() to check these two points when
> selecting block mapping size.
> 
> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> ---
>  arch/arm64/kvm/mmu.c | 37 +++++++++++++++++++++++++++++++++----
>  1 file changed, 33 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index c59af5ca01b0..1a6d96169d60 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -624,6 +624,31 @@ static void kvm_send_hwpoison_signal(unsigned long address, short lsb)
>  	send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
>  }
>  
> +/*
> + * Find a max mapping size that properly insides the vma. And hva and pa must
> + * have the same alignment to this mapping size. It's rough as there are still
> + * other restrictions, will be checked by fault_supports_stage2_huge_mapping().
> + */
> +static short device_rough_page_shift(struct vm_area_struct *vma,
> +				     unsigned long hva)

My earlier question still stands. Under which circumstances would this
function return something that is *not* the final mapping size? I
really don't see a reason why this would not return the final mapping
size.

> +{
> +	phys_addr_t pa = (vma->vm_pgoff << PAGE_SHIFT) + (hva - vma->vm_start);
> +
> +#ifndef __PAGETABLE_PMD_FOLDED
> +	if ((hva & (PUD_SIZE - 1)) == (pa & (PUD_SIZE - 1)) &&
> +	    ALIGN_DOWN(hva, PUD_SIZE) >= vma->vm_start &&
> +	    ALIGN(hva, PUD_SIZE) <= vma->vm_end)
> +		return PUD_SHIFT;
> +#endif
> +
> +	if ((hva & (PMD_SIZE - 1)) == (pa & (PMD_SIZE - 1)) &&
> +	    ALIGN_DOWN(hva, PMD_SIZE) >= vma->vm_start &&
> +	    ALIGN(hva, PMD_SIZE) <= vma->vm_end)
> +		return PMD_SHIFT;
> +
> +	return PAGE_SHIFT;
> +}
> +
>  static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
>  					       unsigned long hva,
>  					       unsigned long map_size)
> @@ -769,7 +794,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  		return -EFAULT;
>  	}
>  
> -	/* Let's check if we will get back a huge page backed by hugetlbfs */
> +	/*
> +	 * Let's check if we will get back a huge page backed by hugetlbfs, or
> +	 * get block mapping for device MMIO region.
> +	 */
>  	mmap_read_lock(current->mm);
>  	vma = find_vma_intersection(current->mm, hva, hva + 1);
>  	if (unlikely(!vma)) {
> @@ -780,11 +808,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	if (is_vm_hugetlb_page(vma))
>  		vma_shift = huge_page_shift(hstate_vma(vma));
> +	else if (vma->vm_flags & VM_PFNMAP)
> +		vma_shift = device_rough_page_shift(vma, hva);

What prevents a VMA from having both VM_HUGETLB and VM_PFNMAP? This is
pretty unlikely, but I'd like to see this case catered for.

>  	else
>  		vma_shift = PAGE_SHIFT;
>  
> -	if (logging_active ||
> -	    (vma->vm_flags & VM_PFNMAP)) {
> +	if (logging_active) {
>  		force_pte = true;
>  		vma_shift = PAGE_SHIFT;
>  	}
> @@ -855,7 +884,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  
>  	if (kvm_is_device_pfn(pfn)) {
>  		device = true;
> -		force_pte = true;
> +		force_pte = (vma_pagesize == PAGE_SIZE);

Why do we need to set force_pte if we are already dealing with
PAGE_SIZE? I guess you are doing this for the sake of avoiding the
call to transparent_hugepage_adjust(), right?

I'd rather you simply don't try to upgrade a device mapping by
explicitly checking for this and keep force_pte for *memory*
exclusively.

Santosh, can you please take a look at this series and try to see if
the problem you fixed in [1] (which ended up as commit 91a2c34b7d6f)
is still OK with this series?

>  	} else if (logging_active && !write_fault) {
>  		/*
>  		 * Only actually map the page as writable if this was a write

Thanks,

	M.

[1] https://lore.kernel.org/kvmarm/1603711447-11998-1-git-send-email-sashukla@nvidia.com/

-- 
Without deviation from the norm, progress is not possible.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2021-04-14  9:05 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-14  6:51 [PATCH v3 0/2] kvm/arm64: Try stage2 block mapping for host device MMIO Keqian Zhu
2021-04-14  6:51 ` Keqian Zhu
2021-04-14  6:51 ` Keqian Zhu
2021-04-14  6:51 ` [PATCH v3 1/2] kvm/arm64: Remove the creation time's mapping of MMIO regions Keqian Zhu
2021-04-14  6:51   ` Keqian Zhu
2021-04-14  6:51   ` Keqian Zhu
2021-04-14  6:51 ` [PATCH v3 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO Keqian Zhu
2021-04-14  6:51   ` Keqian Zhu
2021-04-14  6:51   ` Keqian Zhu
2021-04-14  9:05   ` Marc Zyngier [this message]
2021-04-14  9:05     ` Marc Zyngier
2021-04-14  9:05     ` Marc Zyngier
2021-04-15  2:20     ` Keqian Zhu
2021-04-15  2:20       ` Keqian Zhu
2021-04-15  2:20       ` Keqian Zhu
2021-04-15 10:23       ` Marc Zyngier
2021-04-15 10:23         ` Marc Zyngier
2021-04-15 10:23         ` Marc Zyngier
2021-04-15 11:26         ` Keqian Zhu
2021-04-15 11:26           ` Keqian Zhu
2021-04-15 11:26           ` Keqian Zhu
2021-04-15 13:46           ` Marc Zyngier
2021-04-15 13:46             ` Marc Zyngier
2021-04-15 13:46             ` Marc Zyngier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87pmyxme2m.wl-maz@kernel.org \
    --to=maz@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sashukla@nvidia.com \
    --cc=wanghaibin.wang@huawei.com \
    --cc=zhukeqian1@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.