* [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
@ 2022-04-01  9:08 xieming
  2022-04-01 14:19 ` Jason Gunthorpe
  2022-04-01 16:48 ` Marc Zyngier
  0 siblings, 2 replies; 9+ messages in thread
From: xieming @ 2022-04-01  9:08 UTC (permalink / raw)
  To: marc.zyngier, linux, catalin.marinas, will.deacon,
	alex.williamson, sashal
  Cc: xieming, linux-arm-kernel, kvmarm, linux-kernel

When passing through some PCIe devices, such as GPUs (including
NVIDIA and AMD ones), KVM reports the error "Unsupported FSC:
EC=0x24 xFSC=0x21 ESR_EL2=0x92000061". The main reason is that VFIO
maps the VGA memory as Device-nGnRnE and KVM sets the stage-2
memory type to PAGE_S2_DEVICE (Device-nGnRE), while in the guest OS
all device I/O memory, when ioremapped (including the GPU driver's
TTM mappings), is set to MT_NORMAL_NC.

According to the arm64 rules for combining stage-1 and stage-2
memory type attributes, the types are ordered as:
Normal-WB < Normal-WT < Normal-NC < Device-GRE < Device-nGRE <
Device-nGnRE < Device-nGnRnE
where Normal-WB is the weakest and Device-nGnRnE is the strongest;
the stronger type wins when the two stages are combined.

See the 'Arm Architecture Reference Manual Armv8, for Armv8-A
architecture profile', chapter B2.8, and the 'ARM System Memory
Management Unit Architecture Specification, SMMU architecture
version 3.0 and version 3.1', chapter 13.1.5.

Therefore, setting the VM's I/O memory attribute to Device-nGnRE
may be a mistake: it forces every device memory access in the
virtual machine to be aligned.

To summarize: the stage-2 memory type cannot be stronger than the
stage-1 type on the arm64 architecture.
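
As a rough illustration of how the fault is triggered (a sketch only;
bar_start, bar_size, image and len are hypothetical values from a
guest GPU driver), the guest typically maps a prefetchable BAR with
ioremap_wc() and copies into it with a plain memcpy():

	void __iomem *fb = ioremap_wc(bar_start, bar_size); /* stage 1: Normal-NC */

	/*
	 * Fine on bare metal: Normal-NC permits unaligned and merged stores.
	 * Under KVM today, stage 2 forces Device-nGnRE, the combined type is
	 * Device, and such a store traps as an alignment fault (xFSC=0x21),
	 * which KVM reports as "Unsupported FSC".
	 */
	memcpy((void __force *)fb, image, len);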

Signed-off-by: xieming <xieming@kylinos.cn>
---
 arch/arm/include/asm/kvm_mmu.h        |  3 ++-
 arch/arm64/include/asm/kvm_mmu.h      |  3 ++-
 arch/arm64/include/asm/memory.h       |  4 +++-
 arch/arm64/include/asm/pgtable-prot.h |  2 +-
 drivers/vfio/pci/vfio_pci.c           |  7 +++++++
 virt/kvm/arm/mmu.c                    | 19 ++++++++++++++++---
 virt/kvm/arm/vgic/vgic-v2.c           |  2 +-
 7 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 523c499e42db..5c7869d25b62 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -64,7 +64,8 @@ void stage2_unmap_vm(struct kvm *kvm);
 int kvm_alloc_stage2_pgd(struct kvm *kvm);
 void kvm_free_stage2_pgd(struct kvm *kvm);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
-			  phys_addr_t pa, unsigned long size, bool writable);
+			  phys_addr_t pa, unsigned long size,
+			  bool writable, bool writecombine);
 
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
 
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index b2558447c67d..3f98286c7498 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -158,7 +158,8 @@ void stage2_unmap_vm(struct kvm *kvm);
 int kvm_alloc_stage2_pgd(struct kvm *kvm);
 void kvm_free_stage2_pgd(struct kvm *kvm);
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
-			  phys_addr_t pa, unsigned long size, bool writable);
+			  phys_addr_t pa, unsigned long size,
+			  bool writable, bool writecombine);
 
 int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
 
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 56562ff01076..22565cc8ec20 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -158,13 +158,15 @@
  * Memory types for Stage-2 translation
  */
 #define MT_S2_NORMAL		0xf
+#define MT_S2_NORMAL_NC		0x5
 #define MT_S2_DEVICE_nGnRE	0x1
 
 /*
  * Memory types for Stage-2 translation when ID_AA64MMFR2_EL1.FWB is 0001
- * Stage-2 enforces Normal-WB and Device-nGnRE
+ * Stage-2 enforces Normal-WB and Device-nGnRE and Normal-NC
  */
 #define MT_S2_FWB_NORMAL	6
+#define MT_S2_FWB_NORMAL_NC	5
 #define MT_S2_FWB_DEVICE_nGnRE	1
 
 #ifdef CONFIG_ARM64_4K_PAGES
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index af547be1779b..b94e93135fea 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -88,8 +88,8 @@
 	})
 
 #define PAGE_S2			__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN)
+#define PAGE_S2_NC		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL_NC) | PTE_S2_RDONLY | PAGE_S2_XN)
 #define PAGE_S2_DEVICE		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(DEVICE_nGnRE) | PTE_S2_RDONLY | PAGE_S2_XN)
-
 #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
 /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
 #define PAGE_SHARED		__pgprot(_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN | PTE_WRITE)
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 51b791c750f1..6f66efb71743 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1452,7 +1452,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	}
 
 	vma->vm_private_data = vdev;
+#ifdef CONFIG_ARM64
+	if (vfio_pci_is_vga(pdev))
+		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+	else
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+#else
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+#endif
 	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
 	/*
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 11103b75c596..a46a58696834 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -206,6 +206,17 @@ static inline void kvm_pgd_populate(pgd_t *pgdp, pud_t *pudp)
 	dsb(ishst);
 }
 
+/**
+ * is_vma_write_combine - check if VMA is mapped with writecombine or not
+ * Return true if the VMA is mapped with MT_NORMAL_NC, otherwise false
+ */
+static inline bool is_vma_write_combine(struct vm_area_struct *vma)
+{
+	pteval_t pteval = pgprot_val(vma->vm_page_prot);
+
+	return ((pteval & PTE_ATTRINDX_MASK) == PTE_ATTRINDX(MT_NORMAL_NC));
+}
+
 /*
  * Unmapping vs dcache management:
  *
@@ -1198,7 +1209,8 @@ static int stage2_pmdp_test_and_clear_young(pmd_t *pmd)
  * @size:	The size of the mapping
  */
 int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
-			  phys_addr_t pa, unsigned long size, bool writable)
+			  phys_addr_t pa, unsigned long size,
+			  bool writable, bool writecombine)
 {
 	phys_addr_t addr, end;
 	int ret = 0;
@@ -1209,7 +1221,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 	pfn = __phys_to_pfn(pa);
 
 	for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
-		pte_t pte = pfn_pte(pfn, PAGE_S2_DEVICE);
+		pte_t pte = pfn_pte(pfn, writecombine ? PAGE_S2_NC : PAGE_S2_DEVICE);
 
 		if (writable)
 			pte = kvm_s2pte_mkwrite(pte);
@@ -2135,7 +2147,8 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 
 			ret = kvm_phys_addr_ioremap(kvm, gpa, pa,
 						    vm_end - vm_start,
-						    writable);
+						    writable,
+						    is_vma_write_combine(vma));
 			if (ret)
 				break;
 		}
diff --git a/virt/kvm/arm/vgic/vgic-v2.c b/virt/kvm/arm/vgic/vgic-v2.c
index 91b14dfacd1d..bc9f5a9a1fd9 100644
--- a/virt/kvm/arm/vgic/vgic-v2.c
+++ b/virt/kvm/arm/vgic/vgic-v2.c
@@ -352,7 +352,7 @@ int vgic_v2_map_resources(struct kvm *kvm)
 	if (!static_branch_unlikely(&vgic_v2_cpuif_trap)) {
 		ret = kvm_phys_addr_ioremap(kvm, dist->vgic_cpu_base,
 					    kvm_vgic_global_state.vcpu_base,
-					    KVM_VGIC_V2_CPU_SIZE, true);
+					    KVM_VGIC_V2_CPU_SIZE, true, false);
 		if (ret) {
 			kvm_err("Unable to remap VGIC CPU to VCPU\n");
 			goto out;
-- 
2.27.0



* Re: [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
  2022-04-01  9:08 [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64 xieming
@ 2022-04-01 14:19 ` Jason Gunthorpe
  2022-04-01 16:48 ` Marc Zyngier
  1 sibling, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2022-04-01 14:19 UTC (permalink / raw)
  To: xieming
  Cc: marc.zyngier, linux, catalin.marinas, will.deacon,
	alex.williamson, sashal, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Apr 01, 2022 at 05:08:28PM +0800, xieming wrote:
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 51b791c750f1..6f66efb71743 100644
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1452,7 +1452,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	}
>  
>  	vma->vm_private_data = vdev;
> +#ifdef CONFIG_ARM64
> +	if (vfio_pci_is_vga(pdev))
> +		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> +	else
> +		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +#else
>  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +#endif

This is a user-visible change: if VFIO starts making things
write-combining, then userspace has to use different barriers around
MMIO.
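
For instance (a rough arm64 userspace sketch; wqe and db are
hypothetical registers inside a single VFIO-mmap()ed BAR): with a
Device mapping the two stores below reach the device in program
order, but once the mapping becomes write-combining the application
has to add its own barrier before ringing the doorbell:

	#include <stdint.h>

	static void post_work(volatile uint64_t *wqe, volatile uint32_t *db,
			      uint64_t desc)
	{
		*wqe = desc;				/* write the descriptor */

		/* Needed for a Normal-NC (write-combining) mapping, where
		 * stores may be merged or reordered before reaching the
		 * device; redundant for Device-nGnRE. */
		__asm__ volatile("dmb oshst" ::: "memory");

		*db = 1;				/* ring the doorbell */
	}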

Also, this problem is bigger than just GPUs: lots of devices use
write-combining memory for their BARs and will do so inside VMs as
well - so testing for 'pci_is_vga' is also not right.

I think you need to solve this by having userspace somehow request
the cacheability type for the mmap (though I'm not sure how KVM will
know what to do with that), or by having KVM always force
write-combining for all ioremaps.

> +/**
> + * is_vma_write_combine - check if VMA is mapped with writecombine or not
> + * Return true if the VMA is mapped with MT_NORMAL_NC, otherwise false
> + */
> +static inline bool is_vma_write_combine(struct vm_area_struct *vma)
> +{
> +	pteval_t pteval = pgprot_val(vma->vm_page_prot);
> +
> +	return ((pteval & PTE_ATTRINDX_MASK) == PTE_ATTRINDX(MT_NORMAL_NC));
> +}

Shouldn't KVM be copying the exact pgprot bits from the VMA to the
KVM PTEs when it is mirroring them? eg the difference between
pgprot_device and pgprot_noncached() seems relevant to preserve as well.
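
Something along these lines, perhaps (a sketch of the idea only;
kvm_s2_prot_for_vma() is made up, and it only distinguishes two
stage-1 attribute indices):

	static pgprot_t kvm_s2_prot_for_vma(struct vm_area_struct *vma)
	{
		switch (pgprot_val(vma->vm_page_prot) & PTE_ATTRINDX_MASK) {
		case PTE_ATTRINDX(MT_NORMAL_NC):
			return PAGE_S2_NC;	/* added by this patch */
		default:
			return PAGE_S2_DEVICE;	/* Device-nGnRE, as today */
		}
	}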

> @@ -1209,7 +1221,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  	pfn = __phys_to_pfn(pa);
>  
>  	for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
> -		pte_t pte = pfn_pte(pfn, PAGE_S2_DEVICE);
> +		pte_t pte = pfn_pte(pfn, writecombine ? PAGE_S2_NC : PAGE_S2_DEVICE);

Please don't send patches to the mailing list that are against such
old kernels; this code was deleted in 2020.

Jason


* Re: [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
  2022-04-01  9:08 [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64 xieming
  2022-04-01 14:19 ` Jason Gunthorpe
@ 2022-04-01 16:48 ` Marc Zyngier
  2022-04-04 13:24   ` Jason Gunthorpe
  1 sibling, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2022-04-01 16:48 UTC (permalink / raw)
  To: xieming
  Cc: linux, catalin.marinas, will, alex.williamson, sashal, kvmarm,
	linux-arm-kernel, linux-kernel

Xieming,

This is the second time I fix email addresses for you. Next time, I
simply won't bother replying.

On Fri, 01 Apr 2022 10:08:28 +0100,
xieming <xieming@kylinos.cn> wrote:
> 
> When passing through some PCIe devices, such as GPUs (including
> NVIDIA and AMD ones), KVM reports the error "Unsupported FSC:
> EC=0x24 xFSC=0x21 ESR_EL2=0x92000061". The main reason is that VFIO

I have asked you to describe how you get there, and you still haven't
bothered replying.

> maps the VGA memory as Device-nGnRnE and KVM sets the stage-2
> memory type to PAGE_S2_DEVICE (Device-nGnRE), while in the guest OS
> all device I/O memory, when ioremapped (including the GPU driver's
> TTM mappings), is set to MT_NORMAL_NC.
> 
> According to the arm64 rules for combining stage-1 and stage-2
> memory type attributes, the types are ordered as:
> Normal-WB < Normal-WT < Normal-NC < Device-GRE < Device-nGRE <
> Device-nGnRE < Device-nGnRnE
> where Normal-WB is the weakest and Device-nGnRnE is the strongest;
> the stronger type wins when the two stages are combined.
> 
> See the 'Arm Architecture Reference Manual Armv8, for Armv8-A
> architecture profile', chapter B2.8, and the 'ARM System Memory
> Management Unit Architecture Specification, SMMU architecture
> version 3.0 and version 3.1', chapter 13.1.5.
> 
> Therefore, setting the VM's I/O memory attribute to Device-nGnRE
> may be a mistake: it forces every device memory access in the
> virtual machine to be aligned.
> 
> To summarize: the stage-2 memory type cannot be stronger than the
> stage-1 type on the arm64 architecture.

You are plain wrong. It can, and most of the time, it *must*.

> 
> Signed-off-by: xieming <xieming@kylinos.cn>
> ---
>  arch/arm/include/asm/kvm_mmu.h        |  3 ++-
>  arch/arm64/include/asm/kvm_mmu.h      |  3 ++-
>  arch/arm64/include/asm/memory.h       |  4 +++-
>  arch/arm64/include/asm/pgtable-prot.h |  2 +-
>  drivers/vfio/pci/vfio_pci.c           |  7 +++++++
>  virt/kvm/arm/mmu.c                    | 19 ++++++++++++++++---
>  virt/kvm/arm/vgic/vgic-v2.c           |  2 +-
>  7 files changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 523c499e42db..5c7869d25b62 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h

This file was removed from the tree *over two years ago*.

> @@ -64,7 +64,8 @@ void stage2_unmap_vm(struct kvm *kvm);
>  int kvm_alloc_stage2_pgd(struct kvm *kvm);
>  void kvm_free_stage2_pgd(struct kvm *kvm);
>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> -			  phys_addr_t pa, unsigned long size, bool writable);
> +			  phys_addr_t pa, unsigned long size,
> +			  bool writable, bool writecombine);
>  
>  int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>  
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index b2558447c67d..3f98286c7498 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -158,7 +158,8 @@ void stage2_unmap_vm(struct kvm *kvm);
>  int kvm_alloc_stage2_pgd(struct kvm *kvm);
>  void kvm_free_stage2_pgd(struct kvm *kvm);
>  int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> -			  phys_addr_t pa, unsigned long size, bool writable);
> +			  phys_addr_t pa, unsigned long size,
> +			  bool writable, bool writecombine);

NAK. For a start, there is no such thing as 'write-combine' in the ARM
architecture, and I'm not convinced you can equate WC to Normal-NC.
See the previous discussion at [1].

[1] https://lore.kernel.org/r/20210429162906.32742-1-sdonthineni@nvidia.com

[...]

> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 51b791c750f1..6f66efb71743 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1452,7 +1452,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	}
>  
>  	vma->vm_private_data = vdev;
> +#ifdef CONFIG_ARM64
> +	if (vfio_pci_is_vga(pdev))
> +		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> +	else
> +		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

No. That's completely unacceptable. Who says that some VGA (who the
hell implements VGA these days?) implies any sort of attribute other
than device memory? This may work for your particular device under
your own circumstances. Can it be generalised? No.

And as Jason pointed out, this is likely to break userspace.

> +#else
>  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +#endif
>  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
>  
>  	/*
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 11103b75c596..a46a58696834 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -206,6 +206,17 @@ static inline void kvm_pgd_populate(pgd_t *pgdp, pud_t *pudp)
>  	dsb(ishst);
>  }
>  
> +/**
> + * is_vma_write_combine - check if VMA is mapped with writecombine or not
> + * Return true if the VMA is mapped with MT_NORMAL_NC, otherwise false
> + */
> +static inline bool is_vma_write_combine(struct vm_area_struct *vma)
> +{
> +	pteval_t pteval = pgprot_val(vma->vm_page_prot);
> +
> +	return ((pteval & PTE_ATTRINDX_MASK) == PTE_ATTRINDX(MT_NORMAL_NC));
> +}

Again, you are making tons of assumptions here, none of which are
acceptable as is.

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
  2022-04-01 16:48 ` Marc Zyngier
@ 2022-04-04 13:24   ` Jason Gunthorpe
  2022-04-04 14:47     ` Marc Zyngier
  0 siblings, 1 reply; 9+ messages in thread
From: Jason Gunthorpe @ 2022-04-04 13:24 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: xieming, sashal, catalin.marinas, linux, linux-kernel,
	alex.williamson, will, kvmarm, linux-arm-kernel

On Fri, Apr 01, 2022 at 05:48:59PM +0100, Marc Zyngier wrote:

> NAK. For a start, there is no such thing as 'write-combine' in the ARM
> architecture, and I'm not convinced you can equate WC to Normal-NC.
> See the previous discussion at [1].
> 
> [1] https://lore.kernel.org/r/20210429162906.32742-1-sdonthineni@nvidia.com

We've had a lot of discussions with ARM related to how this works with
drivers like mlx5 that use WC.

ARM has now published some guidance on this:

https://community.arm.com/arm-research/m/resources/1012

As an ecosystem we seem to be drifting toward Normal-NC for this
behavior (largely because it is what Linux does). At least that is
what we are testing and qualifying ARM CPUs against mlx5 with.

I'm guessing it will turn into an SBSA-like thing where the ARM ARM is
kind of vague but an SoC has to implement Normal-NC in a certain way to
be functional for the server market.

Jason


* Re: [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
  2022-04-04 13:24   ` Jason Gunthorpe
@ 2022-04-04 14:47     ` Marc Zyngier
  2022-04-04 17:02       ` Jason Gunthorpe
  0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2022-04-04 14:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: xieming, sashal, catalin.marinas, linux, linux-kernel,
	alex.williamson, will, kvmarm, linux-arm-kernel

On Mon, 04 Apr 2022 14:24:05 +0100,
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Fri, Apr 01, 2022 at 05:48:59PM +0100, Marc Zyngier wrote:
> 
> > NAK. For a start, there is no such thing as 'write-combine' in the ARM
> > architecture, and I'm not convinced you can equate WC to Normal-NC.
> > See the previous discussion at [1].
> > 
> > [1] https://lore.kernel.org/r/20210429162906.32742-1-sdonthineni@nvidia.com
> 
> We've had a lot of discussions with ARM related to how this works with
> drivers like mlx5 that use WC.
> 
> ARM has now published some guidance on this:
> 
> https://community.arm.com/arm-research/m/resources/1012

Nicely buried where nobody would dare look.

> 
> As an ecosystem we seem to be drifting toward Normal-NC for this
> behavior (largely because it is what Linux does). At least that is
> what we are testing and qualifying ARM CPUs against mlx5 with.
> 
> I'm guessing it will turn into an SBSA-like thing where the ARM ARM is
> kind of vague but an SoC has to implement Normal-NC in a certain way to
> be functional for the server market.

The main issue is that this equivalence isn't architected, so people
can build whatever they want. SBSA means nothing to KVM (or Linux at
large), and there is currently no way to describe which devices are
safe to map as Normal-NC vs Device.

We either have to take userspace's word for it, or rely on some other
heuristics (do this for PCIe, but not anything else). None of which
are entirely safe. Not to mention that no currently available CPU
implements FEAT_DGH.

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
  2022-04-04 14:47     ` Marc Zyngier
@ 2022-04-04 17:02       ` Jason Gunthorpe
  2022-04-05 15:27         ` Marc Zyngier
  2022-04-06  8:17         ` Tian, Kevin
  0 siblings, 2 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2022-04-04 17:02 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: xieming, sashal, catalin.marinas, linux, linux-kernel,
	alex.williamson, will, kvmarm, linux-arm-kernel

On Mon, Apr 04, 2022 at 03:47:11PM +0100, Marc Zyngier wrote:
> > I'm guessing it will turn into an SBSA-like thing where the ARM ARM is
> > kind of vague but an SoC has to implement Normal-NC in a certain way to
> > be functional for the server market.
> 
> The main issue is that this equivalence isn't architected, so people
> can build whatever they want. SBSA means nothing to KVM (or Linux at
> large), and there is currently no way to describe which devices are
> safe to map as Normal-NC vs Device.

And people have, we know of some ARM SOC's that don't work fully with
NORMAL_NC for this usage. That is already a problem for baremetal
Linux, let alone KVM..

That is why I likened it to SBSA - if you want to build a server SOC
that works with existing server software, you have to support
NORMAL_NC in this way. Even if it isn't architected.

The KVM challenge, at least, is to support a CPU with working
NORMAL_NC to create a VM that emulates the same CPU with working
NORMAL_NC.

I didn't quite understand your other remarks though - is there a
problem here? It seems like yes from the other thread you pointed at?

I would think that KVM should mirror the process page table
configuration into the KVM page table and make this into a userspace
problem?

That turns it into a VFIO problem to negotiate with userspace and set
the proper pgprot. At least VFIO has a better chance than KVM to
consult DT or something to learn about the specific device's
properties.
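
A purely hypothetical userspace view of such a negotiation
(VFIO_REGION_INFO_FLAG_WC does not exist today, and device_fd/bar are
assumed to be set up elsewhere; this only shows the shape of the
interface):

	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_BAR0_REGION_INDEX,
	};

	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);

	/* hypothetical flag: the kernel says Normal-NC is safe for this BAR */
	if (info.flags & VFIO_REGION_INFO_FLAG_WC)
		bar = mmap(NULL, info.size, PROT_READ | PROT_WRITE,
			   MAP_SHARED, device_fd, info.offset);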

I don't know how VFIO/qemu/etc can make this all work automatically
correctly 100% of the time. It seems to me it is the same problem as
just basic baremetal "WC" is troubled on ARM in general today. Maybe
some tables and a command line option in qemu is the best we can hope
for.

Long ago I asked that the ARM folks come up with some Linux
definition of all the WC-like modes and some arch calls to indicate
which one(s) should be used. Nobody seemed interested in doing that,
so the above SOC was left non-working in mainline Linux.

> We either have to take userspace's word for it, or rely on some other
> heuristics (do this for PCIe, but not anything else). None of which
> are entirely safe. Not to mention that no currently available CPU
> implements FEAT_DGH.

DGH is an optimization, not a functional requirement. Currently
available CPUs use one of the more expensive barriers that are
architected to include DGH behavior.

In any event, this is an important optimization. It is why ARMv9 is
introducing a new instruction specifically to optimize it.

Jason


* Re: [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
  2022-04-04 17:02       ` Jason Gunthorpe
@ 2022-04-05 15:27         ` Marc Zyngier
  2022-04-05 16:51           ` Jason Gunthorpe
  2022-04-06  8:17         ` Tian, Kevin
  1 sibling, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2022-04-05 15:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: xieming, sashal, catalin.marinas, linux, linux-kernel,
	alex.williamson, will, kvmarm, linux-arm-kernel

On Mon, 04 Apr 2022 18:02:02 +0100,
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, Apr 04, 2022 at 03:47:11PM +0100, Marc Zyngier wrote:
> > > I'm guessing it will turn into an SBSA-like thing where the ARM ARM is
> > > kind of vague but an SoC has to implement Normal-NC in a certain way to
> > > be functional for the server market.
> > 
> > The main issue is that this equivalence isn't architected, so people
> > can build whatever they want. SBSA means nothing to KVM (or Linux at
> > large), and there is currently no way to describe which devices are
> > safe to map as Normal-NC vs Device.
> 
> And people have, we know of some ARM SOC's that don't work fully with
> NORMAL_NC for this usage. That is already a problem for baremetal
> Linux, let alone KVM..
> 
> That is why I likened it to SBSA - if you want to build a server SOC
> that works with existing server software, you have to support
> NORMAL_NC in this way. Even if it isn't architected.

I see it the other way around. If it isn't architected (and in this
case not even detectable in a scalable way), it simply isn't
supportable by SW.

> The KVM challenge, at least, is to support a CPU with working
> NORMAL_NC to create a VM that emulates the same CPU with working
> NORMAL_NC.
> 
> I didn't quite understand your other remarks though - is there a
> problem here? It seems like yes from the other thread you pointed at?

The main issue is that we have no idea what the behaviour is on a
given implementation, and no way to even detect that, for a given
device, NORMAL_NC is a memory type that won't cause any issue.

> I would think that KVM should mirror the process page table
> configuration into the KVM page table and make this into a userspace
> problem?

But what makes it safe to do this in the first place? There are tons of
HW out there that will simply issue an SError if you generate an
unaligned access targeting the right device, and letting userspace
decide on this is simply not acceptable.

> That turns it into a VFIO problem to negotiate with userspace and set
> the proper pgprot. At least VFIO has a better chance than KVM to
> consult DT or something to learn about the specific device's
> properties.
> 
> I don't know how VFIO/qemu/etc can make this all work automatically
> correctly 100% of the time. It seems to me it is the same problem as
> just basic baremetal "WC" is troubled on ARM in general today. Maybe
> some tables and a command line option in qemu is the best we can hope
> for.

Having a firmware description of what can be mapped with what
attributes would be pretty useful indeed. Not sure how that scales,
but the platform definitely needs to advertise *something* so that we
can allow userspace to say something.

> 
> Long ago I asked that the ARM folks come up with some Linux
> definition of all the WC-like modes and some arch calls to indicate
> which one(s) should be used. Nobody seemed interested in doing that,
> so the above SOC was left non-working in mainline Linux.
> 
> > We either have to take userspace's word for it, or rely on some other
> > heuristics (do this for PCIe, but not anything else). None of which
> > are entirely safe. Not to mention that no currently available CPU
> > implements FEAT_DGH.
> 
> DGH is an optimization, not a functional requirement. Currently
> available CPUs use one of the more expensive barriers that are
> architected to include DGH behavior.
> 
> In any event, this is an important optimization. It is why ARMv9 is
> introducing a new instruction specifically to optimize it.

ARMv9? No, seems like it was introduced in the v8.7 time frame, and
allowed retroactively from v8.0. N2 has it, but A510 doesn't, while V1
(an ARMv8.3 part) has it. But at least it is slowly creeping into
implementations.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
  2022-04-05 15:27         ` Marc Zyngier
@ 2022-04-05 16:51           ` Jason Gunthorpe
  0 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2022-04-05 16:51 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: xieming, sashal, catalin.marinas, linux, linux-kernel,
	alex.williamson, will, kvmarm, linux-arm-kernel

On Tue, Apr 05, 2022 at 04:27:16PM +0100, Marc Zyngier wrote:
> On Mon, 04 Apr 2022 18:02:02 +0100,
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Mon, Apr 04, 2022 at 03:47:11PM +0100, Marc Zyngier wrote:
> > > > I'm guessing it will turn into an SBSA-like thing where the ARM ARM is
> > > > kind of vague but an SoC has to implement Normal-NC in a certain way to
> > > > be functional for the server market.
> > > 
> > > The main issue is that this equivalence isn't architected, so people
> > > can build whatever they want. SBSA means nothing to KVM (or Linux at
> > > large), and there is currently no way to describe which devices are
> > > safe to map as Normal-NC vs Device.
> > 
> > And people have, we know of some ARM SOC's that don't work fully with
> > NORMAL_NC for this usage. That is already a problem for baremetal
> > Linux, let alone KVM..
> > 
> > That is why I likened it to SBSA - if you want to build a server SOC
> > that works with existing server software, you have to support
> > NORMAL_NC in this way. Even if it isn't architected.
> 
> I see it the other way around. If it isn't architected (and in this
> case not even detectable in a scalable way), it simply isn't
> supportable by SW.

Except the software already supports it. Catalin decided NORMAL_NC
would be how Linux works in 2014 in commit de2db7432917 ("arm64: Make
DMA coherent and strongly ordered mappings not executable").

There are 47 places under drivers/ that call pgprot_writecombine()
already, and if you make a "server" kind of chip you are likely to
encounter these drivers and must support them. Linux has created a
de-facto spec here.

While I agree the current situation in ARM64 is not nice and could be
improved, it has been supported by SW this way for a long time
already.

> > I didn't quite understand your other remarks though - is there a
> > problem here? It seems like yes from the other thread you pointed at?
> 
> The main issue is that we have no idea what the behaviour is on a
> given implementation, and no way to even detect that, for a given
> device, NORMAL_NC is a memory type that won't cause any issue.

I agree with this, but that is a driver problem for calling
pgprot_writecombine(), not a KVM problem. VFIO is just another driver
in this sense.

We already have arch_can_pci_mmap_wc() which is a half attempt to
solve this problem, but ARM64 doesn't wire it up.

We've also gone far enough down this path for long enough that we
can't break all the existing systems that are working this way
already. So I expect any future accommodation would be some FW
indication that NORMAL_NC doesn't work for pgprot_writecombine(),
probably in DT and probably for an embedded-focused chip. Maybe
combined with a quirk list of non-working CPU IDs or something.

Wire it up to arch_can_pci_mmap_wc() and you have something reasonable -
except that none of the 47 drivers actually use this call
today. Sigh.
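
To sketch what wiring it up could look like (an assumption, not a
proposal: the generic fallback in include/linux/pci.h is 0 when an
arch doesn't define the macro, and arm64_wc_is_safe() below is a
hypothetical firmware/quirk check):

	/* arch/arm64/include/asm/pci.h */
	#define arch_can_pci_mmap_wc()	arm64_wc_is_safe()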

Thanks,
Jason


* RE: [PATCH v2] kvm/arm64: fixed passthrough gpu into vm on arm64
  2022-04-04 17:02       ` Jason Gunthorpe
  2022-04-05 15:27         ` Marc Zyngier
@ 2022-04-06  8:17         ` Tian, Kevin
  1 sibling, 0 replies; 9+ messages in thread
From: Tian, Kevin @ 2022-04-06  8:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Marc Zyngier
  Cc: xieming, sashal, catalin.marinas, linux, linux-kernel,
	alex.williamson, will, kvmarm, linux-arm-kernel

> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, April 5, 2022 1:02 AM
> 
> On Mon, Apr 04, 2022 at 03:47:11PM +0100, Marc Zyngier wrote:
> > > I'm guessing it will turn into an SBSA-like thing where the ARM ARM is
> > > kind of vague but an SoC has to implement Normal-NC in a certain way to
> > > be functional for the server market.
> >
> > The main issue is that this equivalence isn't architected, so people
> > can build whatever they want. SBSA means nothing to KVM (or Linux at
> > large), and there is currently no way to describe which devices are
> > safe to map as Normal-NC vs Device.
> 
> And people have, we know of some ARM SOC's that don't work fully with
> NORMAL_NC for this usage. That is already a problem for baremetal
> Linux, let alone KVM..
> 
> That is why I likened it to SBSA - if you want to build a server SOC
> that works with existing server software, you have to support
> NORMAL_NC in this way. Even if it isn't architected.
> 
> The KVM challenge, at least, is to support a CPU with working
> NORMAL_NC to create VM that emulates the same CPU with working
> NORMAL_NC.
> 
> I didn't quite understand your other remarks though - is there a
> problem here? It seems like yes from the other thread you pointed at?
> 
> I would think that KVM should mirror the process page table
> configuration into the KVM page table and make this into a userspace
> problem?
> 
> That turns it into a VFIO problem to negotiate with userspace and set
> the proper pgprot. At least VFIO has a better chance than KVM to
> consult DT or something to learn about the specific device's
> properties.
> 
> I don't know how VFIO/qemu/etc can make this all work automatically
> correctly 100% of the time. It seems to me it is the same problem as
> just basic baremetal "WC" is troubled on ARM in general today. Maybe
> some tables and a command line option in qemu is the best we can hope
> for.
> 

I don't know those ARM details or how they differ from x86, but FYI,
here is how it works on Intel platforms.

If no device is assigned, the KVM page table (EPT) PTE is always set
to 'forced WB', overriding the guest cache attributes.

If a device is assigned, the EPT PTE is set to:

  1) UC for mmio regions. The effective memory type is UC or WC
     depending on guest attributes;

  2) forced WB for memory pages if the device cannot do noncoherent
     dma. Guest cache attributes are overridden (same as no device assigned);

  3) a type making the guest cache attributes effective for memory
     pages if noncoherent dma is possible.

All above logic is contained in vmx_get_mt_mask().
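
A condensed sketch of that decision (the real vmx_get_mt_mask() in
arch/x86/kvm/vmx/vmx.c handles more cases, e.g. assigned-device and
MTRR details):

	static u64 ept_memtype_sketch(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
	{
		if (is_mmio)	/* 1) MMIO: guest attributes decide, so UC or WC */
			return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

		if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
			return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
			       VMX_EPT_IPAT_BIT;	/* 2) force WB */

		/* 3) noncoherent DMA possible: honour the guest memory type */
		return kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
		       VMX_EPT_MT_EPTE_SHIFT;
	}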

Thanks
Kevin

