dri-devel Archive on lore.kernel.org
* [RFC PATCH 0/3] KVM: x86: honor guest memory type
@ 2020-02-13 21:30 Chia-I Wu
  2020-02-13 21:30 ` [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask Chia-I Wu
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Chia-I Wu @ 2020-02-13 21:30 UTC (permalink / raw)
  To: kvm
  Cc: wanpengli, joro, dri-devel, gurchetansingh, kraxel, pbonzini,
	vkuznets, jmattson

Hi,

Host GPU drivers like to give userspace WC mappings.  When userspace makes
such a mapping available to a guest, it also tells the guest to create a WC
mapping.  However, even when the guest kernel picks the correct memory type,
the type is ignored because of VMX_EPT_IPAT_BIT on Intel.

This series adds a new flag to KVM_SET_USER_MEMORY_REGION, which tells the
host kernel to honor the guest memory type for the memslot.  An alternative
fix is for KVM to unconditionally honor the guest memory type (unless it is
MMIO, to avoid MCEs on Intel).  I believe the alternative fix is how things
are on ARM, and probably also how things are on AMD.

I am new to KVM and HW virtualization technologies.  This series is meant as
an RFC.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask
  2020-02-13 21:30 [RFC PATCH 0/3] KVM: x86: honor guest memory type Chia-I Wu
@ 2020-02-13 21:30 ` Chia-I Wu
  2020-02-14  9:36   ` Paolo Bonzini
  2020-02-13 21:30 ` [RFC PATCH 2/3] RFC: KVM: add KVM_MEM_DMA Chia-I Wu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-13 21:30 UTC (permalink / raw)
  To: kvm
  Cc: wanpengli, joro, dri-devel, gurchetansingh, kraxel, pbonzini,
	vkuznets, jmattson

Better reflect the structure of the code and mention why we cannot
always honor the guest.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Gerd Hoffmann <kraxel@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3be25ecae145..266ef87042da 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6854,17 +6854,24 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	u8 cache;
 	u64 ipat = 0;
 
-	/* For VT-d and EPT combination
-	 * 1. MMIO: always map as UC
-	 * 2. EPT with VT-d:
-	 *   a. VT-d without snooping control feature: can't guarantee the
-	 *	result, try to trust guest.
-	 *   b. VT-d with snooping control feature: snooping control feature of
-	 *	VT-d engine can guarantee the cache correctness. Just set it
-	 *	to WB to keep consistent with host. So the same as item 3.
-	 * 3. EPT without VT-d: always map as WB and set IPAT=1 to keep
-	 *    consistent with host MTRR
+	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
+	 * memory aliases with conflicting memory types and sometimes MCEs.
+	 * We have to be careful about which of these are honored and when.
+	 *
+	 * For MMIO, guest CD/MTRR are ignored.  The EPT memory type is set to
+	 * UC.  The effective memory type is UC or WC depending on guest PAT.
+	 * This was historically the source of MCEs and we want to be
+	 * conservative.
+	 *
+	 * When there is no need to deal with noncoherent DMA (e.g., no VT-d
+	 * or VT-d has snoop control), guest CD/MTRR/PAT are all ignored.  The
+	 * EPT memory type is set to WB.  The effective memory type is forced
+	 * WB.
+	 *
+	 * Otherwise, we trust the guest.  Guest CD/MTRR/PAT are all honored.  The
+	 * EPT memory type is used to emulate guest CD/MTRR.
 	 */
+
 	if (is_mmio) {
 		cache = MTRR_TYPE_UNCACHABLE;
 		goto exit;
-- 
2.25.0.265.gbab2e86ba0-goog


* [RFC PATCH 2/3] RFC: KVM: add KVM_MEM_DMA
  2020-02-13 21:30 [RFC PATCH 0/3] KVM: x86: honor guest memory type Chia-I Wu
  2020-02-13 21:30 ` [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask Chia-I Wu
@ 2020-02-13 21:30 ` Chia-I Wu
  2020-02-13 21:30 ` [RFC PATCH 3/3] RFC: KVM: x86: support KVM_CAP_DMA_MEM Chia-I Wu
  2020-02-13 21:41 ` [RFC PATCH 0/3] KVM: x86: honor guest memory type Paolo Bonzini
  3 siblings, 0 replies; 30+ messages in thread
From: Chia-I Wu @ 2020-02-13 21:30 UTC (permalink / raw)
  To: kvm
  Cc: wanpengli, joro, dri-devel, gurchetansingh, kraxel, pbonzini,
	vkuznets, jmattson

When the flag is set, it means that the userspace wants to do DMA
with the memory and that the guest will use an appropriate memory type
to access the memory.  The kernel should be prepared to honor the
guest's memory type.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Gerd Hoffmann <kraxel@redhat.com>
---
 Documentation/virt/kvm/api.rst | 17 +++++++++++------
 include/uapi/linux/kvm.h       |  2 ++
 virt/kvm/kvm_main.c            |  6 +++++-
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 97a72a53fa4b..e6a27e6e45c2 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1237,6 +1237,7 @@ yet and must be cleared on entry.
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_DMA		(1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1264,12 +1265,16 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The flags field supports these flags: KVM_MEM_LOG_DIRTY_PAGES,
+KVM_MEM_READONLY, and KVM_MEM_DMA.  KVM_MEM_LOG_DIRTY_PAGES can be set to
+instruct KVM to keep track of writes to memory within the slot.  See
+KVM_GET_DIRTY_LOG ioctl to know how to use it.  KVM_MEM_READONLY can be set,
+if KVM_CAP_READONLY_MEM capability allows it, to make a new slot read-only.
+In this case, writes to this memory will be posted to userspace as
+KVM_EXIT_MMIO exits.  KVM_MEM_DMA can be set, if KVM_CAP_DMA_MEM capability
+allows it, to make a new slot support DMA.  It is the userspace's
+responsibility to make sure userspace_addr points at DMA-able memory and the
+guest's responsibility to map guest_phys_addr with the proper memory type.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4b95f9a31a2f..578292e4b072 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -109,6 +109,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_DMA		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1010,6 +1011,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_NISV_TO_USER 177
 #define KVM_CAP_ARM_INJECT_EXT_DABT 178
 #define KVM_CAP_S390_VCPU_RESETS 179
+#define KVM_CAP_DMA_MEM 180
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 70f03ce0e5c1..a4b6c782a168 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -940,6 +940,9 @@ static int check_memory_region_flags(const struct kvm_userspace_memory_region *m
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
+#ifdef __KVM_HAVE_DMA_MEM
+	valid_flags |= KVM_MEM_DMA;
+#endif
 
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
@@ -1047,7 +1050,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		else { /* Modify an existing slot. */
 			if ((mem->userspace_addr != old.userspace_addr) ||
 			    (npages != old.npages) ||
-			    ((new.flags ^ old.flags) & KVM_MEM_READONLY))
+			    ((new.flags ^ old.flags) &
+			     (KVM_MEM_READONLY | KVM_MEM_DMA)))
 				goto out;
 
 			if (base_gfn != old.base_gfn)
-- 
2.25.0.265.gbab2e86ba0-goog


* [RFC PATCH 3/3] RFC: KVM: x86: support KVM_CAP_DMA_MEM
  2020-02-13 21:30 [RFC PATCH 0/3] KVM: x86: honor guest memory type Chia-I Wu
  2020-02-13 21:30 ` [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask Chia-I Wu
  2020-02-13 21:30 ` [RFC PATCH 2/3] RFC: KVM: add KVM_MEM_DMA Chia-I Wu
@ 2020-02-13 21:30 ` Chia-I Wu
  2020-02-13 21:41 ` [RFC PATCH 0/3] KVM: x86: honor guest memory type Paolo Bonzini
  3 siblings, 0 replies; 30+ messages in thread
From: Chia-I Wu @ 2020-02-13 21:30 UTC (permalink / raw)
  To: kvm
  Cc: wanpengli, joro, dri-devel, gurchetansingh, kraxel, pbonzini,
	vkuznets, jmattson

When a memslot has KVM_MEM_DMA set, we want VMX_EPT_IPAT_BIT cleared
for the memslot.  Before that is possible, simply call
kvm_arch_register_noncoherent_dma for the memslot.

SVM does not have the ignore-pat bit.  Guest PAT is always honored.

Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Gerd Hoffmann <kraxel@redhat.com>
---
 arch/x86/include/uapi/asm/kvm.h | 1 +
 arch/x86/kvm/x86.c              | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 503d3f42da16..578b686e3880 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -48,6 +48,7 @@
 #define __KVM_HAVE_XSAVE
 #define __KVM_HAVE_XCRS
 #define __KVM_HAVE_READONLY_MEM
+#define __KVM_HAVE_DMA_MEM
 
 /* Architectural interrupt line count. */
 #define KVM_NR_INTERRUPTS 256
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb5d64ebc35d..c89a4647fef6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3331,6 +3331,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_GET_TSC_KHZ:
 	case KVM_CAP_KVMCLOCK_CTRL:
 	case KVM_CAP_READONLY_MEM:
+	case KVM_CAP_DMA_MEM:
 	case KVM_CAP_HYPERV_TIME:
 	case KVM_CAP_IOAPIC_POLARITY_IGNORED:
 	case KVM_CAP_TSC_DEADLINE_TIMER:
@@ -10045,6 +10046,11 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 	 */
 	if (change != KVM_MR_DELETE)
 		kvm_mmu_slot_apply_flags(kvm, (struct kvm_memory_slot *) new);
+
+	if (change == KVM_MR_CREATE && new->flags & KVM_MEM_DMA)
+		kvm_arch_register_noncoherent_dma(kvm);
+	else if (change == KVM_MR_DELETE && old->flags & KVM_MEM_DMA)
+		kvm_arch_unregister_noncoherent_dma(kvm);
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
-- 
2.25.0.265.gbab2e86ba0-goog


* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-13 21:30 [RFC PATCH 0/3] KVM: x86: honor guest memory type Chia-I Wu
                   ` (2 preceding siblings ...)
  2020-02-13 21:30 ` [RFC PATCH 3/3] RFC: KVM: x86: support KVM_CAP_DMA_MEM Chia-I Wu
@ 2020-02-13 21:41 ` Paolo Bonzini
  2020-02-13 22:18   ` Chia-I Wu
  3 siblings, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-13 21:41 UTC (permalink / raw)
  To: Chia-I Wu, kvm
  Cc: wanpengli, joro, dri-devel, gurchetansingh, kraxel, vkuznets, jmattson

On 13/02/20 22:30, Chia-I Wu wrote:
> Hi,
> 
> Host GPU drivers like to give userspace WC mapping.  When the userspace makes
> the mapping available to a guest, it also tells the guest to create a WC
> mapping.  However, even when the guest kernel picks the correct memory type,
> it gets ignored because of VMX_EPT_IPAT_BIT on Intel.
> 
> This series adds a new flag to KVM_SET_USER_MEMORY_REGION, which tells the
> host kernel to honor the guest memory type for the memslot.  An alternative
> fix is for KVM to unconditionally honor the guest memory type (unless it is
> MMIO, to avoid MCEs on Intel).  I believe the alternative fix is how things
> are on ARM, and probably also how things are on AMD.
> 
> I am new to KVM and HW virtualization technologies.  This series is meant as
> an RFC.
> 

When we tried to do this in the past, we got machine checks everywhere
unfortunately due to the same address being mapped with different memory
types.  Unfortunately I cannot find the entry anymore in bugzilla, but
this was not fixed as far as I know.

Paolo


* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-13 21:41 ` [RFC PATCH 0/3] KVM: x86: honor guest memory type Paolo Bonzini
@ 2020-02-13 22:18   ` Chia-I Wu
  2020-02-14 10:26     ` Paolo Bonzini
  0 siblings, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-13 22:18 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Gerd Hoffmann, vkuznets, jmattson

On Thu, Feb 13, 2020 at 1:41 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 13/02/20 22:30, Chia-I Wu wrote:
> > Hi,
> >
> > Host GPU drivers like to give userspace WC mapping.  When the userspace makes
> > the mapping available to a guest, it also tells the guest to create a WC
> > mapping.  However, even when the guest kernel picks the correct memory type,
> > it gets ignored because of VMX_EPT_IPAT_BIT on Intel.
> >
> > This series adds a new flag to KVM_SET_USER_MEMORY_REGION, which tells the
> > host kernel to honor the guest memory type for the memslot.  An alternative
> > fix is for KVM to unconditionally honor the guest memory type (unless it is
> > MMIO, to avoid MCEs on Intel).  I believe the alternative fix is how things
> > are on ARM, and probably also how things are on AMD.
> >
> > I am new to KVM and HW virtualization technologies.  This series is meant as
> > an RFC.
> >
>
> When we tried to do this in the past, we got machine checks everywhere
> unfortunately due to the same address being mapped with different memory
> types.  Unfortunately I cannot find the entry anymore in bugzilla, but
> this was not fixed as far as I know.
Yeah, I did a bit of history digging here

  https://gitlab.freedesktop.org/virgl/virglrenderer/issues/151#note_372594

The bug you mentioned was probably this one

  https://bugzilla.kernel.org/show_bug.cgi?id=104091

which was caused by

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd717f11015f673487ffc826e59b2bad69d20fe5

From what I can tell, the commit allowed the guests to create cached
mappings to MMIO regions and caused MCEs.  That is different than what
I need, which is to allow guests to create uncached mappings to system
ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has uncached
mappings.  But it is true that this still allows the userspace & guest
kernel to create conflicting memory types.

Implementation-wise, the current implementation uses
kvm_arch_register_noncoherent_dma.  It essentially treats a memslot
with the new flag as a non-coherent vfio device as far as mmu is
concerned.

>
> Paolo
>

* Re: [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask
  2020-02-13 21:30 ` [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask Chia-I Wu
@ 2020-02-14  9:36   ` Paolo Bonzini
  0 siblings, 0 replies; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-14  9:36 UTC (permalink / raw)
  To: Chia-I Wu, kvm
  Cc: wanpengli, joro, dri-devel, gurchetansingh, kraxel, vkuznets, jmattson

On 13/02/20 22:30, Chia-I Wu wrote:
> Better reflect the structure of the code and mention why we cannot
> always honor the guest.
> 
> Signed-off-by: Chia-I Wu <olvaffe@gmail.com>
> Cc: Gurchetan Singh <gurchetansingh@chromium.org>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 27 +++++++++++++++++----------
>  1 file changed, 17 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3be25ecae145..266ef87042da 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6854,17 +6854,24 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  	u8 cache;
>  	u64 ipat = 0;
>  
> -	/* For VT-d and EPT combination
> -	 * 1. MMIO: always map as UC
> -	 * 2. EPT with VT-d:
> -	 *   a. VT-d without snooping control feature: can't guarantee the
> -	 *	result, try to trust guest.
> -	 *   b. VT-d with snooping control feature: snooping control feature of
> -	 *	VT-d engine can guarantee the cache correctness. Just set it
> -	 *	to WB to keep consistent with host. So the same as item 3.
> -	 * 3. EPT without VT-d: always map as WB and set IPAT=1 to keep
> -	 *    consistent with host MTRR
> +	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> +	 * memory aliases with conflicting memory types and sometimes MCEs.
> +	 * We have to be careful about which of these are honored and when.
> +	 *
> +	 * For MMIO, guest CD/MTRR are ignored.  The EPT memory type is set to
> +	 * UC.  The effective memory type is UC or WC depending on guest PAT.
> +	 * This was historically the source of MCEs and we want to be
> +	 * conservative.
> +	 *
> +	 * When there is no need to deal with noncoherent DMA (e.g., no VT-d
> +	 * or VT-d has snoop control), guest CD/MTRR/PAT are all ignored.  The
> +	 * EPT memory type is set to WB.  The effective memory type is forced
> +	 * WB.
> +	 *
> +	 * Otherwise, we trust the guest.  Guest CD/MTRR/PAT are all honored.  The
> +	 * EPT memory type is used to emulate guest CD/MTRR.
>  	 */
> +
>  	if (is_mmio) {
>  		cache = MTRR_TYPE_UNCACHABLE;
>  		goto exit;
> 

This is certainly an improvement, especially the part that points out
how guest PAT still allows MMIO to be handled as WC.

Thanks,

Paolo


* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-13 22:18   ` Chia-I Wu
@ 2020-02-14 10:26     ` Paolo Bonzini
  2020-02-14 19:52       ` Sean Christopherson
  2020-02-14 21:15       ` Chia-I Wu
  0 siblings, 2 replies; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-14 10:26 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Sean Christopherson, Gerd Hoffmann, vkuznets, jmattson

On 13/02/20 23:18, Chia-I Wu wrote:
> 
> The bug you mentioned was probably this one
> 
>   https://bugzilla.kernel.org/show_bug.cgi?id=104091

Yes, indeed.

> From what I can tell, the commit allowed the guests to create cached
> mappings to MMIO regions and caused MCEs.  That is different than what
> I need, which is to allow guests to create uncached mappings to system
> ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has uncached
> mappings.  But it is true that this still allows the userspace & guest
> kernel to create conflicting memory types.

Right, the question is whether the MCEs were tied to MMIO regions 
specifically and if so why.

An interesting remark is in the footnote of table 11-7 in the SDM.  
There, for the MTRR (EPT for us) memory type UC you can read:

  The UC attribute comes from the MTRRs and the processors are not 
  required to snoop their caches since the data could never have
  been cached. This attribute is preferred for performance reasons.

There are two possibilities:

1) the footnote doesn't apply to UC mode coming from EPT page tables.
That would make your change safe.

2) the footnote also applies when the UC attribute comes from the EPT
page tables rather than the MTRRs.  In that case, the host should use
UC as the EPT page attribute if and only if it's consistent with the host
MTRRs; it would be more or less impossible to honor UC in the guest MTRRs.
In that case, something like the patch below would be needed.

It is not clear from the manual why the footnote would not apply to WC; that
is, the manual doesn't say explicitly that the processor does not do snooping
for accesses to WC memory.  But I guess that must be the case, which is why I
used MTRR_TYPE_WRCOMB in the patch below.

Either way, we would have an explanation of why creating cached mappings to
MMIO regions would cause MCEs, and why in practice we're not seeing MCEs for
guest RAM (the guest would have set WB for that memory in its MTRRs, not UC).

One thing you didn't say: how would userspace use KVM_MEM_DMA?  On which
regions would it be set?

Thanks,

Paolo

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dc331fb06495..2be6f7effa1d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6920,8 +6920,16 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	}
 
 	cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
-
 exit:
+	if (cache == MTRR_TYPE_UNCACHABLE && !is_mmio) {
+		/*
+		 * We cannot set UC in the EPT page tables as it can cause
+		 * machine check exceptions (??).  Hopefully the guest is
+		 * using PAT.
+		 */
+		cache = MTRR_TYPE_WRCOMB;
+	}
+
 	return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
 }
 


* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-14 10:26     ` Paolo Bonzini
@ 2020-02-14 19:52       ` Sean Christopherson
  2020-02-14 21:47         ` Chia-I Wu
  2020-02-14 21:15       ` Chia-I Wu
  1 sibling, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2020-02-14 19:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Gerd Hoffmann, vkuznets, jmattson

On Fri, Feb 14, 2020 at 11:26:06AM +0100, Paolo Bonzini wrote:
> On 13/02/20 23:18, Chia-I Wu wrote:
> > 
> > The bug you mentioned was probably this one
> > 
> >   https://bugzilla.kernel.org/show_bug.cgi?id=104091
> 
> Yes, indeed.
> 
> > From what I can tell, the commit allowed the guests to create cached
> > mappings to MMIO regions and caused MCEs.  That is different than what
> > I need, which is to allow guests to create uncached mappings to system
> > ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has uncached
> > mappings.  But it is true that this still allows the userspace & guest
> > kernel to create conflicting memory types.

This is ok. 

> Right, the question is whether the MCEs were tied to MMIO regions 
> specifically and if so why.

99.99999% likelihood the answer is "yes".  Cacheable accesses to non-existent
memory and most (all?) MMIO regions will cause a #MC.  This includes
speculative accesses.

Commit fd717f11015f ("KVM: x86: apply guest MTRR virtualization on host
reserved pages") explicitly had a comment "1. MMIO: trust guest MTRR",
which is basically a direct avenue to generating #MCs.

IIRC, WC accesses to non-existent memory will also cause #MC, but KVM has
bigger problems if it has PRESENT EPTEs pointing at garbage. 

> An interesting remark is in the footnote of table 11-7 in the SDM.  
> There, for the MTRR (EPT for us) memory type UC you can read:
> 
>   The UC attribute comes from the MTRRs and the processors are not 
>   required to snoop their caches since the data could never have
>   been cached. This attribute is preferred for performance reasons.
> 
> There are two possibilities:
> 
> 1) the footnote doesn't apply to UC mode coming from EPT page tables.
> That would make your change safe.
> 
> 2) the footnote also applies when the UC attribute comes from the EPT
> page tables rather than the MTRRs.  In that case, the host should use
> UC as the EPT page attribute if and only if it's consistent with the host
> MTRRs; it would be more or less impossible to honor UC in the guest MTRRs.
> In that case, something like the patch below would be needed.

(2), the EPTs effectively replace the MTRRs.  The expectation being that
the VMM will always use EPT memtypes consistent with the MTRRs.

> It is not clear from the manual why the footnote would not apply to WC; that
> is, the manual doesn't say explicitly that the processor does not do snooping
> for accesses to WC memory.  But I guess that must be the case, which is why I
> used MTRR_TYPE_WRCOMB in the patch below.

A few paragraphs below table 11-12 states:

  In particular, a WC page must never be aliased to a cacheable page because
  WC writes may not check the processor caches.

> Either way, we would have an explanation of why creating cached mappings to
> MMIO regions would cause MCEs, and why in practice we're not seeing MCEs for
> guest RAM (the guest would have set WB for that memory in its MTRRs, not UC).

Aliasing (physical) RAM with different memtypes won't cause #MC, just
memory corruption.
 
> One thing you didn't say: how would userspace use KVM_MEM_DMA?  On which
> regions would it be set?
> 
> Thanks,
> 
> Paolo
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index dc331fb06495..2be6f7effa1d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6920,8 +6920,16 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  	}
>  
>  	cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> -
>  exit:
> +	if (cache == MTRR_TYPE_UNCACHABLE && !is_mmio) {
> +		/*
> +		 * We cannot set UC in the EPT page tables as it can cause
> +		 * machine check exceptions (??).  Hopefully the guest is
> +		 * using PAT.
> +		 */
> +		cache = MTRR_TYPE_WRCOMB;

This is unnecessary.  Setting UC in the EPT tables will never directly lead
to #MC.  Forcing WC is likely more dangerous.

> +	}
> +
>  	return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
>  }
>  
> 

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-14 10:26     ` Paolo Bonzini
  2020-02-14 19:52       ` Sean Christopherson
@ 2020-02-14 21:15       ` Chia-I Wu
  2020-02-19 10:00         ` Tian, Kevin
  1 sibling, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-14 21:15 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Sean Christopherson, Gerd Hoffmann, vkuznets, jmattson

On Fri, Feb 14, 2020 at 2:26 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 13/02/20 23:18, Chia-I Wu wrote:
> >
> > The bug you mentioned was probably this one
> >
> >   https://bugzilla.kernel.org/show_bug.cgi?id=104091
>
> Yes, indeed.
>
> > From what I can tell, the commit allowed the guests to create cached
> > mappings to MMIO regions and caused MCEs.  That is different than what
> > I need, which is to allow guests to create uncached mappings to system
> > ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has uncached
> > mappings.  But it is true that this still allows the userspace & guest
> > kernel to create conflicting memory types.
>
> Right, the question is whether the MCEs were tied to MMIO regions
> specifically and if so why.
>
> An interesting remark is in the footnote of table 11-7 in the SDM.
> There, for the MTRR (EPT for us) memory type UC you can read:
>
>   The UC attribute comes from the MTRRs and the processors are not
>   required to snoop their caches since the data could never have
>   been cached. This attribute is preferred for performance reasons.
>
> There are two possibilities:
>
> 1) the footnote doesn't apply to UC mode coming from EPT page tables.
> That would make your change safe.
>
> 2) the footnote also applies when the UC attribute comes from the EPT
> page tables rather than the MTRRs.  In that case, the host should use
> UC as the EPT page attribute if and only if it's consistent with the host
> MTRRs; it would be more or less impossible to honor UC in the guest MTRRs.
> In that case, something like the patch below would be needed.
>
> It is not clear from the manual why the footnote would not apply to WC; that
> is, the manual doesn't say explicitly that the processor does not do snooping
> for accesses to WC memory.  But I guess that must be the case, which is why I
> used MTRR_TYPE_WRCOMB in the patch below.
>
> Either way, we would have an explanation of why creating cached mappings to
> MMIO regions would cause MCEs, and why in practice we're not seeing MCEs for
> guest RAM (the guest would have set WB for that memory in its MTRRs, not UC).
>
> One thing you didn't say: how would userspace use KVM_MEM_DMA?  On which
> regions would it be set?
It will be set for shmems that are mapped WC.

GPU/DRM drivers allocate shmems as DMA-able gpu buffers and allow the
userspace to map them cached or WC (I915_MMAP_WC or
AMDGPU_GEM_CREATE_CPU_GTT_USWC for example).  When a shmem is mapped
WC and is made available to the guest, we would like the ability to
map the region WC in the guest.


> Thanks,
>
> Paolo
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index dc331fb06495..2be6f7effa1d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6920,8 +6920,16 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>         }
>
>         cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> -
>  exit:
> +       if (cache == MTRR_TYPE_UNCACHABLE && !is_mmio) {
> +               /*
> +                * We cannot set UC in the EPT page tables as it can cause
> +                * machine check exceptions (??).  Hopefully the guest is
> +                * using PAT.
> +                */
> +               cache = MTRR_TYPE_WRCOMB;
> +       }
> +
>         return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
>  }
>
>

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-14 19:52       ` Sean Christopherson
@ 2020-02-14 21:47         ` Chia-I Wu
  2020-02-14 21:56           ` Jim Mattson
  0 siblings, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-14 21:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Gerd Hoffmann, Paolo Bonzini, vkuznets, jmattson

On Fri, Feb 14, 2020 at 11:52 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Fri, Feb 14, 2020 at 11:26:06AM +0100, Paolo Bonzini wrote:
> > On 13/02/20 23:18, Chia-I Wu wrote:
> > >
> > > The bug you mentioned was probably this one
> > >
> > >   https://bugzilla.kernel.org/show_bug.cgi?id=104091
> >
> > Yes, indeed.
> >
> > > From what I can tell, the commit allowed the guests to create cached
> > > mappings to MMIO regions and caused MCEs.  That is different than what
> > > I need, which is to allow guests to create uncached mappings to system
> > > ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has uncached
> > > mappings.  But it is true that this still allows the userspace & guest
> > > kernel to create conflicting memory types.
>
> This is ok.
>
> > Right, the question is whether the MCEs were tied to MMIO regions
> > specifically and if so why.
>
> 99.99999% likelihood the answer is "yes".  Cacheable accesses to non-existent
> memory and most (all?) MMIO regions will cause a #MC.  This includes
> speculative accesses.
>
> Commit fd717f11015f ("KVM: x86: apply guest MTRR virtualization on host
> reserved pages") explicitly had a comment "1. MMIO: trust guest MTRR",
> which is basically a direct avenue to generating #MCs.
>
> IIRC, WC accesses to non-existent memory will also cause #MC, but KVM has
> bigger problems if it has PRESENT EPTEs pointing at garbage.
>
> > An interesting remark is in the footnote of table 11-7 in the SDM.
> > There, for the MTRR (EPT for us) memory type UC you can read:
> >
> >   The UC attribute comes from the MTRRs and the processors are not
> >   required to snoop their caches since the data could never have
> >   been cached. This attribute is preferred for performance reasons.
> >
> > There are two possibilities:
> >
> > 1) the footnote doesn't apply to UC mode coming from EPT page tables.
> > That would make your change safe.
> >
> > 2) the footnote also applies when the UC attribute comes from the EPT
> > page tables rather than the MTRRs.  In that case, the host should use
> > UC as the EPT page attribute if and only if it's consistent with the host
> > MTRRs; it would be more or less impossible to honor UC in the guest MTRRs.
> > In that case, something like the patch below would be needed.
>
> (2), the EPTs effectively replace the MTRRs.  The expectation being that
> > the VMM will always use EPT memtypes consistent with the MTRRs.
This is my understanding as well.

> > It is not clear from the manual why the footnote would not apply to WC; that
> > is, the manual doesn't say explicitly that the processor does not do snooping
> > for accesses to WC memory.  But I guess that must be the case, which is why I
> > used MTRR_TYPE_WRCOMB in the patch below.
>
> A few paragraphs below table 11-12 states:
>
>   In particular, a WC page must never be aliased to a cacheable page because
>   WC writes may not check the processor caches.
>
> > Either way, we would have an explanation of why creating cached mapping to
> > MMIO regions would, and why in practice we're not seeing MCEs for guest RAM
> > (the guest would have set WB for that memory in its MTRRs, not UC).
>
> Aliasing (physical) RAM with different memtypes won't cause #MC, just
> memory corruption.

What we need potentially gives userspace (the guest kernel, to be
exact) the ability to create conflicting memory types.  If we can be
sure that the worst-case scenario is a guest corrupting its own
memory, by only allowing aliases on physical RAM, that seems alright.

AFAICT, it is currently allowed on ARM (verified) and AMD (not
verified, but svm_get_mt_mask returns 0 which supposedly means the NPT
does not restrict what the guest PAT can do).  This diff would do the
trick for Intel without needing any uapi change:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e3394c839dea..88af11cc551a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6905,12 +6905,6 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
                goto exit;
        }

-       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
-               ipat = VMX_EPT_IPAT_BIT;
-               cache = MTRR_TYPE_WRBACK;
-               goto exit;
-       }
-
        if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
                ipat = VMX_EPT_IPAT_BIT;
                if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))



> > One thing you didn't say: how would userspace use KVM_MEM_DMA?  On which
> > regions would it be set?
> >
> > Thanks,
> >
> > Paolo
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index dc331fb06495..2be6f7effa1d 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6920,8 +6920,16 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >       }
> >
> >       cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> > -
> >  exit:
> > +     if (cache == MTRR_TYPE_UNCACHABLE && !is_mmio) {
> > +             /*
> > +              * We cannot set UC in the EPT page tables as it can cause
> > +              * machine check exceptions (??).  Hopefully the guest is
> > +              * using PAT.
> > +              */
> > +             cache = MTRR_TYPE_WRCOMB;
>
> This is unnecessary.  Setting UC in the EPT tables will never directly lead
> to #MC.  Forcing WC is likely more dangerous.
>
> > +     }
> > +
> >       return (cache << VMX_EPT_MT_EPTE_SHIFT) | ipat;
> >  }
> >
> >

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-14 21:47         ` Chia-I Wu
@ 2020-02-14 21:56           ` Jim Mattson
  2020-02-14 22:03             ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Jim Mattson @ 2020-02-14 21:56 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel,
	Sean Christopherson, Gurchetan Singh, Gerd Hoffmann,
	Paolo Bonzini, Vitaly Kuznetsov

On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 11:52 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Fri, Feb 14, 2020 at 11:26:06AM +0100, Paolo Bonzini wrote:
> > > On 13/02/20 23:18, Chia-I Wu wrote:
> > > >
> > > > The bug you mentioned was probably this one
> > > >
> > > >   https://bugzilla.kernel.org/show_bug.cgi?id=104091
> > >
> > > Yes, indeed.
> > >
> > > > From what I can tell, the commit allowed the guests to create cached
> > > > mappings to MMIO regions and caused MCEs.  That is different than what
> > > > I need, which is to allow guests to create uncached mappings to system
> > > > ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has uncached
> > > > mappings.  But it is true that this still allows the userspace & guest
> > > > kernel to create conflicting memory types.
> >
> > This is ok.
> >
> > > Right, the question is whether the MCEs were tied to MMIO regions
> > > specifically and if so why.
> >
> > 99.99999% likelihood the answer is "yes".  Cacheable accesses to non-existent
> > memory and most (all?) MMIO regions will cause a #MC.  This includes
> > speculative accesses.
> >
> > Commit fd717f11015f ("KVM: x86: apply guest MTRR virtualization on host
> > reserved pages") explicitly had a comment "1. MMIO: trust guest MTRR",
> > which is basically a direct avenue to generating #MCs.
> >
> > IIRC, WC accesses to non-existent memory will also cause #MC, but KVM has
> > bigger problems if it has PRESENT EPTEs pointing at garbage.
> >
> > > An interesting remark is in the footnote of table 11-7 in the SDM.
> > > There, for the MTRR (EPT for us) memory type UC you can read:
> > >
> > >   The UC attribute comes from the MTRRs and the processors are not
> > >   required to snoop their caches since the data could never have
> > >   been cached. This attribute is preferred for performance reasons.
> > >
> > > There are two possibilities:
> > >
> > > 1) the footnote doesn't apply to UC mode coming from EPT page tables.
> > > That would make your change safe.
> > >
> > > 2) the footnote also applies when the UC attribute comes from the EPT
> > > page tables rather than the MTRRs.  In that case, the host should use
> > > UC as the EPT page attribute if and only if it's consistent with the host
> > > MTRRs; it would be more or less impossible to honor UC in the guest MTRRs.
> > > In that case, something like the patch below would be needed.
> >
> > (2), the EPTs effectively replace the MTRRs.  The expectation being that
> > the VMM will always use EPT memtypes consistent with the MTRRs.
> This is my understanding as well.
>
> > > It is not clear from the manual why the footnote would not apply to WC; that
> > > is, the manual doesn't say explicitly that the processor does not do snooping
> > > for accesses to WC memory.  But I guess that must be the case, which is why I
> > > used MTRR_TYPE_WRCOMB in the patch below.
> >
> > A few paragraphs below table 11-12 states:
> >
> >   In particular, a WC page must never be aliased to a cacheable page because
> >   WC writes may not check the processor caches.
> >
> > > Either way, we would have an explanation of why creating cached mapping to
> > > MMIO regions would, and why in practice we're not seeing MCEs for guest RAM
> > > (the guest would have set WB for that memory in its MTRRs, not UC).
> >
> > Aliasing (physical) RAM with different memtypes won't cause #MC, just
> > memory corruption.
>
> What we need potentially gives the userspace (the guest kernel, to be
> exact) the ability to create conflicting memory types.  If we can be
> sure that the worst scenario is for a guest to corrupt its own memory,
> by only allowing aliases on physical ram, that seems alright.
>
> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> verified, but svm_get_mt_mask returns 0 which supposedly means the NPT
> does not restrict what the guest PAT can do).  This diff would do the
> trick for Intel without needing any uapi change:

I would be concerned about Intel CPU errata such as SKX40 and SKX59.

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-14 21:56           ` Jim Mattson
@ 2020-02-14 22:03             ` Sean Christopherson
  2020-02-18 16:28               ` Paolo Bonzini
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2020-02-14 22:03 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Wanpeng Li, kvm list, Joerg Roedel, ML dri-devel,
	Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini, Vitaly Kuznetsov

On Fri, Feb 14, 2020 at 01:56:48PM -0800, Jim Mattson wrote:
> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com> wrote:
> > AFAICT, it is currently allowed on ARM (verified) and AMD (not
> > verified, but svm_get_mt_mask returns 0 which supposedly means the NPT
> > does not restrict what the guest PAT can do).  This diff would do the
> > trick for Intel without needing any uapi change:
> 
> I would be concerned about Intel CPU errata such as SKX40 and SKX59.

The part KVM cares about, #MC, is already addressed by forcing UC for MMIO.
The data corruption issue is on the guest kernel to correctly use WC
and/or non-temporal writes.

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-14 22:03             ` Sean Christopherson
@ 2020-02-18 16:28               ` Paolo Bonzini
  2020-02-18 22:58                 ` Sean Christopherson
  2020-02-19  9:52                 ` Tian, Kevin
  0 siblings, 2 replies; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-18 16:28 UTC (permalink / raw)
  To: Sean Christopherson, Jim Mattson
  Cc: Wanpeng Li, kvm list, Joerg Roedel, ML dri-devel,
	Gurchetan Singh, Gerd Hoffmann, Vitaly Kuznetsov

On 14/02/20 23:03, Sean Christopherson wrote:
>> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com> wrote:
>>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
>>> verified, but svm_get_mt_mask returns 0 which supposedly means the NPT
>>> does not restrict what the guest PAT can do).  This diff would do the
>>> trick for Intel without needing any uapi change:
>> I would be concerned about Intel CPU errata such as SKX40 and SKX59.
> The part KVM cares about, #MC, is already addressed by forcing UC for MMIO.
> The data corruption issue is on the guest kernel to correctly use WC
> and/or non-temporal writes.

What about coherency across live migration?  The userspace process would
use cached accesses, and also a WBINVD could potentially corrupt guest
memory.

Paolo


* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-18 16:28               ` Paolo Bonzini
@ 2020-02-18 22:58                 ` Sean Christopherson
  2020-02-19  9:52                 ` Tian, Kevin
  1 sibling, 0 replies; 30+ messages in thread
From: Sean Christopherson @ 2020-02-18 22:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Wanpeng Li, kvm list, Joerg Roedel, ML dri-devel,
	Gurchetan Singh, Gerd Hoffmann, Vitaly Kuznetsov, Jim Mattson

On Tue, Feb 18, 2020 at 05:28:51PM +0100, Paolo Bonzini wrote:
> On 14/02/20 23:03, Sean Christopherson wrote:
> >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com> wrote:
> >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> >>> verified, but svm_get_mt_mask returns 0 which supposedly means the NPT
> >>> does not restrict what the guest PAT can do).  This diff would do the
> >>> trick for Intel without needing any uapi change:
> >> I would be concerned about Intel CPU errata such as SKX40 and SKX59.
> > The part KVM cares about, #MC, is already addressed by forcing UC for MMIO.
> > The data corruption issue is on the guest kernel to correctly use WC
> > and/or non-temporal writes.
> 
> What about coherency across live migration?  The userspace process would
> use cached accesses, and also a WBINVD could potentially corrupt guest
> memory.

Haven't given it an iota of thought.  My comments were purely in the
context of the SKX40 and SKX59 errata, which are silly bugs that no real
world software will ever hit, e.g. no sane software is going to split a
MASKMOV instruction across address spaces.  Hitting SKX59 would further
require the kernel to map MMIO as WB directly adjacent to RAM, which is
beyond crazy.

* RE: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-18 16:28               ` Paolo Bonzini
  2020-02-18 22:58                 ` Sean Christopherson
@ 2020-02-19  9:52                 ` Tian, Kevin
  2020-02-19 19:36                   ` Chia-I Wu
  1 sibling, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-02-19  9:52 UTC (permalink / raw)
  To: Paolo Bonzini, Christopherson, Sean J, Jim Mattson
  Cc: Wanpeng Li, kvm list, Joerg Roedel, ML dri-devel,
	Gurchetan Singh, Gerd Hoffmann, Vitaly Kuznetsov

> From: Paolo Bonzini
> Sent: Wednesday, February 19, 2020 12:29 AM
> 
> On 14/02/20 23:03, Sean Christopherson wrote:
> >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com> wrote:
> >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> >>> verified, but svm_get_mt_mask returns 0 which supposedly means the
> NPT
> >>> does not restrict what the guest PAT can do).  This diff would do the
> >>> trick for Intel without needing any uapi change:
> >> I would be concerned about Intel CPU errata such as SKX40 and SKX59.
> > The part KVM cares about, #MC, is already addressed by forcing UC for
> MMIO.
> > The data corruption issue is on the guest kernel to correctly use WC
> > and/or non-temporal writes.
> 
> What about coherency across live migration?  The userspace process would
> use cached accesses, and also a WBINVD could potentially corrupt guest
> memory.
> 

In such a case the userspace process should probably conservatively use
a UC mapping, as it would for MMIO regions on a passthrough device.
However, there remains a problem: the definition of KVM_MEM_DMA implies
favoring the guest setting, which could in concept be any type, so
assuming UC is also problematic.  I'm not sure whether inventing another
interface to query the effective memory type from KVM is a good idea.
There is no guarantee that the guest will use the same type for every
page in the same slot, so such an interface might be messy.
Alternatively, maybe we could just have an interface for KVM userspace
to force the memory type for a given slot, if it is mainly used in
para-virtualized scenarios (e.g. virtio-gpu) where the guest is
enlightened to use a forced type (e.g. WC)?

Thanks
Kevin

* RE: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-14 21:15       ` Chia-I Wu
@ 2020-02-19 10:00         ` Tian, Kevin
  2020-02-19 19:18           ` Chia-I Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-02-19 10:00 UTC (permalink / raw)
  To: Chia-I Wu, Paolo Bonzini
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Christopherson, Sean J, Gerd Hoffmann, vkuznets, jmattson

> From: Chia-I Wu
> Sent: Saturday, February 15, 2020 5:15 AM
> 
> On Fri, Feb 14, 2020 at 2:26 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On 13/02/20 23:18, Chia-I Wu wrote:
> > >
> > > The bug you mentioned was probably this one
> > >
> > >   https://bugzilla.kernel.org/show_bug.cgi?id=104091
> >
> > Yes, indeed.
> >
> > > From what I can tell, the commit allowed the guests to create cached
> > > mappings to MMIO regions and caused MCEs.  That is different than what
> > > I need, which is to allow guests to create uncached mappings to system
> > > ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has
> uncached
> > > mappings.  But it is true that this still allows the userspace & guest
> > > kernel to create conflicting memory types.
> >
> > Right, the question is whether the MCEs were tied to MMIO regions
> > specifically and if so why.
> >
> > An interesting remark is in the footnote of table 11-7 in the SDM.
> > There, for the MTRR (EPT for us) memory type UC you can read:
> >
> >   The UC attribute comes from the MTRRs and the processors are not
> >   required to snoop their caches since the data could never have
> >   been cached. This attribute is preferred for performance reasons.
> >
> > There are two possibilities:
> >
> > 1) the footnote doesn't apply to UC mode coming from EPT page tables.
> > That would make your change safe.
> >
> > 2) the footnote also applies when the UC attribute comes from the EPT
> > page tables rather than the MTRRs.  In that case, the host should use
> > UC as the EPT page attribute if and only if it's consistent with the host
> > MTRRs; it would be more or less impossible to honor UC in the guest
> MTRRs.
> > In that case, something like the patch below would be needed.
> >
> > It is not clear from the manual why the footnote would not apply to WC;
> that
> > is, the manual doesn't say explicitly that the processor does not do
> snooping
> > for accesses to WC memory.  But I guess that must be the case, which is
> why I
> > used MTRR_TYPE_WRCOMB in the patch below.
> >
> > Either way, we would have an explanation of why creating cached mapping
> to
> > MMIO regions would, and why in practice we're not seeing MCEs for guest
> RAM
> > (the guest would have set WB for that memory in its MTRRs, not UC).
> >
> > One thing you didn't say: how would userspace use KVM_MEM_DMA?  On
> which
> > regions would it be set?
> It will be set for shmems that are mapped WC.
> 
> GPU/DRM drivers allocate shmems as DMA-able gpu buffers and allow the
> userspace to map them cached or WC (I915_MMAP_WC or
> AMDGPU_GEM_CREATE_CPU_GTT_USWC for example).  When a shmem is
> mapped
> WC and is made available to the guest, we would like the ability to
> map the region WC in the guest.

Curious... How is such slot exposed to the guest? A reserved memory
region? Is it static or might be dynamically added?

Thanks
Kevin

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-19 10:00         ` Tian, Kevin
@ 2020-02-19 19:18           ` Chia-I Wu
  2020-02-20  2:13             ` Tian, Kevin
  0 siblings, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-19 19:18 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Christopherson, Sean J, Gerd Hoffmann, Paolo Bonzini, vkuznets,
	jmattson

On Wed, Feb 19, 2020 at 2:00 AM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Chia-I Wu
> > Sent: Saturday, February 15, 2020 5:15 AM
> >
> > On Fri, Feb 14, 2020 at 2:26 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > On 13/02/20 23:18, Chia-I Wu wrote:
> > > >
> > > > The bug you mentioned was probably this one
> > > >
> > > >   https://bugzilla.kernel.org/show_bug.cgi?id=104091
> > >
> > > Yes, indeed.
> > >
> > > > From what I can tell, the commit allowed the guests to create cached
> > > > mappings to MMIO regions and caused MCEs.  That is different than what
> > > > I need, which is to allow guests to create uncached mappings to system
> > > > ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has
> > uncached
> > > > mappings.  But it is true that this still allows the userspace & guest
> > > > kernel to create conflicting memory types.
> > >
> > > Right, the question is whether the MCEs were tied to MMIO regions
> > > specifically and if so why.
> > >
> > > An interesting remark is in the footnote of table 11-7 in the SDM.
> > > There, for the MTRR (EPT for us) memory type UC you can read:
> > >
> > >   The UC attribute comes from the MTRRs and the processors are not
> > >   required to snoop their caches since the data could never have
> > >   been cached. This attribute is preferred for performance reasons.
> > >
> > > There are two possibilities:
> > >
> > > 1) the footnote doesn't apply to UC mode coming from EPT page tables.
> > > That would make your change safe.
> > >
> > > 2) the footnote also applies when the UC attribute comes from the EPT
> > > page tables rather than the MTRRs.  In that case, the host should use
> > > UC as the EPT page attribute if and only if it's consistent with the host
> > > MTRRs; it would be more or less impossible to honor UC in the guest
> > MTRRs.
> > > In that case, something like the patch below would be needed.
> > >
> > > It is not clear from the manual why the footnote would not apply to WC;
> > that
> > > is, the manual doesn't say explicitly that the processor does not do
> > snooping
> > > for accesses to WC memory.  But I guess that must be the case, which is
> > why I
> > > used MTRR_TYPE_WRCOMB in the patch below.
> > >
> > > Either way, we would have an explanation of why creating cached mapping
> > to
> > > MMIO regions would, and why in practice we're not seeing MCEs for guest
> > RAM
> > > (the guest would have set WB for that memory in its MTRRs, not UC).
> > >
> > > One thing you didn't say: how would userspace use KVM_MEM_DMA?  On
> > which
> > > regions would it be set?
> > It will be set for shmems that are mapped WC.
> >
> > GPU/DRM drivers allocate shmems as DMA-able gpu buffers and allow the
> > userspace to map them cached or WC (I915_MMAP_WC or
> > AMDGPU_GEM_CREATE_CPU_GTT_USWC for example).  When a shmem is
> > mapped
> > WC and is made available to the guest, we would like the ability to
> > map the region WC in the guest.
>
> Curious... How is such slot exposed to the guest? A reserved memory
> region? Is it static or might be dynamically added?
The plan is for virtio-gpu device to reserve a huge memory region in
the guest.  Memslots may be added dynamically or statically to back
the region.

Dynamic: the host adds a 16MB GPU allocation as a memslot at a time.
The guest kernel suballocates from the 16MB pool.

Static: the host creates a huge PROT_NONE memfd and adds it as a
memslot.  GPU allocations are mremap()ed into the memfd region to
provide the real mapping.

These options are considered because the number of memslots is
limited: 32 on ARM and 509 on x86.  If the number of memslots could be
made larger (4096 or more), we would also consider adding each
individual GPU allocation as a memslot.

These are actually questions we need feedback on.  Besides, GPU
allocations can be assumed to be kernel dma-bufs in this context.  I
wonder if it makes sense to have a variation of
KVM_SET_USER_MEMORY_REGION that takes dma-bufs.


>
> Thanks
> Kevin

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-19  9:52                 ` Tian, Kevin
@ 2020-02-19 19:36                   ` Chia-I Wu
  2020-02-20  2:04                     ` Tian, Kevin
  0 siblings, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-19 19:36 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel, Christopherson,
	Sean J, Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini,
	Vitaly Kuznetsov, Jim Mattson

On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Paolo Bonzini
> > Sent: Wednesday, February 19, 2020 12:29 AM
> >
> > On 14/02/20 23:03, Sean Christopherson wrote:
> > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com> wrote:
> > >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> > >>> verified, but svm_get_mt_mask returns 0 which supposedly means the
> > NPT
> > >>> does not restrict what the guest PAT can do).  This diff would do the
> > >>> trick for Intel without needing any uapi change:
> > >> I would be concerned about Intel CPU errata such as SKX40 and SKX59.
> > > The part KVM cares about, #MC, is already addressed by forcing UC for
> > MMIO.
> > > The data corruption issue is on the guest kernel to correctly use WC
> > > and/or non-temporal writes.
> >
> > What about coherency across live migration?  The userspace process would
> > use cached accesses, and also a WBINVD could potentially corrupt guest
> > memory.
> >
>
> In such case the userspace process possibly should conservatively use
> UC mapping, as if for MMIO regions on a passthrough device. However
> there remains a problem. the definition of KVM_MEM_DMA implies
> favoring guest setting, which could be whatever type in concept. Then
> assuming UC is also problematic. I'm not sure whether inventing another
> interface to query effective memory type from KVM is a good idea. There
> is no guarantee that the guest will use same type for every page in the
> same slot, then such interface might be messy. Alternatively, maybe
> we could just have an interface for KVM userspace to force memory type
> for a given slot, if it is mainly used in para-virtualized scenarios (e.g.
> virtio-gpu) where the guest is enlightened to use a forced type (e.g. WC)?
KVM forcing the memory type for a given slot should work too.  But the
ignore-guest-pat bit seems to be Intel-specific.  We will need to
define how the second-level page attributes combine with the guest
page attributes somehow.

KVM should in theory be able to tell that the userspace region is
mapped with a certain memory type and can force the same memory type
onto the guest.  The userspace does not need to be involved.  But that
sounds very slow?  This may be a dumb question, but would it help to
add KVM_SET_DMA_BUF and let KVM negotiate the memory type with the
in-kernel GPU drivers?


>
> Thanks
> Kevin

* RE: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-19 19:36                   ` Chia-I Wu
@ 2020-02-20  2:04                     ` Tian, Kevin
  2020-02-20  2:38                       ` Tian, Kevin
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-02-20  2:04 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel, Christopherson,
	Sean J, Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini,
	Vitaly Kuznetsov, Jim Mattson

> From: Chia-I Wu <olvaffe@gmail.com>
> Sent: Thursday, February 20, 2020 3:37 AM
> 
> On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Paolo Bonzini
> > > Sent: Wednesday, February 19, 2020 12:29 AM
> > >
> > > On 14/02/20 23:03, Sean Christopherson wrote:
> > > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com> wrote:
> > > >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> > > >>> verified, but svm_get_mt_mask returns 0 which supposedly means
> the
> > > NPT
> > > >>> does not restrict what the guest PAT can do).  This diff would do the
> > > >>> trick for Intel without needing any uapi change:
> > > >> I would be concerned about Intel CPU errata such as SKX40 and SKX59.
> > > > The part KVM cares about, #MC, is already addressed by forcing UC for
> > > MMIO.
> > > > The data corruption issue is on the guest kernel to correctly use WC
> > > > and/or non-temporal writes.
> > >
> > > What about coherency across live migration?  The userspace process
> would
> > > use cached accesses, and also a WBINVD could potentially corrupt guest
> > > memory.
> > >
> >
> > In such case the userspace process possibly should conservatively use
> > UC mapping, as if for MMIO regions on a passthrough device. However
> > there remains a problem. the definition of KVM_MEM_DMA implies
> > favoring guest setting, which could be whatever type in concept. Then
> > assuming UC is also problematic. I'm not sure whether inventing another
> > interface to query effective memory type from KVM is a good idea. There
> > is no guarantee that the guest will use same type for every page in the
> > same slot, then such interface might be messy. Alternatively, maybe
> > we could just have an interface for KVM userspace to force memory type
> > for a given slot, if it is mainly used in para-virtualized scenarios (e.g.
> > virtio-gpu) where the guest is enlightened to use a forced type (e.g. WC)?
> KVM forcing the memory type for a given slot should work too.  But the
> ignore-guest-pat bit seems to be Intel-specific.  We will need to
> define how the second-level page attributes combine with the guest
> page attributes somehow.

Oh, I was not aware of that difference.  Without an ipat-equivalent
capability, I'm not sure how to force an arbitrary type here.  If you
look at table 11-7 in the Intel SDM, none of the MTRR (EPT) memory
types leads to a consistent effective type when combined with an
arbitrary PAT value.  So it is definitely a dead end.

> 
> KVM should in theory be able to tell that the userspace region is
> mapped with a certain memory type and can force the same memory type
> onto the guest.  The userspace does not need to be involved.  But that
> sounds very slow?  This may be a dumb question, but would it help to
> add KVM_SET_DMA_BUF and let KVM negotiate the memory type with the
> in-kernel GPU drivers?
> 
> 

KVM_SET_DMA_BUF looks more reasonable. But I guess we don't need
KVM to be aware of such a negotiation. We can continue with your original
proposal to have KVM simply favor the guest memory type (maybe still
called KVM_MEM_DMA). On the other hand, Qemu should just mmap the
fd of the dmabuf passed from the virtio-gpu device backend, e.g.
to conduct migration. That way the mmap request is ultimately served by
DRM and the underlying GPU drivers, with the proper type enforced automatically.

Thanks
Kevin
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-19 19:18           ` Chia-I Wu
@ 2020-02-20  2:13             ` Tian, Kevin
  2020-02-20 23:02               ` Chia-I Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-02-20  2:13 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Christopherson, Sean J, Gerd Hoffmann, Paolo Bonzini, vkuznets,
	jmattson

> From: Chia-I Wu <olvaffe@gmail.com>
> Sent: Thursday, February 20, 2020 3:18 AM
> 
> On Wed, Feb 19, 2020 at 2:00 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Chia-I Wu
> > > Sent: Saturday, February 15, 2020 5:15 AM
> > >
> > > On Fri, Feb 14, 2020 at 2:26 AM Paolo Bonzini <pbonzini@redhat.com>
> wrote:
> > > >
> > > > On 13/02/20 23:18, Chia-I Wu wrote:
> > > > >
> > > > > The bug you mentioned was probably this one
> > > > >
> > > > >   https://bugzilla.kernel.org/show_bug.cgi?id=104091
> > > >
> > > > Yes, indeed.
> > > >
> > > > > From what I can tell, the commit allowed the guests to create cached
> > > > > mappings to MMIO regions and caused MCEs.  That is different than
> what
> > > > > I need, which is to allow guests to create uncached mappings to
> system
> > > > > ram (i.e., !kvm_is_mmio_pfn) when the host userspace also has
> > > uncached
> > > > > mappings.  But it is true that this still allows the userspace & guest
> > > > > kernel to create conflicting memory types.
> > > >
> > > > Right, the question is whether the MCEs were tied to MMIO regions
> > > > specifically and if so why.
> > > >
> > > > An interesting remark is in the footnote of table 11-7 in the SDM.
> > > > There, for the MTRR (EPT for us) memory type UC you can read:
> > > >
> > > >   The UC attribute comes from the MTRRs and the processors are not
> > > >   required to snoop their caches since the data could never have
> > > >   been cached. This attribute is preferred for performance reasons.
> > > >
> > > > There are two possibilities:
> > > >
> > > > 1) the footnote doesn't apply to UC mode coming from EPT page tables.
> > > > That would make your change safe.
> > > >
> > > > 2) the footnote also applies when the UC attribute comes from the EPT
> > > > page tables rather than the MTRRs.  In that case, the host should use
> > > > UC as the EPT page attribute if and only if it's consistent with the host
> > > > MTRRs; it would be more or less impossible to honor UC in the guest
> > > MTRRs.
> > > > In that case, something like the patch below would be needed.
> > > >
> > > > It is not clear from the manual why the footnote would not apply to WC;
> > > that
> > > > is, the manual doesn't say explicitly that the processor does not do
> > > snooping
> > > > for accesses to WC memory.  But I guess that must be the case, which is
> > > why I
> > > > used MTRR_TYPE_WRCOMB in the patch below.
> > > >
> > > > Either way, we would have an explanation of why creating cached
> mapping
> > > to
> > > > MMIO regions would, and why in practice we're not seeing MCEs for
> guest
> > > RAM
> > > > (the guest would have set WB for that memory in its MTRRs, not UC).
> > > >
> > > > One thing you didn't say: how would userspace use KVM_MEM_DMA?
> On
> > > which
> > > > regions would it be set?
> > > It will be set for shmems that are mapped WC.
> > >
> > > GPU/DRM drivers allocate shmems as DMA-able gpu buffers and allow
> the
> > > userspace to map them cached or WC (I915_MMAP_WC or
> > > AMDGPU_GEM_CREATE_CPU_GTT_USWC for example).  When a shmem
> is
> > > mapped
> > > WC and is made available to the guest, we would like the ability to
> > > map the region WC in the guest.
> >
> > Curious... How is such slot exposed to the guest? A reserved memory
> > region? Is it static or might be dynamically added?
> The plan is for virtio-gpu device to reserve a huge memory region in
> the guest.  Memslots may be added dynamically or statically to back
> the region.

so the region is marked as E820_RESERVED to prevent the guest kernel
from using it for other purposes, and then the virtio-gpu device will
report the exact location of the region to the virtio-gpu driver through
its own interface?

> 
> Dynamic: the host adds a 16MB GPU allocation as a memslot at a time.
> The guest kernel suballocates from the 16MB pool.
> 
> Static: the host creates a huge PROT_NONE memfd and adds it as a
> memslot.  GPU allocations are mremap()ed into the memfd region to
> provide the real mapping.
> 
> These options are considered because the number of memslots are
> limited: 32 on ARM and 509 on x86.  If the number of memslots could be
> made larger (4096 or more), we would also consider adding each
> individual GPU allocation as a memslot.
> 
> These are actually questions we need feedback.  Besides, GPU
> allocations can be assumed to be kernel dma-bufs in this context.  I
> wonder if it makes sense to have a variation of
> KVM_SET_USER_MEMORY_REGION that takes dma-bufs.

I feel it makes more sense.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-20  2:04                     ` Tian, Kevin
@ 2020-02-20  2:38                       ` Tian, Kevin
  2020-02-20 22:23                         ` Chia-I Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-02-20  2:38 UTC (permalink / raw)
  To: Tian, Kevin, Chia-I Wu
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel, Christopherson,
	Sean J, Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini,
	Vitaly Kuznetsov, Jim Mattson

> From: Tian, Kevin
> Sent: Thursday, February 20, 2020 10:05 AM
> 
> > From: Chia-I Wu <olvaffe@gmail.com>
> > Sent: Thursday, February 20, 2020 3:37 AM
> >
> > On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Paolo Bonzini
> > > > Sent: Wednesday, February 19, 2020 12:29 AM
> > > >
> > > > On 14/02/20 23:03, Sean Christopherson wrote:
> > > > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com>
> wrote:
> > > > >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> > > > >>> verified, but svm_get_mt_mask returns 0 which supposedly means
> > the
> > > > NPT
> > > > >>> does not restrict what the guest PAT can do).  This diff would do the
> > > > >>> trick for Intel without needing any uapi change:
> > > > >> I would be concerned about Intel CPU errata such as SKX40 and
> SKX59.
> > > > > The part KVM cares about, #MC, is already addressed by forcing UC
> for
> > > > MMIO.
> > > > > The data corruption issue is on the guest kernel to correctly use WC
> > > > > and/or non-temporal writes.
> > > >
> > > > What about coherency across live migration?  The userspace process
> > would
> > > > use cached accesses, and also a WBINVD could potentially corrupt guest
> > > > memory.
> > > >
> > >
> > > In such case the userspace process possibly should conservatively use
> > > UC mapping, as if for MMIO regions on a passthrough device. However
> > > there remains a problem. the definition of KVM_MEM_DMA implies
> > > favoring guest setting, which could be whatever type in concept. Then
> > > assuming UC is also problematic. I'm not sure whether inventing another
> > > interface to query effective memory type from KVM is a good idea. There
> > > is no guarantee that the guest will use same type for every page in the
> > > same slot, then such interface might be messy. Alternatively, maybe
> > > we could just have an interface for KVM userspace to force memory type
> > > for a given slot, if it is mainly used in para-virtualized scenarios (e.g.
> > > virtio-gpu) where the guest is enlightened to use a forced type (e.g. WC)?
> > KVM forcing the memory type for a given slot should work too.  But the
> > ignore-guest-pat bit seems to be Intel-specific.  We will need to
> > define how the second-level page attributes combine with the guest
> > page attributes somehow.
> 
> oh, I'm not aware of that difference. without an ipat-equivalent
> capability, I'm not sure how to forcing random type here. If you look at
> table 11-7 in Intel SDM, none of MTRR (EPT) memory type can lead to
> consistent effective type when combining with random PAT value. So
>  it is definitely a dead end.
> 
> >
> > KVM should in theory be able to tell that the userspace region is
> > mapped with a certain memory type and can force the same memory type
> > onto the guest.  The userspace does not need to be involved.  But that
> > sounds very slow?  This may be a dumb question, but would it help to
> > add KVM_SET_DMA_BUF and let KVM negotiate the memory type with the
> > in-kernel GPU drivers?
> >
> >
> 
> KVM_SET_DMA_BUF looks more reasonable. But I guess we don't need
> KVM to be aware of such negotiation. We can continue your original
> proposal to have KVM simply favor guest memory type (maybe still call
> KVM_MEM_DMA). On the other hand, Qemu should just mmap on the
> fd handle of the dmabuf passed from the virtio-gpu device backend,  e.g.
> to conduct migration. That way the mmap request is finally served by
> DRM and underlying GPU drivers, with proper type enforced automatically.
> 

Thinking about it more, possibly we don't need to introduce a new interface
to KVM. As long as Qemu uses the dmabuf interface to mmap the specific
region, KVM can simply check the memory type in the host page table given
the hva of a memslot. If the type is UC or WC, it implies that userspace
wants a non-coherent mapping, which should be reflected on the guest side
too. In that case, KVM can take the non-coherent DMA path and favor the
guest memory type automatically.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-20  2:38                       ` Tian, Kevin
@ 2020-02-20 22:23                         ` Chia-I Wu
  2020-02-21  0:23                           ` Tian, Kevin
  0 siblings, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-20 22:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel, Christopherson,
	Sean J, Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini,
	Vitaly Kuznetsov, Jim Mattson

On Wed, Feb 19, 2020 at 6:38 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Tian, Kevin
> > Sent: Thursday, February 20, 2020 10:05 AM
> >
> > > From: Chia-I Wu <olvaffe@gmail.com>
> > > Sent: Thursday, February 20, 2020 3:37 AM
> > >
> > > On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> > > >
> > > > > From: Paolo Bonzini
> > > > > Sent: Wednesday, February 19, 2020 12:29 AM
> > > > >
> > > > > On 14/02/20 23:03, Sean Christopherson wrote:
> > > > > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com>
> > wrote:
> > > > > >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> > > > > >>> verified, but svm_get_mt_mask returns 0 which supposedly means
> > > the
> > > > > NPT
> > > > > >>> does not restrict what the guest PAT can do).  This diff would do the
> > > > > >>> trick for Intel without needing any uapi change:
> > > > > >> I would be concerned about Intel CPU errata such as SKX40 and
> > SKX59.
> > > > > > The part KVM cares about, #MC, is already addressed by forcing UC
> > for
> > > > > MMIO.
> > > > > > The data corruption issue is on the guest kernel to correctly use WC
> > > > > > and/or non-temporal writes.
> > > > >
> > > > > What about coherency across live migration?  The userspace process
> > > would
> > > > > use cached accesses, and also a WBINVD could potentially corrupt guest
> > > > > memory.
> > > > >
> > > >
> > > > In such case the userspace process possibly should conservatively use
> > > > UC mapping, as if for MMIO regions on a passthrough device. However
> > > > there remains a problem. the definition of KVM_MEM_DMA implies
> > > > favoring guest setting, which could be whatever type in concept. Then
> > > > assuming UC is also problematic. I'm not sure whether inventing another
> > > > interface to query effective memory type from KVM is a good idea. There
> > > > is no guarantee that the guest will use same type for every page in the
> > > > same slot, then such interface might be messy. Alternatively, maybe
> > > > we could just have an interface for KVM userspace to force memory type
> > > > for a given slot, if it is mainly used in para-virtualized scenarios (e.g.
> > > > virtio-gpu) where the guest is enlightened to use a forced type (e.g. WC)?
> > > KVM forcing the memory type for a given slot should work too.  But the
> > > ignore-guest-pat bit seems to be Intel-specific.  We will need to
> > > define how the second-level page attributes combine with the guest
> > > page attributes somehow.
> >
> > oh, I'm not aware of that difference. without an ipat-equivalent
> > capability, I'm not sure how to forcing random type here. If you look at
> > table 11-7 in Intel SDM, none of MTRR (EPT) memory type can lead to
> > consistent effective type when combining with random PAT value. So
> >  it is definitely a dead end.
> >
> > >
> > > KVM should in theory be able to tell that the userspace region is
> > > mapped with a certain memory type and can force the same memory type
> > > onto the guest.  The userspace does not need to be involved.  But that
> > > sounds very slow?  This may be a dumb question, but would it help to
> > > add KVM_SET_DMA_BUF and let KVM negotiate the memory type with the
> > > in-kernel GPU drivers?
> > >
> > >
> >
> > KVM_SET_DMA_BUF looks more reasonable. But I guess we don't need
> > KVM to be aware of such negotiation. We can continue your original
> > proposal to have KVM simply favor guest memory type (maybe still call
> > KVM_MEM_DMA). On the other hand, Qemu should just mmap on the
> > fd handle of the dmabuf passed from the virtio-gpu device backend,  e.g.
> > to conduct migration. That way the mmap request is finally served by
> > DRM and underlying GPU drivers, with proper type enforced automatically.
> >
>
> Thinking more possibly we don't need introduce new interface to KVM.
> As long as Qemu uses dmabuf interface to mmap the specific region,
> KVM can simply check memory type in host page table given hva of a
> memslot. If the type is UC or WC, it implies that userspace wants a
> non-coherent mapping which should be reflected in the guest side too.
> In such case, KVM can go to non-cohenrent DMA path and favor guest
> memory type automatically.
Sorry, I mixed two things together.

Userspace access to dmabuf mmap must be guarded by
DMA_BUF_SYNC_{START,END} ioctls.  It is possible that the GPU driver
always picks a WB mapping and let the ioctls flush/invalidate CPU
caches.  We actually want the guest memory type to match vkMapMemory's
memory type, which can be different from dmabuf mmap's memory type.
It is not enough for KVM to inspect the hva's memory type.

KVM_SET_DMA_BUF, if supported, is a signal to KVM that the guest
memory type should be honored (or forced if there is a new op in
dma_buf_ops that tells KVM which memory type to force).  KVM_MEM_DMA
flag in this RFC sends the same signal.  Unless KVM_SET_DMA_BUF gives
the userspace other features such as setting unlimited number of
dmabufs to subregions of a memslot, it is not very useful.

If a uapi change is to be avoided, the easiest option is to always honor
the guest memory type unless it could cause a #MC (i.e., is_mmio==true).


>
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-20  2:13             ` Tian, Kevin
@ 2020-02-20 23:02               ` Chia-I Wu
  0 siblings, 0 replies; 30+ messages in thread
From: Chia-I Wu @ 2020-02-20 23:02 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: wanpengli, kvm, joro, ML dri-devel, Gurchetan Singh,
	Christopherson, Sean J, Gerd Hoffmann, Paolo Bonzini, vkuznets,
	jmattson

On Wed, Feb 19, 2020 at 6:13 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> > > Curious... How is such slot exposed to the guest? A reserved memory
> > > region? Is it static or might be dynamically added?
> > The plan is for virtio-gpu device to reserve a huge memory region in
> > the guest.  Memslots may be added dynamically or statically to back
> > the region.
>
> so the region is marked as E820_RESERVED to prevent guest kernel
> from using it for other purpose and then virtio-gpu device will report
> virtio-gpu driver of the exact location of the region through its own
> interface?
The current plan is that the virtio-gpu device will have a BAR for the
region, which is like the vram aperture on real GPUs.  The virtio-gpu
driver manages the region the way it would manage vram.

When the guest userspace allocates from vram, the guest kernel
reserves an unused range from the region and tells the host the
offset.  The host allocates a real GPU buffer, maps the buffer, and
adds a memslot with gpa==bar_base+offset (or mremap).  When the guest
userspace mmaps it, the guest kernel does an io_remap_pfn_range().

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-20 22:23                         ` Chia-I Wu
@ 2020-02-21  0:23                           ` Tian, Kevin
  2020-02-21  4:45                             ` Chia-I Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-02-21  0:23 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel, Christopherson,
	Sean J, Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini,
	Vitaly Kuznetsov, Jim Mattson

> From: Chia-I Wu <olvaffe@gmail.com>
> Sent: Friday, February 21, 2020 6:24 AM
> 
> On Wed, Feb 19, 2020 at 6:38 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Tian, Kevin
> > > Sent: Thursday, February 20, 2020 10:05 AM
> > >
> > > > From: Chia-I Wu <olvaffe@gmail.com>
> > > > Sent: Thursday, February 20, 2020 3:37 AM
> > > >
> > > > On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin <kevin.tian@intel.com>
> wrote:
> > > > >
> > > > > > From: Paolo Bonzini
> > > > > > Sent: Wednesday, February 19, 2020 12:29 AM
> > > > > >
> > > > > > On 14/02/20 23:03, Sean Christopherson wrote:
> > > > > > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com>
> > > wrote:
> > > > > > >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
> > > > > > >>> verified, but svm_get_mt_mask returns 0 which supposedly
> means
> > > > the
> > > > > > NPT
> > > > > > >>> does not restrict what the guest PAT can do).  This diff would do
> the
> > > > > > >>> trick for Intel without needing any uapi change:
> > > > > > >> I would be concerned about Intel CPU errata such as SKX40 and
> > > SKX59.
> > > > > > > The part KVM cares about, #MC, is already addressed by forcing
> UC
> > > for
> > > > > > MMIO.
> > > > > > > The data corruption issue is on the guest kernel to correctly use
> WC
> > > > > > > and/or non-temporal writes.
> > > > > >
> > > > > > What about coherency across live migration?  The userspace
> process
> > > > would
> > > > > > use cached accesses, and also a WBINVD could potentially corrupt
> guest
> > > > > > memory.
> > > > > >
> > > > >
> > > > > In such case the userspace process possibly should conservatively use
> > > > > UC mapping, as if for MMIO regions on a passthrough device.
> However
> > > > > there remains a problem. the definition of KVM_MEM_DMA implies
> > > > > favoring guest setting, which could be whatever type in concept. Then
> > > > > assuming UC is also problematic. I'm not sure whether inventing
> another
> > > > > interface to query effective memory type from KVM is a good idea.
> There
> > > > > is no guarantee that the guest will use same type for every page in the
> > > > > same slot, then such interface might be messy. Alternatively, maybe
> > > > > we could just have an interface for KVM userspace to force memory
> type
> > > > > for a given slot, if it is mainly used in para-virtualized scenarios (e.g.
> > > > > virtio-gpu) where the guest is enlightened to use a forced type (e.g.
> WC)?
> > > > KVM forcing the memory type for a given slot should work too.  But the
> > > > ignore-guest-pat bit seems to be Intel-specific.  We will need to
> > > > define how the second-level page attributes combine with the guest
> > > > page attributes somehow.
> > >
> > > oh, I'm not aware of that difference. without an ipat-equivalent
> > > capability, I'm not sure how to forcing random type here. If you look at
> > > table 11-7 in Intel SDM, none of MTRR (EPT) memory type can lead to
> > > consistent effective type when combining with random PAT value. So
> > >  it is definitely a dead end.
> > >
> > > >
> > > > KVM should in theory be able to tell that the userspace region is
> > > > mapped with a certain memory type and can force the same memory
> type
> > > > onto the guest.  The userspace does not need to be involved.  But that
> > > > sounds very slow?  This may be a dumb question, but would it help to
> > > > add KVM_SET_DMA_BUF and let KVM negotiate the memory type with
> the
> > > > in-kernel GPU drivers?
> > > >
> > > >
> > >
> > > KVM_SET_DMA_BUF looks more reasonable. But I guess we don't need
> > > KVM to be aware of such negotiation. We can continue your original
> > > proposal to have KVM simply favor guest memory type (maybe still call
> > > KVM_MEM_DMA). On the other hand, Qemu should just mmap on the
> > > fd handle of the dmabuf passed from the virtio-gpu device backend,  e.g.
> > > to conduct migration. That way the mmap request is finally served by
> > > DRM and underlying GPU drivers, with proper type enforced
> automatically.
> > >
> >
> > Thinking more possibly we don't need introduce new interface to KVM.
> > As long as Qemu uses dmabuf interface to mmap the specific region,
> > KVM can simply check memory type in host page table given hva of a
> > memslot. If the type is UC or WC, it implies that userspace wants a
> > non-coherent mapping which should be reflected in the guest side too.
> > In such case, KVM can go to non-cohenrent DMA path and favor guest
> > memory type automatically.
> Sorry, I mixed two things together.
> 
> Userspace access to dmabuf mmap must be guarded by
> DMA_BUF_SYNC_{START,END} ioctls.  It is possible that the GPU driver
> always picks a WB mapping and let the ioctls flush/invalidate CPU
> caches.  We actually want the guest memory type to match vkMapMemory's
> memory type, which can be different from dmabuf mmap's memory type.
> It is not enough for KVM to inspect the hva's memory type.

I'm not familiar with dmabuf or the difference between
vkMapMemory and mmap. Just a simple thought: whatever
memory type/synchronization is enforced on the host userspace should
ideally be applied to the guest userspace too. E.g. in the above example
we possibly want the guest to use WB and issue flush/invalidate hypercalls
to guard against other potential parallel operations on the host side.
Otherwise I cannot see how synchronization can be done when one
side uses WB with sync primitives while the other simply uses WC
without such primitives.

> 
> KVM_SET_DMA_BUF, if supported, is a signal to KVM that the guest
> memory type should be honored (or forced if there is a new op in
> dma_buf_ops that tells KVM which memory type to force).  KVM_MEM_DMA
> flag in this RFC sends the same signal.  Unless KVM_SET_DMA_BUF gives
> the userspace other features such as setting unlimited number of
> dmabufs to subregions of a memslot, it is not very useful.

the good part of a new interface is its simplicity, but it only works at
slot granularity. Having KVM inspect the hva instead can support page
granularity, but adds run-time overhead. Let's see what Paolo
thinks. 😊

> 
> If uapi change is to be avoided, it is the easiest that guest memory
> type is always honored unless it causes #MC (i.e.,is_mmio==true).
> 

I feel this goes too far...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-21  0:23                           ` Tian, Kevin
@ 2020-02-21  4:45                             ` Chia-I Wu
  2020-02-21  4:51                               ` Chia-I Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-21  4:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel, Christopherson,
	Sean J, Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini,
	Vitaly Kuznetsov, Jim Mattson


On Thu, Feb 20, 2020 at 4:23 PM Tian, Kevin <kevin.tian@intel.com> wrote:

> > From: Chia-I Wu <olvaffe@gmail.com>
> > Sent: Friday, February 21, 2020 6:24 AM
> >
> > On Wed, Feb 19, 2020 at 6:38 PM Tian, Kevin <kevin.tian@intel.com>
> wrote:
> > >
> > > > From: Tian, Kevin
> > > > Sent: Thursday, February 20, 2020 10:05 AM
> > > >
> > > > > From: Chia-I Wu <olvaffe@gmail.com>
> > > > > Sent: Thursday, February 20, 2020 3:37 AM
> > > > >
> > > > > On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin <kevin.tian@intel.com>
> > wrote:
> > > > > >
> > > > > > > From: Paolo Bonzini
> > > > > > > Sent: Wednesday, February 19, 2020 12:29 AM
> > > > > > >
> > > > > > > On 14/02/20 23:03, Sean Christopherson wrote:
> > > > > > > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <
> olvaffe@gmail.com>
> > > > wrote:
> > > > > > > >>> AFAICT, it is currently allowed on ARM (verified) and AMD
> (not
> > > > > > > >>> verified, but svm_get_mt_mask returns 0 which supposedly
> > means
> > > > > the
> > > > > > > NPT
> > > > > > > >>> does not restrict what the guest PAT can do).  This diff
> would do
> > the
> > > > > > > >>> trick for Intel without needing any uapi change:
> > > > > > > >> I would be concerned about Intel CPU errata such as SKX40
> and
> > > > SKX59.
> > > > > > > > The part KVM cares about, #MC, is already addressed by
> forcing
> > UC
> > > > for
> > > > > > > MMIO.
> > > > > > > > The data corruption issue is on the guest kernel to
> correctly use
> > WC
> > > > > > > > and/or non-temporal writes.
> > > > > > >
> > > > > > > What about coherency across live migration?  The userspace
> > process
> > > > > would
> > > > > > > use cached accesses, and also a WBINVD could potentially
> corrupt
> > guest
> > > > > > > memory.
> > > > > > >
> > > > > >
> > > > > > In such case the userspace process possibly should
> conservatively use
> > > > > > UC mapping, as if for MMIO regions on a passthrough device.
> > However
> > > > > > there remains a problem. the definition of KVM_MEM_DMA implies
> > > > > > favoring guest setting, which could be whatever type in concept.
> Then
> > > > > > assuming UC is also problematic. I'm not sure whether inventing
> > another
> > > > > > interface to query effective memory type from KVM is a good idea.
> > There
> > > > > > is no guarantee that the guest will use same type for every page
> in the
> > > > > > same slot, then such interface might be messy. Alternatively,
> maybe
> > > > > > we could just have an interface for KVM userspace to force memory
> > type
> > > > > > for a given slot, if it is mainly used in para-virtualized
> scenarios (e.g.
> > > > > > virtio-gpu) where the guest is enlightened to use a forced type
> (e.g.
> > WC)?
> > > > > KVM forcing the memory type for a given slot should work too.  But
> the
> > > > > ignore-guest-pat bit seems to be Intel-specific.  We will need to
> > > > > define how the second-level page attributes combine with the guest
> > > > > page attributes somehow.
> > > >
> > > > oh, I'm not aware of that difference. without an ipat-equivalent
> > > > capability, I'm not sure how to forcing random type here. If you
> look at
> > > > table 11-7 in Intel SDM, none of MTRR (EPT) memory type can lead to
> > > > consistent effective type when combining with random PAT value. So
> > > >  it is definitely a dead end.
> > > >
> > > > >
> > > > > KVM should in theory be able to tell that the userspace region is
> > > > > mapped with a certain memory type and can force the same memory
> > type
> > > > > onto the guest.  The userspace does not need to be involved.  But
> that
> > > > > sounds very slow?  This may be a dumb question, but would it help
> to
> > > > > add KVM_SET_DMA_BUF and let KVM negotiate the memory type with
> > the
> > > > > in-kernel GPU drivers?
> > > > >
> > > > >
> > > >
> > > > KVM_SET_DMA_BUF looks more reasonable. But I guess we don't need
> > > > KVM to be aware of such negotiation. We can continue your original
> > > > proposal to have KVM simply favor guest memory type (maybe still call
> > > > KVM_MEM_DMA). On the other hand, Qemu should just mmap on the
> > > > fd handle of the dmabuf passed from the virtio-gpu device backend,
> e.g.
> > > > to conduct migration. That way the mmap request is finally served by
> > > > DRM and underlying GPU drivers, with proper type enforced
> > automatically.
> > > >
> > >
> > > Thinking more possibly we don't need introduce new interface to KVM.
> > > As long as Qemu uses dmabuf interface to mmap the specific region,
> > > KVM can simply check memory type in host page table given hva of a
> > > memslot. If the type is UC or WC, it implies that userspace wants a
> > > non-coherent mapping which should be reflected in the guest side too.
> > > In such case, KVM can go to non-cohenrent DMA path and favor guest
> > > memory type automatically.
> > Sorry, I mixed two things together.
> >
> > Userspace access to dmabuf mmap must be guarded by
> > DMA_BUF_SYNC_{START,END} ioctls.  It is possible that the GPU driver
> > always picks a WB mapping and let the ioctls flush/invalidate CPU
> > caches.  We actually want the guest memory type to match vkMapMemory's
> > memory type, which can be different from dmabuf mmap's memory type.
> > It is not enough for KVM to inspect the hva's memory type.
>
> I'm not familiar with dmabuf and what is the difference between
> vkMapMemory and mmap. Just a simple thought that whatever
> memory type/synchronization enforced on the host userspace should
> ideally be applied to guest userspace too. e.g. in above example we
> possibly want the guest to use WB and issue flush/invalidate hypercalls
> to guard with other potential parallel operations in the host side.
> otherwise I cannot see how synchronization can be done when one
> use WB with sync primitives while the other simply use WC w/o such
> primitives.
>

I am reasonably familiar with the GPU stacks, but I am not familiar with
KVM :)

When allocating GPU memory, userspace can specify whether it wants a
coherent or an incoherent allocation.  vkMapMemory then returns a
coherent or an incoherent mapping respectively, and we want the guest
userspace to get a mapping of the matching kind.

The GPU memory can be exported as a dmabuf to share with another device or
process.  For security, we allocate the GPU memory in a GPU process and we
export the dmabuf to the hypervisor.  mmap of dmabuf semantically returns
an incoherent mapping.  As a result, the guest will set up a mapping that
has the same memory type as the vkMapMemory mapping does, but the hva in
KVM_SET_USER_MEMORY_REGION points to the dmabuf's incoherent mapping.
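
For reference, this is roughly what the required bracketing of CPU
access to a dmabuf mmap looks like from userspace (a minimal sketch
using the dma-buf sync uapi; dmabuf_fd and map are assumed to come from
a real exporter and a prior mmap of it):

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Sketch: bracket a CPU write to a dmabuf mapping with the
 * DMA_BUF_IOCTL_SYNC ioctl, as the dma-buf uapi requires.  The mmap
 * itself may be WB, with the ioctls flushing/invalidating CPU caches.
 */
static int cpu_write(int dmabuf_fd, void *map, const void *src, size_t len)
{
	struct dma_buf_sync sync = {0};

	sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE;
	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
		return -1; /* begin-CPU-access failed */

	memcpy(map, src, len); /* plain CPU access is only valid here */

	sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
	return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}
```

A guest that maps the same pages WC would bypass exactly this
begin/end protocol, which is the mismatch being discussed.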

If you think it is best for KVM to inspect the hva to determine the
memory type at page granularity, that is reasonable and should work for
us too.  The userspace can do something on its side (e.g., add a GPU
driver dependency to the hypervisor so that the dma-buf is imported as
GPU memory and mapped using vkMapMemory), or I can work with the
dma-buf maintainers to see if dma-buf's semantics can be changed.


> >
> > KVM_SET_DMA_BUF, if supported, is a signal to KVM that the guest
> > memory type should be honored (or forced if there is a new op in
> > dma_buf_ops that tells KVM which memory type to force).  KVM_MEM_DMA
> > flag in this RFC sends the same signal.  Unless KVM_SET_DMA_BUF gives
> > the userspace other features such as setting unlimited number of
> > dmabufs to subregions of a memslot, it is not very useful.
>
> the good part of a new interface is its simplicity, but only in slot
> granularity. instead having KVM to inspect hva can support page
> granularity, but adding run-time overhead. Let's see how Paolo
> thinks. 😊
>
> >
> > If uapi change is to be avoided, it is the easiest that guest memory
> > type is always honored unless it causes #MC (i.e.,is_mmio==true).
> >
>
> I feel this goes too far...
>
> Thanks
> Kevin
>
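
To make the proposal concrete, the userspace side of the KVM_MEM_DMA
flag discussed above would look roughly like this (a sketch only:
KVM_MEM_DMA is the hypothetical flag from this RFC and is not in the
upstream uapi; vm_fd is assumed to be an existing KVM VM fd and hva a
prior dmabuf mmap):

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical flag proposed by this RFC; not in the upstream uapi. */
#ifndef KVM_MEM_DMA
#define KVM_MEM_DMA (1UL << 2)
#endif

/*
 * Register a memslot whose hva points at a dmabuf mmap, asking KVM to
 * honor the guest memory type for the slot.
 */
static int set_dma_memslot(int vm_fd, uint32_t slot, uint64_t gpa,
			   void *hva, uint64_t size)
{
	struct kvm_userspace_memory_region region;

	memset(&region, 0, sizeof(region));
	region.slot = slot;
	region.flags = KVM_MEM_DMA;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = (uintptr_t)hva;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}
```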

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-21  4:45                             ` Chia-I Wu
@ 2020-02-21  4:51                               ` Chia-I Wu
  2020-02-21  5:39                                 ` Tian, Kevin
  0 siblings, 1 reply; 30+ messages in thread
From: Chia-I Wu @ 2020-02-21  4:51 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel, Christopherson,
	Sean J, Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini,
	Vitaly Kuznetsov, Jim Mattson

(resend because gmail did not format to plain text...)

On Thu, Feb 20, 2020 at 8:45 PM Chia-I Wu <olvaffe@gmail.com> wrote:
>
>
>
> On Thu, Feb 20, 2020 at 4:23 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>>
>> > From: Chia-I Wu <olvaffe@gmail.com>
>> > Sent: Friday, February 21, 2020 6:24 AM
>> >
>> > On Wed, Feb 19, 2020 at 6:38 PM Tian, Kevin <kevin.tian@intel.com> wrote:
>> > >
>> > > > From: Tian, Kevin
>> > > > Sent: Thursday, February 20, 2020 10:05 AM
>> > > >
>> > > > > From: Chia-I Wu <olvaffe@gmail.com>
>> > > > > Sent: Thursday, February 20, 2020 3:37 AM
>> > > > >
>> > > > > On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin <kevin.tian@intel.com>
>> > wrote:
>> > > > > >
>> > > > > > > From: Paolo Bonzini
>> > > > > > > Sent: Wednesday, February 19, 2020 12:29 AM
>> > > > > > >
>> > > > > > > On 14/02/20 23:03, Sean Christopherson wrote:
>> > > > > > > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu <olvaffe@gmail.com>
>> > > > wrote:
>> > > > > > > >>> AFAICT, it is currently allowed on ARM (verified) and AMD (not
>> > > > > > > >>> verified, but svm_get_mt_mask returns 0 which supposedly
>> > means
>> > > > > the
>> > > > > > > NPT
>> > > > > > > >>> does not restrict what the guest PAT can do).  This diff would do
>> > the
>> > > > > > > >>> trick for Intel without needing any uapi change:
>> > > > > > > >> I would be concerned about Intel CPU errata such as SKX40 and
>> > > > SKX59.
>> > > > > > > > The part KVM cares about, #MC, is already addressed by forcing
>> > UC
>> > > > for
>> > > > > > > MMIO.
>> > > > > > > > The data corruption issue is on the guest kernel to correctly use
>> > WC
>> > > > > > > > and/or non-temporal writes.
>> > > > > > >
>> > > > > > > What about coherency across live migration?  The userspace
>> > process
>> > > > > would
>> > > > > > > use cached accesses, and also a WBINVD could potentially corrupt
>> > guest
>> > > > > > > memory.
>> > > > > > >
>> > > > > >
>> > > > > > In such case the userspace process possibly should conservatively use
>> > > > > > UC mapping, as if for MMIO regions on a passthrough device.
>> > However
>> > > > > > there remains a problem. the definition of KVM_MEM_DMA implies
>> > > > > > favoring guest setting, which could be whatever type in concept. Then
>> > > > > > assuming UC is also problematic. I'm not sure whether inventing
>> > another
>> > > > > > interface to query effective memory type from KVM is a good idea.
>> > There
>> > > > > > is no guarantee that the guest will use same type for every page in the
>> > > > > > same slot, then such interface might be messy. Alternatively, maybe
>> > > > > > we could just have an interface for KVM userspace to force memory
>> > type
>> > > > > > for a given slot, if it is mainly used in para-virtualized scenarios (e.g.
>> > > > > > virtio-gpu) where the guest is enlightened to use a forced type (e.g.
>> > WC)?
>> > > > > KVM forcing the memory type for a given slot should work too.  But the
>> > > > > ignore-guest-pat bit seems to be Intel-specific.  We will need to
>> > > > > define how the second-level page attributes combine with the guest
>> > > > > page attributes somehow.
>> > > >
>> > > > oh, I'm not aware of that difference. without an ipat-equivalent
>> > > > capability, I'm not sure how to forcing random type here. If you look at
>> > > > table 11-7 in Intel SDM, none of MTRR (EPT) memory type can lead to
>> > > > consistent effective type when combining with random PAT value. So
>> > > >  it is definitely a dead end.
>> > > >
>> > > > >
>> > > > > KVM should in theory be able to tell that the userspace region is
>> > > > > mapped with a certain memory type and can force the same memory
>> > type
>> > > > > onto the guest.  The userspace does not need to be involved.  But that
>> > > > > sounds very slow?  This may be a dumb question, but would it help to
>> > > > > add KVM_SET_DMA_BUF and let KVM negotiate the memory type with
>> > the
>> > > > > in-kernel GPU drivers?
>> > > > >
>> > > > >
>> > > >
>> > > > KVM_SET_DMA_BUF looks more reasonable. But I guess we don't need
>> > > > KVM to be aware of such negotiation. We can continue your original
>> > > > proposal to have KVM simply favor guest memory type (maybe still call
>> > > > KVM_MEM_DMA). On the other hand, Qemu should just mmap on the
>> > > > fd handle of the dmabuf passed from the virtio-gpu device backend,  e.g.
>> > > > to conduct migration. That way the mmap request is finally served by
>> > > > DRM and underlying GPU drivers, with proper type enforced
>> > automatically.
>> > > >
>> > >
>> > > Thinking more possibly we don't need introduce new interface to KVM.
>> > > As long as Qemu uses dmabuf interface to mmap the specific region,
>> > > KVM can simply check memory type in host page table given hva of a
>> > > memslot. If the type is UC or WC, it implies that userspace wants a
>> > > non-coherent mapping which should be reflected in the guest side too.
>> > > In such case, KVM can go to non-coherent DMA path and favor guest
>> > > memory type automatically.
>> > Sorry, I mixed two things together.
>> >
>> > Userspace access to dmabuf mmap must be guarded by
>> > DMA_BUF_SYNC_{START,END} ioctls.  It is possible that the GPU driver
>> > always picks a WB mapping and let the ioctls flush/invalidate CPU
>> > caches.  We actually want the guest memory type to match vkMapMemory's
>> > memory type, which can be different from dmabuf mmap's memory type.
>> > It is not enough for KVM to inspect the hva's memory type.
>>
>> I'm not familiar with dmabuf and what is the difference between
>> vkMapMemory and mmap. Just a simple thought that whatever
>> memory type/synchronization enforced on the host userspace should
>> ideally be applied to guest userspace too. e.g. in above example we
>> possibly want the guest to use WB and issue flush/invalidate hypercalls
>> to guard with other potential parallel operations in the host side.
>> otherwise I cannot see how synchronization can be done when one
>> use WB with sync primitives while the other simply use WC w/o such
>> primitives.
>
I am reasonably familiar with the GPU stacks, but I am not familiar with KVM
:)

When allocating GPU memory, userspace can specify whether it wants a
coherent or an incoherent allocation.  vkMapMemory then returns a coherent
or an incoherent mapping respectively, and we want the guest userspace to
get a mapping of the matching kind.

The GPU memory can be exported as a dmabuf to share with another device or
process.  For security, we allocate the GPU memory in a GPU process and we
export the dmabuf to the hypervisor.  mmap of dmabuf semantically returns an
incoherent mapping.  As a result, the guest will set up a mapping that has the
same memory type as the vkMapMemory mapping does, but the hva in
KVM_SET_USER_MEMORY_REGION points to the dmabuf's incoherent mapping.

If you think it is best for KVM to inspect the hva to determine the memory
type at page granularity, that is reasonable and should work for us too.
The userspace can do something on its side (e.g., add a GPU driver
dependency to the hypervisor so that the dma-buf is imported as GPU memory
and mapped using vkMapMemory), or I can work with the dma-buf maintainers
to see if dma-buf's semantics can be changed.

>
>>
>> >
>> > KVM_SET_DMA_BUF, if supported, is a signal to KVM that the guest
>> > memory type should be honored (or forced if there is a new op in
>> > dma_buf_ops that tells KVM which memory type to force).  KVM_MEM_DMA
>> > flag in this RFC sends the same signal.  Unless KVM_SET_DMA_BUF gives
>> > the userspace other features such as setting unlimited number of
>> > dmabufs to subregions of a memslot, it is not very useful.
>>
>> the good part of a new interface is its simplicity, but only in slot
>> granularity. instead having KVM to inspect hva can support page
>> granularity, but adding run-time overhead. Let's see how Paolo
>> thinks.
>>
>> >
>> > If uapi change is to be avoided, it is the easiest that guest memory
>> > type is always honored unless it causes #MC (i.e.,is_mmio==true).
>> >
>>
>> I feel this goes too far...
>>
>> Thanks
>> Kevin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-21  4:51                               ` Chia-I Wu
@ 2020-02-21  5:39                                 ` Tian, Kevin
  2020-02-21 15:59                                   ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2020-02-21  5:39 UTC (permalink / raw)
  To: Chia-I Wu
  Cc: Wanpeng Li, ML dri-devel, kvm list, Joerg Roedel, Christopherson,
	Sean J, Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini,
	Vitaly Kuznetsov, Jim Mattson

> From: Chia-I Wu <olvaffe@gmail.com>
> Sent: Friday, February 21, 2020 12:51 PM
> 
> (resend because gmail did not format to plain text...)
> 
> On Thu, Feb 20, 2020 at 8:45 PM Chia-I Wu <olvaffe@gmail.com> wrote:
> >
> >
> >
> > On Thu, Feb 20, 2020 at 4:23 PM Tian, Kevin <kevin.tian@intel.com> wrote:
> >>
> >> > From: Chia-I Wu <olvaffe@gmail.com>
> >> > Sent: Friday, February 21, 2020 6:24 AM
> >> >
> >> > On Wed, Feb 19, 2020 at 6:38 PM Tian, Kevin <kevin.tian@intel.com>
> wrote:
> >> > >
> >> > > > From: Tian, Kevin
> >> > > > Sent: Thursday, February 20, 2020 10:05 AM
> >> > > >
> >> > > > > From: Chia-I Wu <olvaffe@gmail.com>
> >> > > > > Sent: Thursday, February 20, 2020 3:37 AM
> >> > > > >
> >> > > > > On Wed, Feb 19, 2020 at 1:52 AM Tian, Kevin
> <kevin.tian@intel.com>
> >> > wrote:
> >> > > > > >
> >> > > > > > > From: Paolo Bonzini
> >> > > > > > > Sent: Wednesday, February 19, 2020 12:29 AM
> >> > > > > > >
> >> > > > > > > On 14/02/20 23:03, Sean Christopherson wrote:
> >> > > > > > > >> On Fri, Feb 14, 2020 at 1:47 PM Chia-I Wu
> <olvaffe@gmail.com>
> >> > > > wrote:
> >> > > > > > > >>> AFAICT, it is currently allowed on ARM (verified) and AMD
> (not
> >> > > > > > > >>> verified, but svm_get_mt_mask returns 0 which supposedly
> >> > means
> >> > > > > the
> >> > > > > > > NPT
> >> > > > > > > >>> does not restrict what the guest PAT can do).  This diff
> would do
> >> > the
> >> > > > > > > >>> trick for Intel without needing any uapi change:
> >> > > > > > > >> I would be concerned about Intel CPU errata such as SKX40
> and
> >> > > > SKX59.
> >> > > > > > > > The part KVM cares about, #MC, is already addressed by
> forcing
> >> > UC
> >> > > > for
> >> > > > > > > MMIO.
> >> > > > > > > > The data corruption issue is on the guest kernel to correctly
> use
> >> > WC
> >> > > > > > > > and/or non-temporal writes.
> >> > > > > > >
> >> > > > > > > What about coherency across live migration?  The userspace
> >> > process
> >> > > > > would
> >> > > > > > > use cached accesses, and also a WBINVD could potentially
> corrupt
> >> > guest
> >> > > > > > > memory.
> >> > > > > > >
> >> > > > > >
> >> > > > > > In such case the userspace process possibly should conservatively
> use
> >> > > > > > UC mapping, as if for MMIO regions on a passthrough device.
> >> > However
> >> > > > > > there remains a problem. the definition of KVM_MEM_DMA
> implies
> >> > > > > > favoring guest setting, which could be whatever type in concept.
> Then
> >> > > > > > assuming UC is also problematic. I'm not sure whether inventing
> >> > another
> >> > > > > > interface to query effective memory type from KVM is a good
> idea.
> >> > There
> >> > > > > > is no guarantee that the guest will use same type for every page
> in the
> >> > > > > > same slot, then such interface might be messy. Alternatively,
> maybe
> >> > > > > > we could just have an interface for KVM userspace to force
> memory
> >> > type
> >> > > > > > for a given slot, if it is mainly used in para-virtualized scenarios
> (e.g.
> >> > > > > > virtio-gpu) where the guest is enlightened to use a forced type
> (e.g.
> >> > WC)?
> >> > > > > KVM forcing the memory type for a given slot should work too.  But
> the
> >> > > > > ignore-guest-pat bit seems to be Intel-specific.  We will need to
> >> > > > > define how the second-level page attributes combine with the
> guest
> >> > > > > page attributes somehow.
> >> > > >
> >> > > > oh, I'm not aware of that difference. without an ipat-equivalent
> >> > > > capability, I'm not sure how to forcing random type here. If you look
> at
> >> > > > table 11-7 in Intel SDM, none of MTRR (EPT) memory type can lead
> to
> >> > > > consistent effective type when combining with random PAT value. So
> >> > > >  it is definitely a dead end.
> >> > > >
> >> > > > >
> >> > > > > KVM should in theory be able to tell that the userspace region is
> >> > > > > mapped with a certain memory type and can force the same
> memory
> >> > type
> >> > > > > onto the guest.  The userspace does not need to be involved.  But
> that
> >> > > > > sounds very slow?  This may be a dumb question, but would it help
> to
> >> > > > > add KVM_SET_DMA_BUF and let KVM negotiate the memory type
> with
> >> > the
> >> > > > > in-kernel GPU drivers?
> >> > > > >
> >> > > > >
> >> > > >
> >> > > > KVM_SET_DMA_BUF looks more reasonable. But I guess we don't
> need
> >> > > > KVM to be aware of such negotiation. We can continue your original
> >> > > > proposal to have KVM simply favor guest memory type (maybe still
> call
> >> > > > KVM_MEM_DMA). On the other hand, Qemu should just mmap on
> the
> >> > > > fd handle of the dmabuf passed from the virtio-gpu device backend,
> e.g.
> >> > > > to conduct migration. That way the mmap request is finally served by
> >> > > > DRM and underlying GPU drivers, with proper type enforced
> >> > automatically.
> >> > > >
> >> > >
> >> > > Thinking more possibly we don't need introduce new interface to KVM.
> >> > > As long as Qemu uses dmabuf interface to mmap the specific region,
> >> > > KVM can simply check memory type in host page table given hva of a
> >> > > memslot. If the type is UC or WC, it implies that userspace wants a
> >> > > non-coherent mapping which should be reflected in the guest side too.
> >> > > In such case, KVM can go to non-coherent DMA path and favor guest
> >> > > memory type automatically.
> >> > Sorry, I mixed two things together.
> >> >
> >> > Userspace access to dmabuf mmap must be guarded by
> >> > DMA_BUF_SYNC_{START,END} ioctls.  It is possible that the GPU driver
> >> > always picks a WB mapping and let the ioctls flush/invalidate CPU
> >> > caches.  We actually want the guest memory type to match
> vkMapMemory's
> >> > memory type, which can be different from dmabuf mmap's memory
> type.
> >> > It is not enough for KVM to inspect the hva's memory type.
> >>
> >> I'm not familiar with dmabuf and what is the difference between
> >> vkMapMemory and mmap. Just a simple thought that whatever
> >> memory type/synchronization enforced on the host userspace should
> >> ideally be applied to guest userspace too. e.g. in above example we
> >> possibly want the guest to use WB and issue flush/invalidate hypercalls
> >> to guard with other potential parallel operations in the host side.
> >> otherwise I cannot see how synchronization can be done when one
> >> use WB with sync primitives while the other simply use WC w/o such
> >> primitives.
> >
> I am reasonably familiar with the GPU stacks, but I am not familiar with KVM
> :)
> 
> When allocating a GPU memory, the userspace can specify whether it wants
> a
> coherent one or an incoherent one.  vkMapMemory returns a coherent or a
> incoherent mapping respectively.  Indeed we also want the guest userspace
> to
> have a coherent or a incoherent mapping respectively.
> 
> The GPU memory can be exported as a dmabuf to share with another device
> or
> process.  For security, we allocate the GPU memory in a GPU process and we
> export the dmabuf to the hypervisor.  mmap of dmabuf semantically returns
> an
> incoherent mapping.  As a result, the guest will set up a mapping that has the
> same memory type as the vkMapMemory mapping does, but the hva in
> KVM_SET_USER_MEMORY_REGION points to the dmabuf's incoherent
> mapping.
> 
> If you think it is the best for KVM to inspect hva to determine the memory
> type with page granularity, that is reasonable and should work for us too.
> The userspace can do something (e.g., add a GPU driver dependency to the
> hypervisor such that the dma-buf is imported as a GPU memory and mapped
> using
> vkMapMemory) or I can work with dma-buf maintainers to see if dma-buf's
> semantics can be changed.

I think you need to consider the live migration requirement that Paolo
pointed out.  The migration thread needs to read/write the region, so it
must use the same memory type as the GPU process and the guest do.  In
that case, the hva mapped by Qemu should have the same type as the
guest's.  However, adding a GPU driver dependency to Qemu might raise
concerns.  I'm not sure whether there is a generic mechanism to share a
dmabuf fd between the GPU process and Qemu while allowing Qemu to follow
the desired type w/o using vkMapMemory...

Note this is orthogonal to whether we introduce a new uapi or implicitly
check the hva to favor the guest memory type; it's purely about Qemu
itself.  Ideally anyone who wants to access a dma-buf object should
follow the expected semantics.  It's interesting that the dma-buf
subsystem doesn't provide centralized synchronization of the memory type
across its multiple mmap paths.

> 
> >
> >>
> >> >
> >> > KVM_SET_DMA_BUF, if supported, is a signal to KVM that the guest
> >> > memory type should be honored (or forced if there is a new op in
> >> > dma_buf_ops that tells KVM which memory type to force).
> KVM_MEM_DMA
> >> > flag in this RFC sends the same signal.  Unless KVM_SET_DMA_BUF gives
> >> > the userspace other features such as setting unlimited number of
> >> > dmabufs to subregions of a memslot, it is not very useful.
> >>
> >> the good part of a new interface is its simplicity, but only in slot
> >> granularity. instead having KVM to inspect hva can support page
> >> granularity, but adding run-time overhead. Let's see how Paolo
> >> thinks.
> >>
> >> >
> >> > If uapi change is to be avoided, it is the easiest that guest memory
> >> > type is always honored unless it causes #MC (i.e.,is_mmio==true).
> >> >
> >>
> >> I feel this goes too far...
> >>
> >> Thanks
> >> Kevin

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-21  5:39                                 ` Tian, Kevin
@ 2020-02-21 15:59                                   ` Sean Christopherson
  2020-02-21 18:21                                     ` Chia-I Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2020-02-21 15:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Wanpeng Li, kvm list, Joerg Roedel, ML dri-devel,
	Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson

On Thu, Feb 20, 2020 at 09:39:05PM -0800, Tian, Kevin wrote:
> > From: Chia-I Wu <olvaffe@gmail.com>
> > Sent: Friday, February 21, 2020 12:51 PM
> > If you think it is the best for KVM to inspect hva to determine the memory
> > type with page granularity, that is reasonable and should work for us too.
> > The userspace can do something (e.g., add a GPU driver dependency to the
> > hypervisor such that the dma-buf is imported as a GPU memory and mapped
> > using
> > vkMapMemory) or I can work with dma-buf maintainers to see if dma-buf's
> > semantics can be changed.
> 
> I think you need consider the live migration requirement as Paolo pointed out.
> The migration thread needs to read/write the region, then it must use the
> same type as GPU process and guest to read/write the region. In such case, 
> the hva mapped by Qemu should have the desired type as the guest. However,
> adding GPU driver dependency to Qemu might trigger some concern. I'm not
> sure whether there is generic mechanism though, to share dmabuf fd between GPU
> process and Qemu while allowing Qemu to follow the desired type w/o using
> vkMapMemory...

Alternatively, KVM could make KVM_MEM_DMA and KVM_MEM_LOG_DIRTY_PAGES
mutually exclusive, i.e. force a transition to WB memtype for the guest
(with appropriate zapping) when migration is activated.  I think that
would work?

> Note this is orthogonal to whether introducing a new uapi or implicitly checking
> hva to favor guest memory type. It's purely about Qemu itself. Ideally anyone 
> with the desire to access a dma-buf object should follow the expected semantics.
> It's interesting that dma-buf sub-system doesn't provide a centralized 
> synchronization about memory type between multiple mmap paths. 

* Re: [RFC PATCH 0/3] KVM: x86: honor guest memory type
  2020-02-21 15:59                                   ` Sean Christopherson
@ 2020-02-21 18:21                                     ` Chia-I Wu
  0 siblings, 0 replies; 30+ messages in thread
From: Chia-I Wu @ 2020-02-21 18:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Tian, Kevin, Wanpeng Li, kvm list, Joerg Roedel, ML dri-devel,
	Gurchetan Singh, Gerd Hoffmann, Paolo Bonzini, Vitaly Kuznetsov,
	Jim Mattson

On Fri, Feb 21, 2020 at 7:59 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Feb 20, 2020 at 09:39:05PM -0800, Tian, Kevin wrote:
> > > From: Chia-I Wu <olvaffe@gmail.com>
> > > Sent: Friday, February 21, 2020 12:51 PM
> > > If you think it is the best for KVM to inspect hva to determine the memory
> > > type with page granularity, that is reasonable and should work for us too.
> > > The userspace can do something (e.g., add a GPU driver dependency to the
> > > hypervisor such that the dma-buf is imported as a GPU memory and mapped
> > > using
> > > vkMapMemory) or I can work with dma-buf maintainers to see if dma-buf's
> > > semantics can be changed.
> >
> > I think you need consider the live migration requirement as Paolo pointed out.
> > The migration thread needs to read/write the region, then it must use the
> > same type as GPU process and guest to read/write the region. In such case,
> > the hva mapped by Qemu should have the desired type as the guest. However,
> > adding GPU driver dependency to Qemu might trigger some concern. I'm not
> > sure whether there is generic mechanism though, to share dmabuf fd between GPU
> > process and Qemu while allowing Qemu to follow the desired type w/o using
> > vkMapMemory...
>
> Alternatively, KVM could make KVM_MEM_DMA and KVM_MEM_LOG_DIRTY_PAGES
> mutually exclusive, i.e. force a transition to WB memtype for the guest
> (with appropriate zapping) when migration is activated.  I think that
> would work?
Hm, virtio-gpu does not allow live migration when the 3D function
(virgl=on) is enabled.  This is the relevant code in qemu:

    if (virtio_gpu_virgl_enabled(g->conf)) {
        error_setg(&g->migration_blocker, "virgl is not yet migratable");

Although we (virtio-gpu and virglrenderer projects) plan to make host
GPU buffers available to the guest via memslots, those buffers should
be considered a part of the "GPU state".  The migration thread should
work with virglrenderer and let virglrenderer save/restore them, if
live migration is to be supported.

QEMU depends on GPU drivers already when configured with
--enable-virglrenderer.  There is vhost-user-gpu that can move the
dependency to a GPU process.  But there are still going to be cases
(e.g., nVidia's proprietary driver does not support dma-buf) where
QEMU cannot avoid GPU driver dependency.
> > Note this is orthogonal to whether introducing a new uapi or implicitly checking
> > hva to favor guest memory type. It's purely about Qemu itself. Ideally anyone
> > with the desire to access a dma-buf object should follow the expected semantics.
> > It's interesting that dma-buf sub-system doesn't provide a centralized
> > synchronization about memory type between multiple mmap paths.

end of thread, back to index

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-13 21:30 [RFC PATCH 0/3] KVM: x86: honor guest memory type Chia-I Wu
2020-02-13 21:30 ` [RFC PATCH 1/3] KVM: vmx: rewrite the comment in vmx_get_mt_mask Chia-I Wu
2020-02-14  9:36   ` Paolo Bonzini
2020-02-13 21:30 ` [RFC PATCH 2/3] RFC: KVM: add KVM_MEM_DMA Chia-I Wu
2020-02-13 21:30 ` [RFC PATCH 3/3] RFC: KVM: x86: support KVM_CAP_DMA_MEM Chia-I Wu
2020-02-13 21:41 ` [RFC PATCH 0/3] KVM: x86: honor guest memory type Paolo Bonzini
2020-02-13 22:18   ` Chia-I Wu
2020-02-14 10:26     ` Paolo Bonzini
2020-02-14 19:52       ` Sean Christopherson
2020-02-14 21:47         ` Chia-I Wu
2020-02-14 21:56           ` Jim Mattson
2020-02-14 22:03             ` Sean Christopherson
2020-02-18 16:28               ` Paolo Bonzini
2020-02-18 22:58                 ` Sean Christopherson
2020-02-19  9:52                 ` Tian, Kevin
2020-02-19 19:36                   ` Chia-I Wu
2020-02-20  2:04                     ` Tian, Kevin
2020-02-20  2:38                       ` Tian, Kevin
2020-02-20 22:23                         ` Chia-I Wu
2020-02-21  0:23                           ` Tian, Kevin
2020-02-21  4:45                             ` Chia-I Wu
2020-02-21  4:51                               ` Chia-I Wu
2020-02-21  5:39                                 ` Tian, Kevin
2020-02-21 15:59                                   ` Sean Christopherson
2020-02-21 18:21                                     ` Chia-I Wu
2020-02-14 21:15       ` Chia-I Wu
2020-02-19 10:00         ` Tian, Kevin
2020-02-19 19:18           ` Chia-I Wu
2020-02-20  2:13             ` Tian, Kevin
2020-02-20 23:02               ` Chia-I Wu
