* [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
@ 2015-05-13 11:31 ` Andrew Jones
  0 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-13 11:31 UTC (permalink / raw)
  To: kvmarm, qemu-devel, ard.biesheuvel, christoffer.dall,
	marc.zyngier, peter.maydell, pbonzini, agraf
  Cc: catalin.marinas, j.fanguede, lersek, m.smarduch

Introduce a new memory region flag, KVM_MEM_UNCACHED, which is
needed by ARM. This flag informs KVM that the given memory region
is typically mapped by the guest as non-cacheable. KVM for ARM
then ensures that that memory is indeed mapped non-cacheable by
the guest, and also remaps that region as non-cacheable for
userspace, allowing them both to maintain a coherent view.

Changes since v1:
 1) don't pin pages [Paolo]
 2) ensure the guest maps the memory non-cacheable [me]
 3) clean up memslot flag documentation [Christoffer]
Changes 1 and 2 effectively redesigned/rewrote v1. Find v1 here
http://www.spinics.net/lists/kvm-arm/msg14022.html

The QEMU series for v1 hasn't really changed. Only the Linux
header hack needed an update, to bump KVM_CAP_UNCACHED_MEM from
107 to 116.  Find the series here
http://www.spinics.net/lists/kvm-arm/msg14026.html

Testing:
This series still needs lots of testing, but I thought I'd
kick it to the list early, as there's been recent interest
in solving this problem, and I'd like to get test results
and opinions on this approach from others sooner rather than later.
I've tested with AAVMF (UEFI for AArch64 mach-virt guests).
AAVMF has a kludge in it to avoid the coherency problem.
I've tested both with and without that kludge active. Both
worked for me (almost). Sometimes with the non-kludged
version I was still able to see a bit of corruption in
grub's output after edk2 loaded it - not much, and not always,
but something. Anyway, it's quite frustrating, as I'm not sure
what I'm missing...

This series applies to Linus' 110bc76729d4, but I tested with
a version backported to the current RHELSA kernel.

Thanks for reviews and testing!

drew


Andrew Jones (3):
  arm/arm64: pageattr: add set_memory_nc
  KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
  arm/arm64: KVM: implement 'uncached' mem coherency

 Documentation/virtual/kvm/api.txt     | 20 ++++++++++++------
 arch/arm/include/asm/cacheflush.h     |  1 +
 arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
 arch/arm/include/asm/pgtable-3level.h |  1 +
 arch/arm/include/asm/pgtable.h        |  1 +
 arch/arm/include/uapi/asm/kvm.h       |  1 +
 arch/arm/kvm/arm.c                    |  1 +
 arch/arm/kvm/mmu.c                    | 39 ++++++++++++++++++++++-------------
 arch/arm/mm/pageattr.c                |  7 +++++++
 arch/arm64/include/asm/cacheflush.h   |  1 +
 arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
 arch/arm64/include/asm/memory.h       |  1 +
 arch/arm64/include/asm/pgtable.h      |  1 +
 arch/arm64/include/uapi/asm/kvm.h     |  1 +
 arch/arm64/mm/pageattr.c              |  8 +++++++
 include/linux/kvm_host.h              |  1 -
 include/uapi/linux/kvm.h              |  2 ++
 virt/kvm/kvm_main.c                   |  7 ++++++-
 18 files changed, 79 insertions(+), 24 deletions(-)

-- 
2.1.0


* [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-13 11:31 ` Andrew Jones
@ 2015-05-13 11:31   ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-13 11:31 UTC (permalink / raw)
  To: kvmarm, qemu-devel, ard.biesheuvel, christoffer.dall,
	marc.zyngier, peter.maydell, pbonzini, agraf
  Cc: catalin.marinas, j.fanguede, lersek, m.smarduch

Provide a method to change normal, cacheable memory to non-cacheable.
KVM will make use of this to keep emulated device memory regions
coherent with the guest.
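
As a rough illustration of what set_memory_nc asks change_memory_common
to do (clear the memory-attribute field of a PTE, then set the
non-cacheable index), here is a standalone sketch of the arm64 case.
The SKETCH_* names and the numeric index values are assumptions made
for this example, not definitions from this patch, and the real helper
walks kernel page tables rather than operating on a single integer.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative arm64-style encodings; assumed for this sketch only. */
#define SKETCH_ATTRINDX(t)	((uint64_t)(t) << 2)
#define SKETCH_ATTRINDX_MASK	SKETCH_ATTRINDX(7)
#define SKETCH_MT_NORMAL_NC	3
#define SKETCH_MT_NORMAL	4

/* Clear the bits in clear_mask, then set the bits in set_mask. */
static uint64_t sketch_change_pte(uint64_t pte, uint64_t set_mask,
				  uint64_t clear_mask)
{
	pte &= ~clear_mask;
	pte |= set_mask;
	return pte;
}

/* Model of set_memory_nc applied to one PTE value. */
static uint64_t sketch_set_pte_nc(uint64_t pte)
{
	return sketch_change_pte(pte,
				 SKETCH_ATTRINDX(SKETCH_MT_NORMAL_NC),
				 SKETCH_ATTRINDX_MASK);
}

static void sketch_nc_demo(void)
{
	/* A made-up PTE: low bits 0x3 plus the cacheable attribute index. */
	uint64_t cached = 0x3 | SKETCH_ATTRINDX(SKETCH_MT_NORMAL);
	uint64_t nc = sketch_set_pte_nc(cached);

	/* The attribute field now selects NC... */
	assert((nc & SKETCH_ATTRINDX_MASK) ==
	       SKETCH_ATTRINDX(SKETCH_MT_NORMAL_NC));
	/* ...and every other bit is untouched. */
	assert((nc & ~SKETCH_ATTRINDX_MASK) ==
	       (cached & ~SKETCH_ATTRINDX_MASK));
}
```

Passing the full attribute mask as the clear mask is what makes the
operation independent of the page's previous memory type.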

Signed-off-by: Andrew Jones <drjones@redhat.com>
---
 arch/arm/include/asm/cacheflush.h   | 1 +
 arch/arm/mm/pageattr.c              | 7 +++++++
 arch/arm64/include/asm/cacheflush.h | 1 +
 arch/arm64/mm/pageattr.c            | 8 ++++++++
 4 files changed, 17 insertions(+)

diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index 2d46862e7bef7..682a8b13d6019 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -486,6 +486,7 @@ int set_memory_ro(unsigned long addr, int numpages);
 int set_memory_rw(unsigned long addr, int numpages);
 int set_memory_x(unsigned long addr, int numpages);
 int set_memory_nx(unsigned long addr, int numpages);
+int set_memory_nc(unsigned long addr, int numpages);
 
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
diff --git a/arch/arm/mm/pageattr.c b/arch/arm/mm/pageattr.c
index cf30daff89325..9f9f752cab871 100644
--- a/arch/arm/mm/pageattr.c
+++ b/arch/arm/mm/pageattr.c
@@ -92,3 +92,10 @@ int set_memory_x(unsigned long addr, int numpages)
 					__pgprot(0),
 					__pgprot(L_PTE_XN));
 }
+
+int set_memory_nc(unsigned long addr, int numpages)
+{
+	return change_memory_common(addr, numpages,
+					__pgprot(L_PTE_MT_BUFFERABLE),
+					__pgprot(L_PTE_MT_MASK));
+}
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 67d309cc3b6b8..ef671f38c19ad 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -152,6 +152,7 @@ int set_memory_ro(unsigned long addr, int numpages);
 int set_memory_rw(unsigned long addr, int numpages);
 int set_memory_x(unsigned long addr, int numpages);
 int set_memory_nx(unsigned long addr, int numpages);
+int set_memory_nc(unsigned long addr, int numpages);
 
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index e47ed1c5dce1b..c837adcf26fc6 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -96,3 +96,11 @@ int set_memory_x(unsigned long addr, int numpages)
 					__pgprot(PTE_PXN));
 }
 EXPORT_SYMBOL_GPL(set_memory_x);
+
+int set_memory_nc(unsigned long addr, int numpages)
+{
+	return change_memory_common(addr, numpages,
+					__pgprot(PTE_ATTRINDX(MT_NORMAL_NC)),
+					__pgprot(PTE_ATTRINDX_MASK));
+}
+EXPORT_SYMBOL_GPL(set_memory_nc);
-- 
2.1.0


* [Qemu-devel] [RFC/RFT PATCH v2 2/3] KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
  2015-05-13 11:31 ` Andrew Jones
@ 2015-05-13 11:31   ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-13 11:31 UTC (permalink / raw)
  To: kvmarm, qemu-devel, ard.biesheuvel, christoffer.dall,
	marc.zyngier, peter.maydell, pbonzini, agraf
  Cc: catalin.marinas, j.fanguede, lersek, m.smarduch

Commit 1050dcda30529 introduced KVM_MEMSLOT_INCOHERENT to flag memory
regions that may have coherency issues due to mapping host system RAM
as non-cacheable. This was introduced as a KVM internal flag; now
expose it to KVM userspace so that userspace can use it to hint which
regions are likely to be problematic. Also rename it to KVM_MEM_UNCACHED.
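
To make the resulting userspace-visible rules concrete, the following
standalone sketch models the two checks this patch touches: rejecting
flags the architecture does not advertise, and refusing to change
READONLY or UNCACHED on an existing slot. The sk_* names and the
boolean "have" parameters are hypothetical stand-ins for the
__KVM_HAVE_* defines; this is not the kernel code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit positions as the uapi flags in this patch. */
#define SK_MEM_LOG_DIRTY_PAGES	(1u << 0)
#define SK_MEM_READONLY		(1u << 1)
#define SK_MEM_UNCACHED		(1u << 2)

/* Reject any flag bit the (hypothetical) architecture does not support. */
static int sk_check_flags(uint32_t flags, bool have_readonly,
			  bool have_uncached)
{
	uint32_t valid = SK_MEM_LOG_DIRTY_PAGES;

	if (have_readonly)
		valid |= SK_MEM_READONLY;
	if (have_uncached)
		valid |= SK_MEM_UNCACHED;

	return (flags & ~valid) ? -EINVAL : 0;
}

/* READONLY and UNCACHED can only be chosen when the slot is created. */
static int sk_check_modify(uint32_t old_flags, uint32_t new_flags)
{
	const uint32_t fixed = SK_MEM_READONLY | SK_MEM_UNCACHED;

	return ((old_flags ^ new_flags) & fixed) ? -EINVAL : 0;
}

static void sk_flags_demo(void)
{
	/* UNCACHED is rejected unless the arch advertises it. */
	assert(sk_check_flags(SK_MEM_UNCACHED, true, false) == -EINVAL);
	assert(sk_check_flags(SK_MEM_READONLY | SK_MEM_UNCACHED,
			      true, true) == 0);
	/* Dirty logging may be toggled on an existing slot... */
	assert(sk_check_modify(SK_MEM_UNCACHED,
			       SK_MEM_UNCACHED | SK_MEM_LOG_DIRTY_PAGES) == 0);
	/* ...but UNCACHED may not be added after creation. */
	assert(sk_check_modify(0, SK_MEM_UNCACHED) == -EINVAL);
}
```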

Signed-off-by: Andrew Jones <drjones@redhat.com>
---
 Documentation/virtual/kvm/api.txt | 20 ++++++++++++++------
 arch/arm/include/uapi/asm/kvm.h   |  1 +
 arch/arm/kvm/arm.c                |  1 +
 arch/arm/kvm/mmu.c                |  4 ++--
 arch/arm64/include/uapi/asm/kvm.h |  1 +
 include/linux/kvm_host.h          |  1 -
 include/uapi/linux/kvm.h          |  2 ++
 virt/kvm/kvm_main.c               |  7 ++++++-
 8 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 9fa2bf8c3f6f1..04ffd9f5db5f2 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -905,6 +905,7 @@ struct kvm_userspace_memory_region {
 /* for kvm_memory_region::flags */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_UNCACHED	(1UL << 2)
 
 This ioctl allows the user to create or modify a guest physical memory
 slot.  When changing an existing slot, it may be moved in the guest
@@ -920,12 +921,19 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The flags field supports the following flags. All flags may be used in
+combination.
+  - KVM_MEM_LOG_DIRTY_PAGES: no capability required
+      Set this to instruct KVM to keep track of writes to memory within
+      the slot. (See the KVM_GET_DIRTY_LOG ioctl)
+  - KVM_MEM_READONLY: depends on capability KVM_CAP_READONLY_MEM
+      Set this to make a new slot read-only. In this case, writes to
+      this memory will be posted to userspace as KVM_EXIT_MMIO exits.
+      EINVAL will be returned if the slot is not new.
+  - KVM_MEM_UNCACHED: depends on capability KVM_CAP_UNCACHED_MEM
+      Set this to make a new slot uncached, i.e. userspace will always
+      directly read/write RAM for this memory region. EINVAL will be
+      returned if the slot is not new.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
index df3f60cb1168a..893a331655115 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -26,6 +26,7 @@
 #define __KVM_HAVE_GUEST_DEBUG
 #define __KVM_HAVE_IRQ_LINE
 #define __KVM_HAVE_READONLY_MEM
+#define __KVM_HAVE_UNCACHED_MEM
 
 #define KVM_REG_SIZE(id)						\
 	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index d9631ecddd56e..cbb532de9c8b5 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -182,6 +182,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_PSCI_0_2:
 	case KVM_CAP_READONLY_MEM:
 	case KVM_CAP_MP_STATE:
+	case KVM_CAP_UNCACHED_MEM:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 1d5accbd3dcf2..bc1665acd73e7 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1307,7 +1307,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (!hugetlb && !force_pte)
 		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
 
-	fault_ipa_uncached = memslot->flags & KVM_MEMSLOT_INCOHERENT;
+	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
 
 	if (hugetlb) {
 		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
@@ -1834,7 +1834,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 	 * regions as incoherent.
 	 */
 	if (slot->flags & KVM_MEM_READONLY)
-		slot->flags |= KVM_MEMSLOT_INCOHERENT;
+		slot->flags |= KVM_MEM_UNCACHED;
 	return 0;
 }
 
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index d26832022127e..be46855ca01b7 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -38,6 +38,7 @@
 #define __KVM_HAVE_GUEST_DEBUG
 #define __KVM_HAVE_IRQ_LINE
 #define __KVM_HAVE_READONLY_MEM
+#define __KVM_HAVE_UNCACHED_MEM
 
 #define KVM_REG_SIZE(id)						\
 	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ad45054309a0f..74d6a14529f53 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -39,7 +39,6 @@
  * include/linux/kvm_h.
  */
 #define KVM_MEMSLOT_INVALID	(1UL << 16)
-#define KVM_MEMSLOT_INCOHERENT	(1UL << 17)
 
 /* Two fragments for cross MMIO pages. */
 #define KVM_MAX_MMIO_FRAGMENTS	2
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4b60056776d14..ceb6e805e5f77 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -108,6 +108,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_UNCACHED	(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -814,6 +815,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_INJECT_IRQ 113
 #define KVM_CAP_S390_IRQ_STATE 114
 #define KVM_CAP_PPC_HWRNG 115
+#define KVM_CAP_UNCACHED_MEM 116
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 90977418aeb6e..d1d535d80a075 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -719,6 +719,10 @@ static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+#ifdef __KVM_HAVE_UNCACHED_MEM
+	valid_flags |= KVM_MEM_UNCACHED;
+#endif
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
 
@@ -816,7 +820,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		else { /* Modify an existing slot. */
 			if ((mem->userspace_addr != old.userspace_addr) ||
 			    (npages != old.npages) ||
-			    ((new.flags ^ old.flags) & KVM_MEM_READONLY))
+			    ((new.flags ^ old.flags) &
+			     (KVM_MEM_READONLY | KVM_MEM_UNCACHED)))
 				goto out;
 
 			if (base_gfn != old.base_gfn)
-- 
2.1.0


* [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
  2015-05-13 11:31 ` Andrew Jones
@ 2015-05-13 11:31   ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-13 11:31 UTC (permalink / raw)
  To: kvmarm, qemu-devel, ard.biesheuvel, christoffer.dall,
	marc.zyngier, peter.maydell, pbonzini, agraf
  Cc: catalin.marinas, j.fanguede, lersek, m.smarduch

When S1 and S2 memory attributes combine with respect to caching
policy, non-cacheable types take precedence. If a guest maps a region as
device memory, which KVM userspace is using to emulate the device
using normal, cacheable memory, then we lose coherency. With
KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
regions are likely to be problematic. With this patch, as pages
of these types of regions are faulted into the guest, not only do
we flush the page's dcache, but we also change userspace's
mapping to NC in order to maintain coherency.

What if the guest doesn't do what we expect? While we can't
force a guest to use cacheable memory, we can take advantage of
the non-cacheable precedence, and force it to use non-cacheable.
So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
KVM_MEM_UNCACHED regions to force them to NC.
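
The precedence rule can be pictured with a small standalone model. It
is a deliberate simplification (one ordering from device up to normal
write-back, ignoring device subtypes and inner/outer attributes), and
the sk_* names are hypothetical, not kernel or architecture
identifiers.

```c
#include <assert.h>

/* Ordered from least to most cacheable; assumed ordering for this sketch. */
enum sk_memtype {
	SK_DEVICE = 0,
	SK_NORMAL_NC,
	SK_NORMAL_WT,
	SK_NORMAL_WB,
};

/* The combined S1+S2 type is the less cacheable of the two stages. */
static enum sk_memtype sk_combine(enum sk_memtype s1, enum sk_memtype s2)
{
	return s1 < s2 ? s1 : s2;
}

static void sk_combine_demo(void)
{
	/*
	 * Guest maps the region as device while S2 is write-back: the
	 * guest bypasses the cache, but a cacheable userspace alias does
	 * not, so the two views can diverge.
	 */
	assert(sk_combine(SK_DEVICE, SK_NORMAL_WB) == SK_DEVICE);

	/* With S2 forced to NC, a guest write-back mapping still ends up NC. */
	assert(sk_combine(SK_NORMAL_WB, SK_NORMAL_NC) == SK_NORMAL_NC);

	/* A guest device mapping remains device under an NC S2 mapping. */
	assert(sk_combine(SK_DEVICE, SK_NORMAL_NC) == SK_DEVICE);
}
```

In this model no S2 setting can make the result more cacheable than
what the guest chose at S1, which is why the only lever available is
to force the region down to non-cacheable.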

We now have both guest and userspace on the same page (pun intended).

Signed-off-by: Andrew Jones <drjones@redhat.com>
---
 arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
 arch/arm/include/asm/pgtable-3level.h |  1 +
 arch/arm/include/asm/pgtable.h        |  1 +
 arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
 arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
 arch/arm64/include/asm/memory.h       |  1 +
 arch/arm64/include/asm/pgtable.h      |  1 +
 7 files changed, 36 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 405aa18833073..e8034a80b12e5 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
 	while (size) {
 		void *va = kmap_atomic_pfn(pfn);
 
-		if (need_flush)
+		if (need_flush) {
 			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+			if (ipa_uncached)
+				set_memory_nc((unsigned long)va, 1);
+		}
 
 		if (icache_is_pipt())
 			__cpuc_coherent_user_range((unsigned long)va,
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index a745a2a53853c..39b3f7a40e663 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -121,6 +121,7 @@
  * 2nd stage PTE definitions for LPAE.
  */
 #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
+#define L_PTE_S2_MT_NORMAL_NC		(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
 #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
 #define L_PTE_S2_MT_WRITEBACK		(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
 #define L_PTE_S2_MT_DEV_SHARED		(_AT(pteval_t, 0x1) << 2) /* device */
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index f40354198bad4..ae13ca8b0a23d 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -100,6 +100,7 @@ extern pgprot_t		pgprot_s2_device;
 #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
 #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
 #define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
+#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
 #define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
 
 #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index bc1665acd73e7..6b3bd8061bd2a 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct vm_area_struct *vma;
 	pfn_t pfn;
 	pgprot_t mem_type = PAGE_S2;
-	bool fault_ipa_uncached;
+	bool fault_ipa_uncached = false;
 	bool logging_active = memslot_is_logging(memslot);
 	unsigned long flags = 0;
 
@@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			writable = false;
 	}
 
+	if (memslot->flags & KVM_MEM_UNCACHED) {
+		mem_type = PAGE_S2_NORMAL_NC;
+		fault_ipa_uncached = true;
+	}
+
 	spin_lock(&kvm->mmu_lock);
 	if (mmu_notifier_retry(kvm, mmu_seq))
 		goto out_unlock;
@@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (!hugetlb && !force_pte)
 		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
 
-	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
-
 	if (hugetlb) {
 		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
 		new_pmd = pmd_mkhuge(new_pmd);
@@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
 			     unsigned long start,
 			     unsigned long end,
 			     int (*handler)(struct kvm *kvm,
+					    struct kvm_memory_slot *slot,
 					    gpa_t gpa, void *data),
 			     void *data)
 {
@@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
 
 		for (; gfn < gfn_end; ++gfn) {
 			gpa_t gpa = gfn << PAGE_SHIFT;
-			ret |= handler(kvm, gpa, data);
+			ret |= handler(kvm, memslot, gpa, data);
 		}
 	}
 
 	return ret;
 }
 
-static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
+				 gpa_t gpa, void *data)
 {
 	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
 	return 0;
@@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 	return 0;
 }
 
-static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
+				gpa_t gpa, void *data)
 {
-	pte_t *pte = (pte_t *)data;
+	pte_t pte = *((pte_t *)data);
+
+	if (slot->flags & KVM_MEM_UNCACHED)
+		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
+	else
+		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
 
 	/*
 	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
@@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
 	 * therefore stage2_set_pte() never needs to clear out a huge PMD
 	 * through this calling path.
 	 */
-	stage2_set_pte(kvm, NULL, gpa, pte, 0);
+	stage2_set_pte(kvm, NULL, gpa, &pte, 0);
 	return 0;
 }
 
@@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
 	unsigned long end = hva + PAGE_SIZE;
-	pte_t stage2_pte;
 
 	if (!kvm->arch.pgd)
 		return;
 
 	trace_kvm_set_spte_hva(hva);
-	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
-	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
+	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
 }
 
-static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
+			       gpa_t gpa, void *data)
 {
 	pmd_t *pmd;
 	pte_t *pte;
@@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
 	return 0;
 }
 
-static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
+				    gpa_t gpa, void *data)
 {
 	pmd_t *pmd;
 	pte_t *pte;
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 61505676d0853..af5f0f0eccef9 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
 {
 	void *va = page_address(pfn_to_page(pfn));
 
-	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
+	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
 		kvm_flush_dcache_to_poc(va, size);
+		if (ipa_uncached)
+			set_memory_nc((unsigned long)va, size/PAGE_SIZE);
+	}
 
 	if (!icache_is_aliasing()) {		/* PIPT */
 		flush_icache_range((unsigned long)va,
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index f800d45ea2265..800730f7aa7d9 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -105,6 +105,7 @@
  * Memory types for Stage-2 translation
  */
 #define MT_S2_NORMAL		0xf
+#define MT_S2_NORMAL_NC		0x5
 #define MT_S2_DEVICE_nGnRE	0x1
 
 #ifndef __ASSEMBLY__
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 56283f8a675c5..a254925ce1b6b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -78,6 +78,7 @@ extern void __pgd_error(const char *file, int line, unsigned long val);
 #define PAGE_HYP_DEVICE		__pgprot(PROT_DEVICE_nGnRE | PTE_HYP)
 
 #define PAGE_S2			__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)
+#define PAGE_S2_NORMAL_NC	__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL_NC) | PTE_S2_RDONLY)
 #define PAGE_S2_DEVICE		__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_UXN)
 
 #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_TYPE_MASK) | PTE_PROT_NONE | PTE_PXN | PTE_UXN)
-- 
2.1.0

* [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
@ 2015-05-13 11:31   ` Andrew Jones
  0 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-13 11:31 UTC (permalink / raw)
  To: kvmarm, qemu-devel, ard.biesheuvel, christoffer.dall,
	marc.zyngier, peter.maydell, pbonzini, agraf
  Cc: catalin.marinas, lersek

When S1 and S2 memory attributes combine with respect to caching
policy, non-cacheable types take precedence. If a guest maps a region
as device memory while KVM userspace emulates that device using
normal, cacheable memory, then we lose coherency. With
KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
regions are likely to be problematic. With this patch, as pages
of these types of regions are faulted into the guest, not only do
we flush the page's dcache, but we also change userspace's
mapping to NC in order to maintain coherency.

What if the guest doesn't do what we expect? While we can't
force a guest to use cacheable memory, we can take advantage of
the non-cacheable precedence, and force it to use non-cacheable.
So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
KVM_MEM_UNCACHED regions to force them to NC.

We now have both guest and userspace on the same page (pun intended)
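
As an illustration of the precedence rule described above, here is a
minimal sketch in plain C. It is a simplified model, not kernel code:
the enum mem_type values and the helpers combine_s1_s2(), is_cacheable()
and views_agree() are hypothetical names, and real hardware
distinguishes more cases (inner and outer attributes, several device
types).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified, hypothetical memory types, ordered from most cacheable to
 * least cacheable. This ordering is an assumption of the model.
 */
enum mem_type {
	MEM_NORMAL_WB,	/* normal, write-back cacheable */
	MEM_NORMAL_WT,	/* normal, write-through cacheable */
	MEM_NORMAL_NC,	/* normal, non-cacheable */
	MEM_DEVICE,	/* device memory */
};

/* The less cacheable of the stage-1 and stage-2 types wins. */
static enum mem_type combine_s1_s2(enum mem_type s1, enum mem_type s2)
{
	assert(s1 <= MEM_DEVICE && s2 <= MEM_DEVICE);
	return s1 > s2 ? s1 : s2;
}

static bool is_cacheable(enum mem_type t)
{
	return t == MEM_NORMAL_WB || t == MEM_NORMAL_WT;
}

/*
 * Crude coherency criterion for this model: the guest's effective type
 * and userspace's mapping must agree on whether accesses are cached.
 */
static bool views_agree(enum mem_type guest_s1, enum mem_type s2,
			enum mem_type userspace)
{
	return is_cacheable(combine_s1_s2(guest_s1, s2)) ==
	       is_cacheable(userspace);
}
```

In this model, a guest device mapping over a write-back S2 mapping
disagrees with a cacheable userspace mapping. Forcing S2 to
non-cacheable and remapping userspace as non-cacheable makes the two
views agree whatever the guest chooses at S1.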

Signed-off-by: Andrew Jones <drjones@redhat.com>
---
 arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
 arch/arm/include/asm/pgtable-3level.h |  1 +
 arch/arm/include/asm/pgtable.h        |  1 +
 arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
 arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
 arch/arm64/include/asm/memory.h       |  1 +
 arch/arm64/include/asm/pgtable.h      |  1 +
 7 files changed, 36 insertions(+), 15 deletions(-)
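
For readers who want the attribute selection in isolation, the
following is a rough userspace sketch of what the mmu.c and pgtable
hunks of this patch amount to. The MT_S2_* values and the shift by 2
are taken from the patch and its context, and KVM_MEM_UNCACHED is the
flag proposed in patch 2/3. The helpers s2_memattr_for_slot() and
s2_desc_set_memattr() and the S2_MEMATTR_* macros are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stage-2 MemAttr[3:0] values, as used by this patch. */
#define MT_S2_NORMAL		0xfu	/* normal, write-back */
#define MT_S2_NORMAL_NC		0x5u	/* normal, non-cacheable */

/* Memslot flag bit, as proposed in patch 2/3 of this series. */
#define KVM_MEM_UNCACHED	(1UL << 2)

/* MemAttr is shifted left by 2 in a stage-2 descriptor. */
#define S2_MEMATTR_SHIFT	2
#define S2_MEMATTR_MASK		(UINT64_C(0xf) << S2_MEMATTR_SHIFT)

/* Hypothetical helper: pick the stage-2 memory type for a memslot. */
static unsigned int s2_memattr_for_slot(unsigned long slot_flags)
{
	return (slot_flags & KVM_MEM_UNCACHED) ? MT_S2_NORMAL_NC
					       : MT_S2_NORMAL;
}

/* Hypothetical helper: replace the MemAttr field of a descriptor. */
static uint64_t s2_desc_set_memattr(uint64_t desc, unsigned int memattr)
{
	assert(memattr <= 0xf);
	return (desc & ~S2_MEMATTR_MASK) |
	       ((uint64_t)memattr << S2_MEMATTR_SHIFT);
}
```

The mask-then-or step mirrors how the 32-bit PAGE_S2_NORMAL_NC is
derived from PAGE_S2 in this patch; the arm64 definition builds the
value directly instead.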

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 405aa18833073..e8034a80b12e5 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
 	while (size) {
 		void *va = kmap_atomic_pfn(pfn);
 
-		if (need_flush)
+		if (need_flush) {
 			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+			if (ipa_uncached)
+				set_memory_nc((unsigned long)va, 1);
+		}
 
 		if (icache_is_pipt())
 			__cpuc_coherent_user_range((unsigned long)va,
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index a745a2a53853c..39b3f7a40e663 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -121,6 +121,7 @@
  * 2nd stage PTE definitions for LPAE.
  */
 #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
+#define L_PTE_S2_MT_NORMAL_NC		(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
 #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
 #define L_PTE_S2_MT_WRITEBACK		(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
 #define L_PTE_S2_MT_DEV_SHARED		(_AT(pteval_t, 0x1) << 2) /* device */
diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index f40354198bad4..ae13ca8b0a23d 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -100,6 +100,7 @@ extern pgprot_t		pgprot_s2_device;
 #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
 #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
 #define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
+#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
 #define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
 
 #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index bc1665acd73e7..6b3bd8061bd2a 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	struct vm_area_struct *vma;
 	pfn_t pfn;
 	pgprot_t mem_type = PAGE_S2;
-	bool fault_ipa_uncached;
+	bool fault_ipa_uncached = false;
 	bool logging_active = memslot_is_logging(memslot);
 	unsigned long flags = 0;
 
@@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			writable = false;
 	}
 
+	if (memslot->flags & KVM_MEM_UNCACHED) {
+		mem_type = PAGE_S2_NORMAL_NC;
+		fault_ipa_uncached = true;
+	}
+
 	spin_lock(&kvm->mmu_lock);
 	if (mmu_notifier_retry(kvm, mmu_seq))
 		goto out_unlock;
@@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (!hugetlb && !force_pte)
 		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
 
-	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
-
 	if (hugetlb) {
 		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
 		new_pmd = pmd_mkhuge(new_pmd);
@@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
 			     unsigned long start,
 			     unsigned long end,
 			     int (*handler)(struct kvm *kvm,
+					    struct kvm_memory_slot *slot,
 					    gpa_t gpa, void *data),
 			     void *data)
 {
@@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
 
 		for (; gfn < gfn_end; ++gfn) {
 			gpa_t gpa = gfn << PAGE_SHIFT;
-			ret |= handler(kvm, gpa, data);
+			ret |= handler(kvm, memslot, gpa, data);
 		}
 	}
 
 	return ret;
 }
 
-static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
+				 gpa_t gpa, void *data)
 {
 	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
 	return 0;
@@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
 	return 0;
 }
 
-static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
+				gpa_t gpa, void *data)
 {
-	pte_t *pte = (pte_t *)data;
+	pte_t pte = *((pte_t *)data);
+
+	if (slot->flags & KVM_MEM_UNCACHED)
+		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
+	else
+		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
 
 	/*
 	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
@@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
 	 * therefore stage2_set_pte() never needs to clear out a huge PMD
 	 * through this calling path.
 	 */
-	stage2_set_pte(kvm, NULL, gpa, pte, 0);
+	stage2_set_pte(kvm, NULL, gpa, &pte, 0);
 	return 0;
 }
 
@@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
 {
 	unsigned long end = hva + PAGE_SIZE;
-	pte_t stage2_pte;
 
 	if (!kvm->arch.pgd)
 		return;
 
 	trace_kvm_set_spte_hva(hva);
-	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
-	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
+	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
 }
 
-static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
+			       gpa_t gpa, void *data)
 {
 	pmd_t *pmd;
 	pte_t *pte;
@@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
 	return 0;
 }
 
-static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
+static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
+				    gpa_t gpa, void *data)
 {
 	pmd_t *pmd;
 	pte_t *pte;
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 61505676d0853..af5f0f0eccef9 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
 {
 	void *va = page_address(pfn_to_page(pfn));
 
-	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
+	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
 		kvm_flush_dcache_to_poc(va, size);
+		if (ipa_uncached)
+			set_memory_nc((unsigned long)va, size/PAGE_SIZE);
+	}
 
 	if (!icache_is_aliasing()) {		/* PIPT */
 		flush_icache_range((unsigned long)va,
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index f800d45ea2265..800730f7aa7d9 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -105,6 +105,7 @@
  * Memory types for Stage-2 translation
  */
 #define MT_S2_NORMAL		0xf
+#define MT_S2_NORMAL_NC		0x5
 #define MT_S2_DEVICE_nGnRE	0x1
 
 #ifndef __ASSEMBLY__
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 56283f8a675c5..a254925ce1b6b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -78,6 +78,7 @@ extern void __pgd_error(const char *file, int line, unsigned long val);
 #define PAGE_HYP_DEVICE		__pgprot(PROT_DEVICE_nGnRE | PTE_HYP)
 
 #define PAGE_S2			__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)
+#define PAGE_S2_NORMAL_NC	__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL_NC) | PTE_S2_RDONLY)
 #define PAGE_S2_DEVICE		__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_UXN)
 
 #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_TYPE_MASK) | PTE_PROT_NONE | PTE_PXN | PTE_UXN)
-- 
2.1.0

* Re: [Qemu-devel] [RFC/RFT PATCH v2 2/3] KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
  2015-05-13 11:31   ` Andrew Jones
@ 2015-05-14 10:12     ` Paolo Bonzini
  -1 siblings, 0 replies; 102+ messages in thread
From: Paolo Bonzini @ 2015-05-14 10:12 UTC (permalink / raw)
  To: Andrew Jones, kvmarm, qemu-devel, ard.biesheuvel,
	christoffer.dall, marc.zyngier, peter.maydell, agraf
  Cc: catalin.marinas, j.fanguede, lersek, m.smarduch



On 13/05/2015 13:31, Andrew Jones wrote:
> Commit 1050dcda30529 introduced KVM_MEMSLOT_INCOHERENT to flag memory
> regions that may have coherency issues due to mapping host system RAM
> as non-cacheable. This was introduced as a KVM internal flag, but now
> give KVM userspace access to it so that it may use it for hinting
> likely problematic regions. Also rename to KVM_MEM_UNCACHED.
> 
> Signed-off-by: Andrew Jones <drjones@redhat.com>
> ---
>  Documentation/virtual/kvm/api.txt | 20 ++++++++++++++------
>  arch/arm/include/uapi/asm/kvm.h   |  1 +
>  arch/arm/kvm/arm.c                |  1 +
>  arch/arm/kvm/mmu.c                |  4 ++--
>  arch/arm64/include/uapi/asm/kvm.h |  1 +
>  include/linux/kvm_host.h          |  1 -
>  include/uapi/linux/kvm.h          |  2 ++
>  virt/kvm/kvm_main.c               |  7 ++++++-
>  8 files changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 9fa2bf8c3f6f1..04ffd9f5db5f2 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -905,6 +905,7 @@ struct kvm_userspace_memory_region {
>  /* for kvm_memory_region::flags */
>  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>  #define KVM_MEM_READONLY	(1UL << 1)
> +#define KVM_MEM_UNCACHED	(1UL << 2)
>  
>  This ioctl allows the user to create or modify a guest physical memory
>  slot.  When changing an existing slot, it may be moved in the guest
> @@ -920,12 +921,19 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>  
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> -to make a new slot read-only.  In this case, writes to this memory will be
> -posted to userspace as KVM_EXIT_MMIO exits.
> +The flags field supports the following flags. All flags may be used in
> +combination.
> +  - KVM_MEM_LOG_DIRTY_PAGES: no capability required
> +      Set this to instruct KVM to keep track of writes to memory within
> +      the slot. (See the KVM_GET_DIRTY_LOG ioctl)
> +  - KVM_MEM_READONLY: depends on capability KVM_CAP_READONLY_MEM
> +      Set this to make a new slot read-only. In this case, writes to
> +      this memory will be posted to userspace as KVM_EXIT_MMIO exits.
> +      EINVAL will be returned if the slot is not new.
> +  - KVM_MEM_UNCACHED: depends on capability KVM_CAP_UNCACHED_MEM
> +      Set this to make a new slot uncached, i.e. userspace will always
> +      directly read/write RAM for this memory region. EINVAL will be
> +      returned if the slot is not new.
>  
>  When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>  the memory region are automatically reflected into the guest.  For example, an
> diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
> index df3f60cb1168a..893a331655115 100644
> --- a/arch/arm/include/uapi/asm/kvm.h
> +++ b/arch/arm/include/uapi/asm/kvm.h
> @@ -26,6 +26,7 @@
>  #define __KVM_HAVE_GUEST_DEBUG
>  #define __KVM_HAVE_IRQ_LINE
>  #define __KVM_HAVE_READONLY_MEM
> +#define __KVM_HAVE_UNCACHED_MEM
>  
>  #define KVM_REG_SIZE(id)						\
>  	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index d9631ecddd56e..cbb532de9c8b5 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -182,6 +182,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_ARM_PSCI_0_2:
>  	case KVM_CAP_READONLY_MEM:
>  	case KVM_CAP_MP_STATE:
> +	case KVM_CAP_UNCACHED_MEM:
>  		r = 1;
>  		break;
>  	case KVM_CAP_COALESCED_MMIO:
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 1d5accbd3dcf2..bc1665acd73e7 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1307,7 +1307,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (!hugetlb && !force_pte)
>  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
>  
> -	fault_ipa_uncached = memslot->flags & KVM_MEMSLOT_INCOHERENT;
> +	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
>  
>  	if (hugetlb) {
>  		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
> @@ -1834,7 +1834,7 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	 * regions as incoherent.
>  	 */
>  	if (slot->flags & KVM_MEM_READONLY)
> -		slot->flags |= KVM_MEMSLOT_INCOHERENT;
> +		slot->flags |= KVM_MEM_UNCACHED;
>  	return 0;
>  }
>  
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index d26832022127e..be46855ca01b7 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -38,6 +38,7 @@
>  #define __KVM_HAVE_GUEST_DEBUG
>  #define __KVM_HAVE_IRQ_LINE
>  #define __KVM_HAVE_READONLY_MEM
> +#define __KVM_HAVE_UNCACHED_MEM
>  
>  #define KVM_REG_SIZE(id)						\
>  	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ad45054309a0f..74d6a14529f53 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -39,7 +39,6 @@
>   * include/linux/kvm_h.
>   */
>  #define KVM_MEMSLOT_INVALID	(1UL << 16)
> -#define KVM_MEMSLOT_INCOHERENT	(1UL << 17)
>  
>  /* Two fragments for cross MMIO pages. */
>  #define KVM_MAX_MMIO_FRAGMENTS	2
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 4b60056776d14..ceb6e805e5f77 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -108,6 +108,7 @@ struct kvm_userspace_memory_region {
>   */
>  #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>  #define KVM_MEM_READONLY	(1UL << 1)
> +#define KVM_MEM_UNCACHED	(1UL << 2)
>  
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> @@ -814,6 +815,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_S390_INJECT_IRQ 113
>  #define KVM_CAP_S390_IRQ_STATE 114
>  #define KVM_CAP_PPC_HWRNG 115
> +#define KVM_CAP_UNCACHED_MEM 116
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 90977418aeb6e..d1d535d80a075 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -719,6 +719,10 @@ static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
>  	valid_flags |= KVM_MEM_READONLY;
>  #endif
>  
> +#ifdef __KVM_HAVE_UNCACHED_MEM
> +	valid_flags |= KVM_MEM_UNCACHED;
> +#endif
> +
>  	if (mem->flags & ~valid_flags)
>  		return -EINVAL;
>  
> @@ -816,7 +820,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  		else { /* Modify an existing slot. */
>  			if ((mem->userspace_addr != old.userspace_addr) ||
>  			    (npages != old.npages) ||
> -			    ((new.flags ^ old.flags) & KVM_MEM_READONLY))
> +			    ((new.flags ^ old.flags) &
> +			     (KVM_MEM_READONLY | KVM_MEM_UNCACHED)))
>  				goto out;
>  
>  			if (base_gfn != old.base_gfn)
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
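
The flag rules in the quoted api.txt and kvm_main.c hunks can be
sketched in standalone C as follows. This is only a model under stated
assumptions: struct arch_caps is a hypothetical stand-in for the
__KVM_HAVE_* compile-time checks, and check_region_flags() and
check_flag_change() are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
#define KVM_MEM_READONLY	(1UL << 1)
#define KVM_MEM_UNCACHED	(1UL << 2)	/* proposed by this patch */

/* Hypothetical stand-in for the per-arch __KVM_HAVE_* defines. */
struct arch_caps {
	bool readonly_mem;
	bool uncached_mem;
};

/* Reject any flag the architecture does not advertise. */
static int check_region_flags(const struct arch_caps *caps, uint32_t flags)
{
	uint32_t valid = KVM_MEM_LOG_DIRTY_PAGES;

	assert(caps);
	if (caps->readonly_mem)
		valid |= KVM_MEM_READONLY;
	if (caps->uncached_mem)
		valid |= KVM_MEM_UNCACHED;

	return (flags & ~valid) ? -EINVAL : 0;
}

/* READONLY and UNCACHED may only be chosen when a slot is created. */
static int check_flag_change(uint32_t old_flags, uint32_t new_flags)
{
	if ((old_flags ^ new_flags) & (KVM_MEM_READONLY | KVM_MEM_UNCACHED))
		return -EINVAL;
	return 0;
}
```

Dirty logging stays toggleable on an existing slot in this model, while
flipping the uncached bit on an existing slot fails with -EINVAL, which
matches the "EINVAL will be returned if the slot is not new" wording.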

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-13 11:31 ` Andrew Jones
@ 2015-05-14 10:30   ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 10:30 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Wed, May 13, 2015 at 01:31:51PM +0200, Andrew Jones wrote:
> Introduce a new memory region flag, KVM_MEM_UNCACHED, which is
> needed by ARM. This flag informs KVM that the given memory region
> is typically mapped by the guest as non-cacheable. KVM for ARM
> then ensures that that memory is indeed mapped non-cacheable by
> the guest, and also remaps that region as non-cacheable for
> userspace, allowing them both to maintain a coherent view.
> 
> Changes since v1:
>  1) don't pin pages [Paolo]
>  2) ensure the guest maps the memory non-cacheable [me]
>  3) clean up memslot flag documentation [Christoffer]
> changes 1 and 2 effectively redesigned/rewrote v1. Find v1 here
> http://www.spinics.net/lists/kvm-arm/msg14022.html
> 
> The QEMU series for v1 hasn't really changed. Only the linux
> header hack needed to bump KVM_CAP_UNCACHED_MEM from 107 to
> 116.  Find the series here
> http://www.spinics.net/lists/kvm-arm/msg14026.html
> 
> Testing:
> This series still needs lots of testing, but I thought I'd
> kick it to the list early, as there's been recent interest
> in solving this problem, and I'd like to get test results
> and opinions on this approach from others sooner than later.
> I've tested with AAVMF (UEFI for AArch64 mach-virt guests).
> AAVMF has a kludge in it to avoid the coherency problem.

How does the 'kludge' work?

> I've tested both with and without that kludge active. Both
> worked for me (almost). Sometimes with the non-kludged
> version I was still able to see a bit of corruption in
> grub's output after edk2 loaded it - not much, and not always,
> but something.

Remind me, this is a VGA framebuffer corruption with a PCI-plugged VGA
card?

Thanks,
-Christoffer

> Anyway, it's quite frustrating, as I'm not sure
> what I'm missing...
> 
> This series applies to Linus' 110bc76729d4, but I tested with
> a version backported to the current RHELSA kernel.
> 
> Thanks for reviews and testing!
> 
> drew
> 
> 
> Andrew Jones (3):
>   arm/arm64: pageattr: add set_memory_nc
>   KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
>   arm/arm64: KVM: implement 'uncached' mem coherency
> 
>  Documentation/virtual/kvm/api.txt     | 20 ++++++++++++------
>  arch/arm/include/asm/cacheflush.h     |  1 +
>  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
>  arch/arm/include/asm/pgtable-3level.h |  1 +
>  arch/arm/include/asm/pgtable.h        |  1 +
>  arch/arm/include/uapi/asm/kvm.h       |  1 +
>  arch/arm/kvm/arm.c                    |  1 +
>  arch/arm/kvm/mmu.c                    | 39 ++++++++++++++++++++++-------------
>  arch/arm/mm/pageattr.c                |  7 +++++++
>  arch/arm64/include/asm/cacheflush.h   |  1 +
>  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
>  arch/arm64/include/asm/memory.h       |  1 +
>  arch/arm64/include/asm/pgtable.h      |  1 +
>  arch/arm64/include/uapi/asm/kvm.h     |  1 +
>  arch/arm64/mm/pageattr.c              |  8 +++++++
>  include/linux/kvm_host.h              |  1 -
>  include/uapi/linux/kvm.h              |  2 ++
>  virt/kvm/kvm_main.c                   |  7 ++++++-
>  18 files changed, 79 insertions(+), 24 deletions(-)
> 
> -- 
> 2.1.0
> 


* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-13 11:31 ` Andrew Jones
@ 2015-05-14 10:31   ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-14 10:31 UTC (permalink / raw)
  To: kvmarm, qemu-devel, ard.biesheuvel, christoffer.dall,
	marc.zyngier, peter.maydell, pbonzini, agraf
  Cc: catalin.marinas, lersek

On Wed, May 13, 2015 at 01:31:51PM +0200, Andrew Jones wrote:
> Introduce a new memory region flag, KVM_MEM_UNCACHED, which is
> needed by ARM. This flag informs KVM that the given memory region
> is typically mapped by the guest as non-cacheable. KVM for ARM
> then ensures that that memory is indeed mapped non-cacheable by
> the guest, and also remaps that region as non-cacheable for
> userspace, allowing them both to maintain a coherent view.
> 
> Changes since v1:
>  1) don't pin pages [Paolo]
>  2) ensure the guest maps the memory non-cacheable [me]
>  3) clean up memslot flag documentation [Christoffer]

I forgot change (4): switch userspace's mapping from device memory
to normal, non-cacheable memory. Using device memory caused a problem
that Alex Graf found, and Peter Maydell suggested using normal,
non-cacheable instead.


> changes 1 and 2 effectively redesigned/rewrote v1. Find v1 here
> http://www.spinics.net/lists/kvm-arm/msg14022.html
> 
> The QEMU series for v1 hasn't really changed. Only the linux
> header hack needed to bump KVM_CAP_UNCACHED_MEM from 107 to
> 116.  Find the series here
> http://www.spinics.net/lists/kvm-arm/msg14026.html
> 
> Testing:
> This series still needs lots of testing, but I thought I'd
> kick it to the list early, as there's been recent interest
> in solving this problem, and I'd like to get test results
> and opinions on this approach from others sooner than later.
> I've tested with AAVMF (UEFI for AArch64 mach-virt guests).
> AAVMF has a kludge in it to avoid the coherency problem.
> I've tested both with and without that kludge active. Both
> worked for me (almost). Sometimes with the non-kludged
> version I was still able to see a bit of corruption in
> grub's output after edk2 loaded it - not much, and not always,
> but something. Anyway, it's quite frustrating, as I'm not sure
> what I'm missing...
> 
> This series applies to Linus' 110bc76729d4, but I tested with
> a version backported to the current RHELSA kernel.
> 
> Thanks for reviews and testing!
> 
> drew
> 
> 
> Andrew Jones (3):
>   arm/arm64: pageattr: add set_memory_nc
>   KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
>   arm/arm64: KVM: implement 'uncached' mem coherency
> 
>  Documentation/virtual/kvm/api.txt     | 20 ++++++++++++------
>  arch/arm/include/asm/cacheflush.h     |  1 +
>  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
>  arch/arm/include/asm/pgtable-3level.h |  1 +
>  arch/arm/include/asm/pgtable.h        |  1 +
>  arch/arm/include/uapi/asm/kvm.h       |  1 +
>  arch/arm/kvm/arm.c                    |  1 +
>  arch/arm/kvm/mmu.c                    | 39 ++++++++++++++++++++++-------------
>  arch/arm/mm/pageattr.c                |  7 +++++++
>  arch/arm64/include/asm/cacheflush.h   |  1 +
>  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
>  arch/arm64/include/asm/memory.h       |  1 +
>  arch/arm64/include/asm/pgtable.h      |  1 +
>  arch/arm64/include/uapi/asm/kvm.h     |  1 +
>  arch/arm64/mm/pageattr.c              |  8 +++++++
>  include/linux/kvm_host.h              |  1 -
>  include/uapi/linux/kvm.h              |  2 ++
>  virt/kvm/kvm_main.c                   |  7 ++++++-
>  18 files changed, 79 insertions(+), 24 deletions(-)
> 
> -- 
> 2.1.0
> 


* Re: [Qemu-devel] [RFC/RFT PATCH v2 2/3] KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
  2015-05-13 11:31   ` Andrew Jones
@ 2015-05-14 10:34     ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 10:34 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Wed, May 13, 2015 at 01:31:53PM +0200, Andrew Jones wrote:
> Commit 1050dcda30529 introduced KVM_MEMSLOT_INCOHERENT to flag memory
> regions that may have coherency issues due to mapping host system RAM
> as non-cacheable. This was introduced as a KVM internal flag, but now
> give KVM userspace access to it so that it may use it for hinting
> likely problematic regions. Also rename to KVM_MEM_UNCACHED.
> 
> Signed-off-by: Andrew Jones <drjones@redhat.com>

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>


* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 10:31   ` Andrew Jones
@ 2015-05-14 10:37     ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2015-05-14 10:37 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, QEMU Developers,
	Alexander Graf, Paolo Bonzini, Laszlo Ersek, kvmarm,
	Christoffer Dall

On 14 May 2015 at 11:31, Andrew Jones <drjones@redhat.com> wrote:
> I forgot change (4): switch userspace's mapping from device memory
> to normal, non-cacheable memory. Using device memory caused a problem
> that Alex Graf found, and Peter Maydell suggested using normal,
> non-cacheable instead.

Did you check that non-cacheable is definitely the correct
kind of Normal memory attribute we want? (i.e. not write-through).

-- PMM
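
For reference, the distinction raised above (Normal non-cacheable
versus write-through) shows up directly in the stage-2 memory-type
values used by the patch under review: 0x5 for normal non-cacheable,
next to the existing 0xa (write-through) and 0xf (write-back). The
sketch below decodes such a value on the assumption that bits [3:2]
carry the outer cacheability and bits [1:0] the inner cacheability,
with 00 in bits [3:2] meaning a device type. The enum and helper names
are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Memory-type values as they appear in the patch quoted in this thread. */
#define S2_MT_STRONGLY_ORDERED	0x0
#define S2_MT_DEVICE		0x1
#define S2_MT_NORMAL_NC		0x5
#define S2_MT_WRITETHROUGH	0xa
#define S2_MT_WRITEBACK		0xf

/* Hypothetical names for the cacheability held in one 2-bit field. */
enum cacheability {
	CACHE_RESERVED,
	CACHE_NC,	/* non-cacheable */
	CACHE_WT,	/* write-through */
	CACHE_WB,	/* write-back */
};

static enum cacheability cache_field(unsigned int field)
{
	switch (field & 0x3) {
	case 0x1: return CACHE_NC;
	case 0x2: return CACHE_WT;
	case 0x3: return CACHE_WB;
	default:  return CACHE_RESERVED;
	}
}

/* Bits [3:2] == 00 select a device (or strongly ordered) type. */
static int s2_memattr_is_device(unsigned int memattr)
{
	return ((memattr >> 2) & 0x3) == 0;
}

/* Outer cacheability; only meaningful for Normal memory. */
static enum cacheability s2_memattr_outer(unsigned int memattr)
{
	assert(!s2_memattr_is_device(memattr));
	return cache_field(memattr >> 2);
}

/* Inner cacheability; only meaningful for Normal memory. */
static enum cacheability s2_memattr_inner(unsigned int memattr)
{
	assert(!s2_memattr_is_device(memattr));
	return cache_field(memattr);
}
```

With this decoding, 0x5 is non-cacheable in both fields and 0xa is
write-through in both, so choosing between them is a one-constant
change in the patch.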


* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
  2015-05-13 11:31   ` Andrew Jones
@ 2015-05-14 10:55     ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 10:55 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> When S1 and S2 memory attributes combine wrt to caching policy,
> non-cacheable types take precedence. If a guest maps a region as
> device memory, which KVM userspace is using to emulate the device
> using normal, cacheable memory, then we lose coherency. With
> KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> regions are likely to be problematic. With this patch, as pages
> of these types of regions are faulted into the guest, not only do
> we flush the page's dcache, but we also change userspace's
> mapping to NC in order to maintain coherency.
> 
> What if the guest doesn't do what we expect? While we can't
> force a guest to use cacheable memory, we can take advantage of
> the non-cacheable precedence, and force it to use non-cacheable.
> So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> KVM_MEM_UNCACHED regions to force them to NC.
> 
> We now have both guest and userspace on the same page (pun intended)

I'd like to revisit the overall approach here.  Is doing non-cached
accesses in both the guest and host really the right thing to do here?

The semantics of the device become that it is cache coherent (because
QEMU is), and I think Marc argued that Linux/UEFI should simply be
adapted to handle whatever emulated devices we have as coherent.  I also
remember someone arguing that would be wrong (Peter?).

Finally, does this address all cache coherency issues with emulated
devices?  Some VOS guys had seen things still not working with this
approach, unsure why...  I'd like to avoid us merging this only to merge
a more complete solution in a few weeks which reverts this solution...
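
The precedence rule from the quoted commit message (when S1 and S2
attributes combine, the non-cacheable type wins) can be sketched as
below. This is a deliberately simplified model: it collapses
inner/outer attributes and the device subtypes into one ordered scale,
which is an assumption made for illustration only, and every name in
it is hypothetical.

```c
#include <assert.h>

/* Ordered from most to least restrictive; the smaller value wins. */
enum sketch_mt {
	SK_MT_DEVICE = 0,
	SK_MT_NORMAL_NC,
	SK_MT_NORMAL_WT,
	SK_MT_NORMAL_WB,
};

/* Effective guest memory type after combining stage 1 and stage 2. */
static enum sketch_mt combine_s1_s2(enum sketch_mt s1, enum sketch_mt s2)
{
	return s1 < s2 ? s1 : s2;
}

static int sketch_mt_is_cacheable(enum sketch_mt t)
{
	return t >= SK_MT_NORMAL_WT;
}

/*
 * In this model guest and userspace agree when both access the page
 * through the cache, or both bypass it.
 */
static int sketch_views_agree(enum sketch_mt guest_s1, enum sketch_mt s2,
			      enum sketch_mt userspace)
{
	enum sketch_mt guest = combine_s1_s2(guest_s1, s2);

	assert(userspace != SK_MT_DEVICE);	/* userspace maps Normal memory here */
	return sketch_mt_is_cacheable(guest) == sketch_mt_is_cacheable(userspace);
}
```

In this model, a guest stage-1 device mapping over a write-back stage 2
disagrees with a cacheable userspace mapping, which is the problem case
described in the commit message. Forcing stage 2 to non-cacheable makes
the guest side non-cacheable regardless of its stage-1 choice, so the
views agree once userspace is non-cacheable as well.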

More comments/questions below:

> 
> Signed-off-by: Andrew Jones <drjones@redhat.com>
> ---
>  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
>  arch/arm/include/asm/pgtable-3level.h |  1 +
>  arch/arm/include/asm/pgtable.h        |  1 +
>  arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
>  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
>  arch/arm64/include/asm/memory.h       |  1 +
>  arch/arm64/include/asm/pgtable.h      |  1 +
>  7 files changed, 36 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 405aa18833073..e8034a80b12e5 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
>  	while (size) {
>  		void *va = kmap_atomic_pfn(pfn);
>  
> -		if (need_flush)
> +		if (need_flush) {
>  			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> +			if (ipa_uncached)
> +				set_memory_nc((unsigned long)va, 1);

nit: consider moving this outside the need_flush

> +		}
>  
>  		if (icache_is_pipt())
>  			__cpuc_coherent_user_range((unsigned long)va,
> diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> index a745a2a53853c..39b3f7a40e663 100644
> --- a/arch/arm/include/asm/pgtable-3level.h
> +++ b/arch/arm/include/asm/pgtable-3level.h
> @@ -121,6 +121,7 @@
>   * 2nd stage PTE definitions for LPAE.
>   */
>  #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
> +#define L_PTE_S2_MT_NORMAL_NC		(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
>  #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
>  #define L_PTE_S2_MT_WRITEBACK		(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
>  #define L_PTE_S2_MT_DEV_SHARED		(_AT(pteval_t, 0x1) << 2) /* device */
> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index f40354198bad4..ae13ca8b0a23d 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -100,6 +100,7 @@ extern pgprot_t		pgprot_s2_device;
>  #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
>  #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
>  #define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
> +#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
>  #define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
>  
>  #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index bc1665acd73e7..6b3bd8061bd2a 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	struct vm_area_struct *vma;
>  	pfn_t pfn;
>  	pgprot_t mem_type = PAGE_S2;
> -	bool fault_ipa_uncached;
> +	bool fault_ipa_uncached = false;
>  	bool logging_active = memslot_is_logging(memslot);
>  	unsigned long flags = 0;
>  
> @@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			writable = false;
>  	}
>  
> +	if (memslot->flags & KVM_MEM_UNCACHED) {
> +		mem_type = PAGE_S2_NORMAL_NC;
> +		fault_ipa_uncached = true;
> +	}
> +
>  	spin_lock(&kvm->mmu_lock);
>  	if (mmu_notifier_retry(kvm, mmu_seq))
>  		goto out_unlock;
> @@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (!hugetlb && !force_pte)
>  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
>  
> -	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
> -
>  	if (hugetlb) {
>  		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
>  		new_pmd = pmd_mkhuge(new_pmd);
> @@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
>  			     unsigned long start,
>  			     unsigned long end,
>  			     int (*handler)(struct kvm *kvm,
> +					    struct kvm_memory_slot *slot,
>  					    gpa_t gpa, void *data),
>  			     void *data)
>  {
> @@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
>  
>  		for (; gfn < gfn_end; ++gfn) {
>  			gpa_t gpa = gfn << PAGE_SHIFT;
> -			ret |= handler(kvm, gpa, data);
> +			ret |= handler(kvm, memslot, gpa, data);
>  		}
>  	}
>  
>  	return ret;
>  }
>  
> -static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> +static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				 gpa_t gpa, void *data)

Maybe we should consider a pointer to a struct with the relevant data to
pass around to the handler by now, which would allow us to get rid of
the void * cast as well.  Not sure if it's worth it.
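
A rough sketch of what such a struct could look like follows. It is
not taken from the kernel: struct hva_walk, its fields, walk_slot()
and the 4K page shift are all hypothetical stand-ins, and struct
sketch_memslot models only the fields the example needs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4K pages */

/* Minimal stand-in for a memory slot. */
struct sketch_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	unsigned int flags;
};

/* Hypothetical: everything a per-page handler needs, typed, in one place. */
struct hva_walk {
	const struct sketch_memslot *slot;	/* slot containing the page */
	uint64_t gpa;				/* guest physical address */
	uint64_t new_pte;			/* payload formerly passed as void * */
	unsigned int visited;			/* example of handler-updated state */
};

typedef int (*hva_handler_t)(struct hva_walk *walk);

/* Call the handler once per page of the slot, OR-ing the results. */
static int walk_slot(const struct sketch_memslot *slot, hva_handler_t handler,
		     struct hva_walk *walk)
{
	uint64_t gfn;
	int ret = 0;

	assert(slot && handler && walk);
	walk->slot = slot;
	for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages; ++gfn) {
		walk->gpa = gfn << SKETCH_PAGE_SHIFT;
		ret |= handler(walk);
	}
	return ret;
}

/* Example handler: no cast needed to reach its data. */
static int count_pages(struct hva_walk *walk)
{
	walk->visited++;
	return 0;
}
```

Adding another per-page input later would then mean adding a field
rather than changing every handler's signature.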

>  {
>  	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
>  	return 0;
> @@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
>  	return 0;
>  }
>  
> -static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> +static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				gpa_t gpa, void *data)
>  {
> -	pte_t *pte = (pte_t *)data;
> +	pte_t pte = *((pte_t *)data);
> +
> +	if (slot->flags & KVM_MEM_UNCACHED)
> +		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
> +	else
> +		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
>  
>  	/*
>  	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
> @@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
>  	 * therefore stage2_set_pte() never needs to clear out a huge PMD
>  	 * through this calling path.
>  	 */
> -	stage2_set_pte(kvm, NULL, gpa, pte, 0);
> +	stage2_set_pte(kvm, NULL, gpa, &pte, 0);

this is making me feel like we should have a separate patch that changes
stage2_set_pte from taking a pointer to just taking a value for the new
pte.
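
To illustrate the by-value idea, here is a small standalone model of
the handler logic in the hunk above: take the incoming PTE, keep its
pfn, rebuild it with the memory type chosen from the slot flag, and
hand the result to the setter by value. All sketch_* names, the flag
bit, the table and the PTE layout are hypothetical simplifications;
only the attribute values 0xf and 0x5 and the shift by 2 mirror the
definitions in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_MEM_UNCACHED	(1u << 0)	/* hypothetical flag bit */
#define SKETCH_ATTR_NORMAL	0xfu
#define SKETCH_ATTR_NORMAL_NC	0x5u
#define SKETCH_TABLE_SIZE	16u

typedef uint64_t sketch_pte_t;

static sketch_pte_t sketch_stage2_table[SKETCH_TABLE_SIZE];

static sketch_pte_t sketch_pfn_pte(uint64_t pfn, unsigned int attr)
{
	return (pfn << SKETCH_PAGE_SHIFT) | ((uint64_t)attr << 2);
}

static uint64_t sketch_pte_pfn(sketch_pte_t pte)
{
	return pte >> SKETCH_PAGE_SHIFT;
}

/* By-value setter: callers pass the new entry, not a pointer to it. */
static void sketch_stage2_set_pte(unsigned int idx, sketch_pte_t new_pte)
{
	assert(idx < SKETCH_TABLE_SIZE);
	sketch_stage2_table[idx] = new_pte;
}

/* Rebuild the entry with the slot's memory type and install it. */
static void sketch_set_spte(unsigned int idx, unsigned int slot_flags,
			    sketch_pte_t host_pte)
{
	unsigned int attr = (slot_flags & SKETCH_MEM_UNCACHED) ?
			    SKETCH_ATTR_NORMAL_NC : SKETCH_ATTR_NORMAL;

	sketch_stage2_set_pte(idx, sketch_pfn_pte(sketch_pte_pfn(host_pte), attr));
}
```

With a by-value setter the handler does not need a local copy whose
address it then passes on.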

>  	return 0;
>  }
>  
> @@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
>  {
>  	unsigned long end = hva + PAGE_SIZE;
> -	pte_t stage2_pte;
>  
>  	if (!kvm->arch.pgd)
>  		return;
>  
>  	trace_kvm_set_spte_hva(hva);
> -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> -	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> +	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);

hooking in here will make sure you catch changes to the page used for
the mapping, but wouldn't that also mean that the userspace mapping
would have been changed, and where are you updating this?

Also, is this called if the userspace mapping is zapped without doing
anything about the underlying page?  (how do we then catch when the
userspace pte is populated again, and is this even possible?)

>  }
>  
> -static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> +static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			       gpa_t gpa, void *data)
>  {
>  	pmd_t *pmd;
>  	pte_t *pte;
> @@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
>  	return 0;
>  }
>  
> -static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> +static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				    gpa_t gpa, void *data)
>  {
>  	pmd_t *pmd;
>  	pte_t *pte;
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 61505676d0853..af5f0f0eccef9 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
>  {
>  	void *va = page_address(pfn_to_page(pfn));
>  
> -	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
> +	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
>  		kvm_flush_dcache_to_poc(va, size);
> +		if (ipa_uncached)
> +			set_memory_nc((unsigned long)va, size/PAGE_SIZE);

are you not setting the kernel mapping of the page to non-cached here,
which doesn't affect your userspace mappings at all?

(this would explain why things still break with this series).
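
The concern can be pictured with a toy model in which one physical
page has two independent mappings, each with its own attribute. This
is only a sketch of the aliasing question being asked, not of how
set_memory_nc is implemented; the addresses, the pfn and all sketch_*
names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_attr { SKETCH_CACHED, SKETCH_NC };

struct sketch_mapping {
	unsigned long va;
	unsigned long pfn;
	enum sketch_attr attr;
};

/* Two aliases of the same physical page. */
static struct sketch_mapping sketch_mappings[] = {
	{ 0xc0001000UL, 0x1234, SKETCH_CACHED },	/* kernel alias */
	{ 0x00401000UL, 0x1234, SKETCH_CACHED },	/* userspace alias */
};

#define SKETCH_NR_MAPPINGS \
	(sizeof(sketch_mappings) / sizeof(sketch_mappings[0]))

/* Changes only the mapping found at this VA; other aliases are untouched. */
static int sketch_set_nc_by_va(unsigned long va)
{
	size_t i;

	for (i = 0; i < SKETCH_NR_MAPPINGS; i++) {
		if (sketch_mappings[i].va == va) {
			sketch_mappings[i].attr = SKETCH_NC;
			return 0;
		}
	}
	return -1;
}

/* True only if every alias of the page is non-cacheable. */
static int sketch_all_aliases_nc(unsigned long pfn)
{
	size_t i;

	for (i = 0; i < SKETCH_NR_MAPPINGS; i++) {
		if (sketch_mappings[i].pfn == pfn &&
		    sketch_mappings[i].attr != SKETCH_NC)
			return 0;
	}
	return 1;
}
```

In the model, changing the attribute through the kernel alias leaves
the userspace alias cacheable, so the page is still reachable through
the cache from userspace.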

> +	}
>  
>  	if (!icache_is_aliasing()) {		/* PIPT */
>  		flush_icache_range((unsigned long)va,
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index f800d45ea2265..800730f7aa7d9 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -105,6 +105,7 @@
>   * Memory types for Stage-2 translation
>   */
>  #define MT_S2_NORMAL		0xf
> +#define MT_S2_NORMAL_NC		0x5
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
>  #ifndef __ASSEMBLY__
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 56283f8a675c5..a254925ce1b6b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -78,6 +78,7 @@ extern void __pgd_error(const char *file, int line, unsigned long val);
>  #define PAGE_HYP_DEVICE		__pgprot(PROT_DEVICE_nGnRE | PTE_HYP)
>  
>  #define PAGE_S2			__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)
> +#define PAGE_S2_NORMAL_NC	__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL_NC) | PTE_S2_RDONLY)
>  #define PAGE_S2_DEVICE		__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_UXN)
>  
>  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_TYPE_MASK) | PTE_PROT_NONE | PTE_PXN | PTE_UXN)
> -- 
> 2.1.0
> 


* Re: [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
@ 2015-05-14 10:55     ` Christoffer Dall
  0 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 10:55 UTC (permalink / raw)
  To: Andrew Jones
  Cc: ard.biesheuvel, marc.zyngier, catalin.marinas, qemu-devel,
	pbonzini, lersek, kvmarm

On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> When S1 and S2 memory attributes combine wrt to caching policy,
> non-cacheable types take precedence. If a guest maps a region as
> device memory, which KVM userspace is using to emulate the device
> using normal, cacheable memory, then we lose coherency. With
> KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> regions are likely to be problematic. With this patch, as pages
> of these types of regions are faulted into the guest, not only do
> we flush the page's dcache, but we also change userspace's
> mapping to NC in order to maintain coherency.
> 
> What if the guest doesn't do what we expect? While we can't
> force a guest to use cacheable memory, we can take advantage of
> the non-cacheable precedence, and force it to use non-cacheable.
> So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> KVM_MEM_UNCACHED regions to force them to NC.
> 
> We now have both guest and userspace on the same page (pun intended)

I'd like to revisit the overall approach here.  Is doing non-cached
accesses in both the guest and host really the right thing to do here?

The semantics of the device become that it is cache coherent (because
QEMU is), and I think Marc argued that Linux/UEFI should simply be
adapted to handle whatever emulated devices we have as coherent.  I also
remember someone arguing that would be wrong (Peter?).

Finally, does this address all cache coherency issues with emulated
devices?  Some VOS guys had seen things still not working with this
approach, unsure why...  I'd like to avoid us merging this only to merge
a more complete solution in a few weeks which reverts this solution...

More comments/questions below:

> 
> Signed-off-by: Andrew Jones <drjones@redhat.com>
> ---
>  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
>  arch/arm/include/asm/pgtable-3level.h |  1 +
>  arch/arm/include/asm/pgtable.h        |  1 +
>  arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
>  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
>  arch/arm64/include/asm/memory.h       |  1 +
>  arch/arm64/include/asm/pgtable.h      |  1 +
>  7 files changed, 36 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 405aa18833073..e8034a80b12e5 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
>  	while (size) {
>  		void *va = kmap_atomic_pfn(pfn);
>  
> -		if (need_flush)
> +		if (need_flush) {
>  			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> +			if (ipa_uncached)
> +				set_memory_nc((unsigned long)va, 1);

nit: consider moving this outside the need_flush

> +		}
>  
>  		if (icache_is_pipt())
>  			__cpuc_coherent_user_range((unsigned long)va,
> diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> index a745a2a53853c..39b3f7a40e663 100644
> --- a/arch/arm/include/asm/pgtable-3level.h
> +++ b/arch/arm/include/asm/pgtable-3level.h
> @@ -121,6 +121,7 @@
>   * 2nd stage PTE definitions for LPAE.
>   */
>  #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
> +#define L_PTE_S2_MT_NORMAL_NC		(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
>  #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
>  #define L_PTE_S2_MT_WRITEBACK		(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
>  #define L_PTE_S2_MT_DEV_SHARED		(_AT(pteval_t, 0x1) << 2) /* device */
> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index f40354198bad4..ae13ca8b0a23d 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -100,6 +100,7 @@ extern pgprot_t		pgprot_s2_device;
>  #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
>  #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
>  #define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
> +#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
>  #define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
>  
>  #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index bc1665acd73e7..6b3bd8061bd2a 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	struct vm_area_struct *vma;
>  	pfn_t pfn;
>  	pgprot_t mem_type = PAGE_S2;
> -	bool fault_ipa_uncached;
> +	bool fault_ipa_uncached = false;
>  	bool logging_active = memslot_is_logging(memslot);
>  	unsigned long flags = 0;
>  
> @@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			writable = false;
>  	}
>  
> +	if (memslot->flags & KVM_MEM_UNCACHED) {
> +		mem_type = PAGE_S2_NORMAL_NC;
> +		fault_ipa_uncached = true;
> +	}
> +
>  	spin_lock(&kvm->mmu_lock);
>  	if (mmu_notifier_retry(kvm, mmu_seq))
>  		goto out_unlock;
> @@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (!hugetlb && !force_pte)
>  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
>  
> -	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
> -
>  	if (hugetlb) {
>  		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
>  		new_pmd = pmd_mkhuge(new_pmd);
> @@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
>  			     unsigned long start,
>  			     unsigned long end,
>  			     int (*handler)(struct kvm *kvm,
> +					    struct kvm_memory_slot *slot,
>  					    gpa_t gpa, void *data),
>  			     void *data)
>  {
> @@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
>  
>  		for (; gfn < gfn_end; ++gfn) {
>  			gpa_t gpa = gfn << PAGE_SHIFT;
> -			ret |= handler(kvm, gpa, data);
> +			ret |= handler(kvm, memslot, gpa, data);
>  		}
>  	}
>  
>  	return ret;
>  }
>  
> -static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> +static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				 gpa_t gpa, void *data)

Maybe we should consider a pointer to a struct with the relevant data to
pass around to the handler by now, which would allow us to get rid of
the void * cast as well.  Not sure if it's worth it.
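
To make the struct idea a bit more concrete, here is a rough userspace
sketch, not a patch.  Everything in it is made up for illustration:
struct hva_handler_ctx, hva_handler_fn, walk_gpas(), the stand-in types,
and the SKETCH_MEM_UNCACHED value are assumptions, not kernel
definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel types; hypothetical, for illustration only. */
typedef uint64_t gpa_t;
struct kvm { int unused; };
struct kvm_memory_slot { uint32_t flags; };
#define SKETCH_MEM_UNCACHED	(1u << 2)	/* flag value assumed for the sketch */

/* Everything a handler needs, bundled so the signature stays stable. */
struct hva_handler_ctx {
	struct kvm *kvm;
	struct kvm_memory_slot *slot;
	gpa_t gpa;
	uint64_t new_pte;	/* typed payload instead of a void *data cast */
};

typedef int (*hva_handler_fn)(struct hva_handler_ctx *ctx);

/* Call handler once per page in [start, end), filling in ctx->gpa. */
static int walk_gpas(struct hva_handler_ctx *ctx, gpa_t start, gpa_t end,
		     gpa_t page_size, hva_handler_fn handler)
{
	int ret = 0;

	assert(page_size != 0);
	for (gpa_t gpa = start; gpa < end; gpa += page_size) {
		ctx->gpa = gpa;
		ret |= handler(ctx);
	}
	return ret;
}

/* Example handler: returns 1 if the page's slot carries the uncached flag. */
static int slot_is_uncached_handler(struct hva_handler_ctx *ctx)
{
	return (ctx->slot->flags & SKETCH_MEM_UNCACHED) ? 1 : 0;
}
```

With something like this, adding another per-call input later means
adding a field to the struct rather than touching every handler
signature again.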

>  {
>  	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
>  	return 0;
> @@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
>  	return 0;
>  }
>  
> -static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> +static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				gpa_t gpa, void *data)
>  {
> -	pte_t *pte = (pte_t *)data;
> +	pte_t pte = *((pte_t *)data);
> +
> +	if (slot->flags & KVM_MEM_UNCACHED)
> +		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
> +	else
> +		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
>  
>  	/*
>  	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
> @@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
>  	 * therefore stage2_set_pte() never needs to clear out a huge PMD
>  	 * through this calling path.
>  	 */
> -	stage2_set_pte(kvm, NULL, gpa, pte, 0);
> +	stage2_set_pte(kvm, NULL, gpa, &pte, 0);

This is making me feel like we should have a separate patch that changes
stage2_set_pte() from taking a pointer to just taking the new pte by
value.
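
Roughly what I have in mind, as a standalone toy rather than kernel
code.  The sketch_* names and the one-entry table are hypothetical; only
the attribute encodings (0xf and 0x5, shifted left by 2) are taken from
the definitions in this patch.  It masks the attribute field instead of
rebuilding the pte with pfn_pte(), which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t val; } pte_t;		/* stand-in type */

/* Stage-2 memory attribute field, same encodings as in the patch. */
#define SKETCH_ATTR_MASK	(0xfULL << 2)
#define SKETCH_S2_NORMAL	(0xfULL << 2)
#define SKETCH_S2_NORMAL_NC	(0x5ULL << 2)

/* Toy "stage-2 table"; hypothetical stand-in for the real page tables. */
#define SKETCH_NR_ENTRIES	1
static pte_t sketch_table[SKETCH_NR_ENTRIES];

/* By-value variant: callers can pass a freshly built pte directly. */
static int sketch_stage2_set_pte(unsigned int idx, pte_t new_pte)
{
	assert(idx < SKETCH_NR_ENTRIES);
	sketch_table[idx] = new_pte;
	return 0;
}

/* Loosely mirrors the handler in the patch: pick attributes per slot. */
static int sketch_set_spte(unsigned int idx, pte_t host_pte, int uncached)
{
	pte_t pte = { (host_pte.val & ~SKETCH_ATTR_MASK) |
		      (uncached ? SKETCH_S2_NORMAL_NC : SKETCH_S2_NORMAL) };

	return sketch_stage2_set_pte(idx, pte);
}
```

The handler then no longer needs a local copy just to have something to
take the address of.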

>  	return 0;
>  }
>  
> @@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
>  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
>  {
>  	unsigned long end = hva + PAGE_SIZE;
> -	pte_t stage2_pte;
>  
>  	if (!kvm->arch.pgd)
>  		return;
>  
>  	trace_kvm_set_spte_hva(hva);
> -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> -	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> +	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);

Hooking in here will make sure you catch changes to the page used for
the mapping, but wouldn't that also mean that the userspace mapping
would have been changed, and where are you updating this?

Also, is this called if the userspace mapping is zapped without doing
anything about the underlying page?  (how do we then catch when the
userspace pte is populated again, and is this even possible?)

>  }
>  
> -static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> +static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> +			       gpa_t gpa, void *data)
>  {
>  	pmd_t *pmd;
>  	pte_t *pte;
> @@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
>  	return 0;
>  }
>  
> -static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> +static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> +				    gpa_t gpa, void *data)
>  {
>  	pmd_t *pmd;
>  	pte_t *pte;
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 61505676d0853..af5f0f0eccef9 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
>  {
>  	void *va = page_address(pfn_to_page(pfn));
>  
> -	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
> +	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
>  		kvm_flush_dcache_to_poc(va, size);
> +		if (ipa_uncached)
> +			set_memory_nc((unsigned long)va, size/PAGE_SIZE);

Are you not setting the kernel mapping of the page to non-cached here?
That doesn't affect your userspace mappings at all.

(this would explain why things still break with this series).

> +	}
>  
>  	if (!icache_is_aliasing()) {		/* PIPT */
>  		flush_icache_range((unsigned long)va,
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index f800d45ea2265..800730f7aa7d9 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -105,6 +105,7 @@
>   * Memory types for Stage-2 translation
>   */
>  #define MT_S2_NORMAL		0xf
> +#define MT_S2_NORMAL_NC		0x5
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
>  #ifndef __ASSEMBLY__
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 56283f8a675c5..a254925ce1b6b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -78,6 +78,7 @@ extern void __pgd_error(const char *file, int line, unsigned long val);
>  #define PAGE_HYP_DEVICE		__pgprot(PROT_DEVICE_nGnRE | PTE_HYP)
>  
>  #define PAGE_S2			__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)
> +#define PAGE_S2_NORMAL_NC	__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL_NC) | PTE_S2_RDONLY)
>  #define PAGE_S2_DEVICE		__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_UXN)
>  
>  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_TYPE_MASK) | PTE_PROT_NONE | PTE_PXN | PTE_UXN)
> -- 
> 2.1.0
> 

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-13 11:31   ` Andrew Jones
@ 2015-05-14 11:05     ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 11:05 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> Provide a method to change normal, cacheable memory to non-cacheable.
> KVM will make use of this to keep emulated device memory regions
> coherent with the guest.
> 
> Signed-off-by: Andrew Jones <drjones@redhat.com>

Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>

But you obviously need Russell and Will/Catalin to ack/merge this.

-Christoffer

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 10:30   ` Christoffer Dall
@ 2015-05-14 11:09     ` Laszlo Ersek
  -1 siblings, 0 replies; 102+ messages in thread
From: Laszlo Ersek @ 2015-05-14 11:09 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, qemu-devel, agraf, pbonzini, j.fanguede, kvmarm,
	m.smarduch

On 05/14/15 12:30, Christoffer Dall wrote:
> On Wed, May 13, 2015 at 01:31:51PM +0200, Andrew Jones wrote:
>> Introduce a new memory region flag, KVM_MEM_UNCACHED, which is
>> needed by ARM. This flag informs KVM that the given memory region
>> is typically mapped by the guest as non-cacheable. KVM for ARM
>> then ensures that that memory is indeed mapped non-cacheable by
>> the guest, and also remaps that region as non-cacheable for
>> userspace, allowing them both to maintain a coherent view.
>>
>> Changes since v1:
>>  1) don't pin pages [Paolo]
>>  2) ensure the guest maps the memory non-cacheable [me]
>>  3) clean up memslot flag documentation [Christoffer]
>> changes 1 and 2 effectively redesigned/rewrote v1. Find v1 here
>> http://www.spinics.net/lists/kvm-arm/msg14022.html
>>
>> The QEMU series for v1 hasn't really changed. Only the linux
>> header hack needed to bump KVM_CAP_UNCACHED_MEM from 107 to
>> 116.  Find the series here
>> http://www.spinics.net/lists/kvm-arm/msg14026.html
>>
>> Testing:
>> This series still needs lots of testing, but I thought I'd
>> kick it to the list early, as there's been recent interest
>> in solving this problem, and I'd like to get test results
>> and opinions on this approach from others sooner than later.
>> I've tested with AAVMF (UEFI for AArch64 mach-virt guests).
>> AAVMF has a kludge in it to avoid the coherency problem.
> 
> How does the 'kludge' work?

https://github.com/tianocore/edk2/commit/f9a8be42

(It's probably worth looking at the documentation in the first hunk too,
under the commit message.)

Thanks
Laszlo


> 
>> I've tested both with and without that kludge active. Both
>> worked for me (almost). Sometimes with the non-kludged
>> version I was still able to see a bit of corruption in
>> grub's output after edk2 loaded it - not much, and not always,
>> but something.
> 
> Remind me, this is a VGA framebuffer corruption with a PCI-plugged VGA
> card?
> 
> Thanks,
> -Christoffer
> 
>> Anyway, it's quite frustrating, as I'm not sure
>> what I'm missing...
>>
>> This series applies to Linus' 110bc76729d4, but I tested with
>> a version backported to the current RHELSA kernel.
>>
>> Thanks for reviews and testing!
>>
>> drew
>>
>>
>> Andrew Jones (3):
>>   arm/arm64: pageattr: add set_memory_nc
>>   KVM: promote KVM_MEMSLOT_INCOHERENT to uapi
>>   arm/arm64: KVM: implement 'uncached' mem coherency
>>
>>  Documentation/virtual/kvm/api.txt     | 20 ++++++++++++------
>>  arch/arm/include/asm/cacheflush.h     |  1 +
>>  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
>>  arch/arm/include/asm/pgtable-3level.h |  1 +
>>  arch/arm/include/asm/pgtable.h        |  1 +
>>  arch/arm/include/uapi/asm/kvm.h       |  1 +
>>  arch/arm/kvm/arm.c                    |  1 +
>>  arch/arm/kvm/mmu.c                    | 39 ++++++++++++++++++++++-------------
>>  arch/arm/mm/pageattr.c                |  7 +++++++
>>  arch/arm64/include/asm/cacheflush.h   |  1 +
>>  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
>>  arch/arm64/include/asm/memory.h       |  1 +
>>  arch/arm64/include/asm/pgtable.h      |  1 +
>>  arch/arm64/include/uapi/asm/kvm.h     |  1 +
>>  arch/arm64/mm/pageattr.c              |  8 +++++++
>>  include/linux/kvm_host.h              |  1 -
>>  include/uapi/linux/kvm.h              |  2 ++
>>  virt/kvm/kvm_main.c                   |  7 ++++++-
>>  18 files changed, 79 insertions(+), 24 deletions(-)
>>
>> -- 
>> 2.1.0
>>

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 11:09     ` Laszlo Ersek
@ 2015-05-14 11:29       ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 11:29 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, qemu-devel, agraf, pbonzini, j.fanguede, kvmarm,
	m.smarduch

On Thu, May 14, 2015 at 01:09:34PM +0200, Laszlo Ersek wrote:
> On 05/14/15 12:30, Christoffer Dall wrote:
> > On Wed, May 13, 2015 at 01:31:51PM +0200, Andrew Jones wrote:
> >> Introduce a new memory region flag, KVM_MEM_UNCACHED, which is
> >> needed by ARM. This flag informs KVM that the given memory region
> >> is typically mapped by the guest as non-cacheable. KVM for ARM
> >> then ensures that that memory is indeed mapped non-cacheable by
> >> the guest, and also remaps that region as non-cacheable for
> >> userspace, allowing them both to maintain a coherent view.
> >>
> >> Changes since v1:
> >>  1) don't pin pages [Paolo]
> >>  2) ensure the guest maps the memory non-cacheable [me]
> >>  3) clean up memslot flag documentation [Christoffer]
> >> changes 1 and 2 effectively redesigned/rewrote v1. Find v1 here
> >> http://www.spinics.net/lists/kvm-arm/msg14022.html
> >>
> >> The QEMU series for v1 hasn't really changed. Only the linux
> >> header hack needed to bump KVM_CAP_UNCACHED_MEM from 107 to
> >> 116.  Find the series here
> >> http://www.spinics.net/lists/kvm-arm/msg14026.html
> >>
> >> Testing:
> >> This series still needs lots of testing, but I thought I'd
> >> kick it to the list early, as there's been recent interest
> >> in solving this problem, and I'd like to get test results
> >> and opinions on this approach from others sooner than later.
> >> I've tested with AAVMF (UEFI for AArch64 mach-virt guests).
> >> AAVMF has a kludge in it to avoid the coherency problem.
> > 
> > How does the 'kludge' work?
> 
> https://github.com/tianocore/edk2/commit/f9a8be42
> 
> (It's probably worth looking at the documentation in the first hunk too,
> under the commit message.)
> 
Why is this a hack/unintuitive?  Is the semantics of the QEMU PCI bus
not simply that MMIO regions are coherent?

-Christoffer

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 11:29       ` Christoffer Dall
@ 2015-05-14 11:31         ` Paolo Bonzini
  -1 siblings, 0 replies; 102+ messages in thread
From: Paolo Bonzini @ 2015-05-14 11:31 UTC (permalink / raw)
  To: Christoffer Dall, Laszlo Ersek
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, qemu-devel, agraf, j.fanguede, kvmarm,
	m.smarduch



On 14/05/2015 13:29, Christoffer Dall wrote:
> > (It's probably worth looking at the documentation in the first hunk too,
> > under the commit message.)
> 
> Why is this a hack/unintuitive?  Is the semantics of the QEMU PCI bus
> not simply that MMIO regions are coherent?

Only until device assignment gets into the picture.

Paolo

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 11:31         ` Paolo Bonzini
@ 2015-05-14 11:36           ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 11:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, qemu-devel, agraf, j.fanguede, Laszlo Ersek,
	kvmarm, m.smarduch

On Thu, May 14, 2015 at 01:31:03PM +0200, Paolo Bonzini wrote:
> 
> 
> On 14/05/2015 13:29, Christoffer Dall wrote:
> > > (It's probably worth looking at the documentation in the first hunk too,
> > > under the commit message.)
> > 
> > Why is this a hack/unintuitive?  Is the semantics of the QEMU PCI bus
> > not simply that MMIO regions are coherent?
> 
> Only until device assignment gets into the picture.
> 
Will UEFI have to deal with device assignment in any respect?

-Christoffer

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 11:36           ` Christoffer Dall
@ 2015-05-14 11:38             ` Paolo Bonzini
  -1 siblings, 0 replies; 102+ messages in thread
From: Paolo Bonzini @ 2015-05-14 11:38 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, qemu-devel, agraf, j.fanguede, Laszlo Ersek,
	kvmarm, m.smarduch



On 14/05/2015 13:36, Christoffer Dall wrote:
> > > > (It's probably worth looking at the documentation in the first hunk too,
> > > > under the commit message.)
> > > 
> > > Why is this a hack/unintuitive?  Is the semantics of the QEMU PCI bus
> > > not simply that MMIO regions are coherent?
> > 
> > Only until device assignment gets into the picture.
> 
> Will UEFI have to deal with device assignment in any respect?

Why not?  For example you could do network boot from an assigned network
card.

In fact, anything that UEFI has to deal with, the OS has to deal with
too.  If you need a UEFI hack, chances are you need or will need a Linux
hack too.

Paolo

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 11:38             ` Paolo Bonzini
@ 2015-05-14 12:00               ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 12:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, qemu-devel, agraf, j.fanguede, Laszlo Ersek,
	kvmarm, m.smarduch

On Thu, May 14, 2015 at 01:38:38PM +0200, Paolo Bonzini wrote:
> 
> 
> On 14/05/2015 13:36, Christoffer Dall wrote:
> > > > > (It's probably worth looking at the documentation in the first hunk too,
> > > > > under the commit message.)
> > > > 
> > > > Why is this a hack/unintuitive?  Is the semantics of the QEMU PCI bus
> > > > not simply that MMIO regions are coherent?
> > > 
> > > Only until device assignment gets into the picture.
> > 
> > Will UEFI have to deal with device assignment in any respect?
> 
> Why not?  For example you could do network boot from an assigned network
> card.
> 
> In fact, anything that UEFI has to deal with, the OS has to deal with
> too.  If you need a UEFI hack, chances are you need or will need a Linux
> hack too.
> 
Fair enough.  I was thinking that UEFI needs to be built with knowledge
of all the hardware present including any passthrough devices, but I
guess this is plainly not true with PCI (and might not even be true with
the level of DT parsing we do for the virtual platform).

So, getting back to my original question.  Is the point then that UEFI
must assume (from ACPI/DT) the cache-coherency properties of the PCI
controller which exists in hardware on the system you're running on,
even for the virtual PCI bus because that will be the semantics for
assigned devices?

And in that case, we have no way to distinguish between passthrough
devices and virtual devices plugged into the virtual PCI bus?

What about the idea of having two virtual PCI buses on your system where
one is always cache-coherent and used for virtual devices, and the other
is whatever the hardware is and used for passthrough devices?

-Christoffer

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 12:00               ` Christoffer Dall
@ 2015-05-14 12:08                 ` Paolo Bonzini
  -1 siblings, 0 replies; 102+ messages in thread
From: Paolo Bonzini @ 2015-05-14 12:08 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, Michael S. Tsirkin, qemu-devel, agraf,
	j.fanguede, Laszlo Ersek, kvmarm, m.smarduch



On 14/05/2015 14:00, Christoffer Dall wrote:
> So, getting back to my original question.  Is the point then that UEFI
> must assume (from ACPI/DT) the cache-coherency properties of the PCI
> controller which exists in hardware on the system you're running on,
> even for the virtual PCI bus because that will be the semantics for
> assigned devices?
> 
> And in that case, we have no way to distinguish between passthrough
> devices and virtual devices plugged into the virtual PCI bus?

Well, we could use the subsystem id.  But it's a hack, and may cause
incompatibilities with some drivers.  Michael, any ideas?
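
For illustration only, the check would look something like the sketch
below.  It assumes the default subsystem IDs QEMU gives its emulated
devices (vendor 0x1af4, device 0x1100); the sketch_* names are made up,
and any device that sets its own subsystem IDs would be misclassified,
which is part of why this is a hack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed QEMU defaults for emulated devices; not a guaranteed contract. */
#define SKETCH_SUBVENDOR_REDHAT_QUMRANET	0x1af4
#define SKETCH_SUBDEVICE_QEMU			0x1100

/* Subsystem IDs as read from a type 0 config header (offsets 0x2c, 0x2e). */
struct sketch_pci_ids {
	uint16_t subsys_vendor;
	uint16_t subsys_device;
};

/* Heuristic: true if the IDs match the assumed QEMU defaults. */
static bool sketch_looks_qemu_emulated(const struct sketch_pci_ids *ids)
{
	assert(ids != NULL);
	return ids->subsys_vendor == SKETCH_SUBVENDOR_REDHAT_QUMRANET &&
	       ids->subsys_device == SKETCH_SUBDEVICE_QEMU;
}
```

An assigned device would normally keep the IDs of the physical hardware,
so it would fail this test, but nothing enforces that.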

> What about the idea of having two virtual PCI buses on your system where
> one is always cache-coherent and used for virtual devices, and the other
> is whatever the hardware is and used for passthrough devices?

I think that was rejected before.

Paolo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 12:08                 ` Paolo Bonzini
@ 2015-05-14 12:24                   ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 12:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, Michael S. Tsirkin, qemu-devel, agraf,
	j.fanguede, Laszlo Ersek, kvmarm, m.smarduch

On Thu, May 14, 2015 at 02:08:49PM +0200, Paolo Bonzini wrote:
> 
> 
> On 14/05/2015 14:00, Christoffer Dall wrote:
> > So, getting back to my original question.  Is the point then that UEFI
> > must assume (from ACPI/DT) the cache-coherency properties of the PCI
> > controller which exists in hardware on the system you're running on,
> > even for the virtual PCI bus because that will be the semantics for
> > assigned devices?
> > 
> > And in that case, we have no way to distinguish between passthrough
> > devices and virtual devices plugged into the virtual PCI bus?
> 
> Well, we could use the subsystem id.  But it's a hack, and may cause
> incompatibilities with some drivers.  Michael, any ideas?
> 
> > What about the idea of having two virtual PCI buses on your system where
> > one is always cache-coherent and used for virtual devices, and the other
> > is whatever the hardware is and used for passthrough devices?
> 
> I think that was rejected before.
> 

Do you remember where?  I just remember Catalin mentioning the idea to
me verbally.

Besides the slightly heavy added use of resources etc. it seems like it
would address some of our issues in a good way.

But I'm still not sure why UEFI/Linux currently sees our PCI bus as
being non-coherent when in fact it is and we have no passthrough issues
currently.  Are all PCI controllers always non-coherent for some reason
and therefore we model it as such too?

-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 12:24                   ` Christoffer Dall
@ 2015-05-14 12:28                     ` Paolo Bonzini
  -1 siblings, 0 replies; 102+ messages in thread
From: Paolo Bonzini @ 2015-05-14 12:28 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, Michael S. Tsirkin, qemu-devel, agraf,
	j.fanguede, Laszlo Ersek, kvmarm, m.smarduch



On 14/05/2015 14:24, Christoffer Dall wrote:
> On Thu, May 14, 2015 at 02:08:49PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 14/05/2015 14:00, Christoffer Dall wrote:
>>> So, getting back to my original question.  Is the point then that UEFI
>>> must assume (from ACPI/DT) the cache-coherency properties of the PCI
>>> controller which exists in hardware on the system you're running on,
>>> even for the virtual PCI bus because that will be the semantics for
>>> assigned devices?
>>>
>>> And in that case, we have no way to distinguish between passthrough
>>> devices and virtual devices plugged into the virtual PCI bus?
>>
>> Well, we could use the subsystem id.  But it's a hack, and may cause
>> incompatibilities with some drivers.  Michael, any ideas?
>>
>>> What about the idea of having two virtual PCI buses on your system where
>>> one is always cache-coherent and used for virtual devices, and the other
>>> is whatever the hardware is and used for passthrough devices?
>>
>> I think that was rejected before.
> 
> Do you remember where?  I just remember Catalin mentioning the idea to
> me verbally.

In the last centithread on the subject. :)

At least I and Peter disagreed.  It's not about the heavy added use of
resources, it's more about it being really easy to misconfigure.

> But I'm still not sure why UEFI/Linux currently sees our PCI bus as
> being non-coherent when in fact it is and we have no passthrough issues
> currently.  Are all PCI controllers always non-coherent for some reason
> and therefore we model it as such too?

Well, PCI BARs are generally MMIO resources, and hence should not be cached.

As an optimization, OS drivers can mark them as cacheable or
write-combining or something like that, but in general it's a safe
default to leave them uncached---one would think.

Paolo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 12:28                     ` Paolo Bonzini
@ 2015-05-14 12:34                       ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-14 12:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, Michael S. Tsirkin, qemu-devel, agraf,
	j.fanguede, Laszlo Ersek, kvmarm, m.smarduch

On Thu, May 14, 2015 at 02:28:49PM +0200, Paolo Bonzini wrote:
> 
> 
> On 14/05/2015 14:24, Christoffer Dall wrote:
> > On Thu, May 14, 2015 at 02:08:49PM +0200, Paolo Bonzini wrote:
> >>
> >>
> >> On 14/05/2015 14:00, Christoffer Dall wrote:
> >>> So, getting back to my original question.  Is the point then that UEFI
> >>> must assume (from ACPI/DT) the cache-coherency properties of the PCI
> >>> controller which exists in hardware on the system you're running on,
> >>> even for the virtual PCI bus because that will be the semantics for
> >>> assigned devices?
> >>>
> >>> And in that case, we have no way to distinguish between passthrough
> >>> devices and virtual devices plugged into the virtual PCI bus?
> >>
> >> Well, we could use the subsystem id.  But it's a hack, and may cause
> >> incompatibilities with some drivers.  Michael, any ideas?
> >>
> >>> What about the idea of having two virtual PCI buses on your system where
> >>> one is always cache-coherent and used for virtual devices, and the other
> >>> is whatever the hardware is and used for passthrough devices?
> >>
> >> I think that was rejected before.
> > 
> > Do you remember where?  I just remember Catalin mentioning the idea to
> > me verbally.
> 
> In the last centithread on the subject. :)
> 
> At least I and Peter disagreed.  It's not about the heavy added use of
> resources, it's more about it being really easy to misconfigure.
> 
> > But I'm still not sure why UEFI/Linux currently sees our PCI bus as
> > being non-coherent when in fact it is and we have no passthrough issues
> > currently.  Are all PCI controllers always non-coherent for some reason
> > and therefore we model it as such too?
> 
> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
> 
> As an optimization, OS drivers can mark them as cacheable or
> write-combining or something like that, but in general it's a safe
> default to leave them uncached---one would think.
> 
ok, I guess this series makes sense then, assuming it works, and
assuming we don't kill performance by going to RAM all the time when we
don't have to...

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 12:28                     ` Paolo Bonzini
@ 2015-05-14 12:38                       ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2015-05-14 12:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrew Jones, Ard Biesheuvel, Marc Zyngier, Catalin Marinas,
	Michael S. Tsirkin, QEMU Developers, Alexander Graf,
	Jérémy Fanguède, Laszlo Ersek, kvmarm,
	Christoffer Dall, Mario Smarduch

On 14 May 2015 at 13:28, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
>
> As an optimization, OS drivers can mark them as cacheable or
> write-combining or something like that, but in general it's a safe
> default to leave them uncached---one would think.

Isn't this handled by the OS mapping them in the 'prefetchable'
MMIO window rather than the 'non-prefetchable' one? (QEMU's
generic-PCIe device doesn't yet support the prefetchable window.)
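
For reference, a minimal sketch of how the prefetchable property is
advertised by a memory BAR, following the standard PCI BAR layout (bit 0
selects I/O vs. memory space, bits 2:1 give the memory type, bit 3 is the
prefetchable flag). The helper names are made up for illustration; whether
a guest then maps a prefetchable BAR as cacheable is a separate question
that this does not answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_BAR_SPACE_IO       0x1u  /* bit 0: 1 = I/O port BAR, 0 = memory BAR */
#define PCI_BAR_MEM_TYPE_MASK  0x6u  /* bits 2:1: 00 = 32-bit, 10 = 64-bit */
#define PCI_BAR_MEM_TYPE_64    0x4u
#define PCI_BAR_MEM_PREFETCH   0x8u  /* bit 3: prefetchable */

/* True if the raw BAR value describes a memory (MMIO) BAR. */
bool bar_is_memory(uint32_t bar)
{
    return (bar & PCI_BAR_SPACE_IO) == 0;
}

/* True if the BAR is a memory BAR with the prefetchable bit set. */
bool bar_is_prefetchable(uint32_t bar)
{
    return bar_is_memory(bar) && (bar & PCI_BAR_MEM_PREFETCH) != 0;
}

/* True if the BAR is the low half of a 64-bit memory BAR. */
bool bar_is_64bit(uint32_t bar)
{
    return bar_is_memory(bar) &&
           (bar & PCI_BAR_MEM_TYPE_MASK) == PCI_BAR_MEM_TYPE_64;
}

/* Name of the host bridge window a memory BAR would be placed in. */
const char *bar_window(uint32_t bar)
{
    assert(bar_is_memory(bar)); /* I/O BARs do not live in an MMIO window */
    return (bar & PCI_BAR_MEM_PREFETCH) ? "prefetchable" : "non-prefetchable";
}
```

With this decoding, a raw BAR value of 0xfe00000c would be a 64-bit
prefetchable memory BAR, and 0xfe000000 a 32-bit non-prefetchable one.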

-- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 12:38                       ` Peter Maydell
@ 2015-05-14 13:00                         ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-14 13:00 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Michael S. Tsirkin, Marc Zyngier, Catalin Marinas,
	Ard Biesheuvel, QEMU Developers, Alexander Graf, Paolo Bonzini,
	Jérémy Fanguède, Laszlo Ersek, kvmarm,
	Christoffer Dall, Mario Smarduch

On Thu, May 14, 2015 at 01:38:11PM +0100, Peter Maydell wrote:
> On 14 May 2015 at 13:28, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Well, PCI BARs are generally MMIO resources, and hence should not be cached.
> >
> > As an optimization, OS drivers can mark them as cacheable or
> > write-combining or something like that, but in general it's a safe
> > default to leave them uncached---one would think.
> 
> Isn't this handled by the OS mapping them in the 'prefetchable'
> MMIO window rather than the 'non-prefetchable' one? (QEMU's
> generic-PCIe device doesn't yet support the prefetchable window.)

I was thinking (with my limited PCI knowledge) the same thing, and
was planning on experimenting with that.

drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 12:34                       ` Christoffer Dall
@ 2015-05-14 13:01                         ` Laszlo Ersek
  -1 siblings, 0 replies; 102+ messages in thread
From: Laszlo Ersek @ 2015-05-14 13:01 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, marc.zyngier,
	catalin.marinas, Michael S. Tsirkin, qemu-devel, agraf,
	Paolo Bonzini, j.fanguede, kvmarm, m.smarduch

On 05/14/15 14:34, Christoffer Dall wrote:
> On Thu, May 14, 2015 at 02:28:49PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 14/05/2015 14:24, Christoffer Dall wrote:
>>> On Thu, May 14, 2015 at 02:08:49PM +0200, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 14/05/2015 14:00, Christoffer Dall wrote:
>>>>> So, getting back to my original question.  Is the point then that UEFI
>>>>> must assume (from ACPI/DT) the cache-coherency properties of the PCI
>>>>> controller which exists in hardware on the system you're running on,
>>>>> even for the virtual PCI bus because that will be the semantics for
>>>>> assigned devices?
>>>>>
>>>>> And in that case, we have no way to distinguish between passthrough
>>>>> devices and virtual devices plugged into the virtual PCI bus?
>>>>
>>>> Well, we could use the subsystem id.  But it's a hack, and may cause
>>>> incompatibilities with some drivers.  Michael, any ideas?
>>>>
>>>>> What about the idea of having two virtual PCI buses on your system where
>>>>> one is always cache-coherent and used for virtual devices, and the other
>>>>> is whatever the hardware is and used for passthrough devices?
>>>>
>>>> I think that was rejected before.
>>>
>>> Do you remember where?  I just remember Catalin mentioning the idea to
>>> me verbally.
>>
>> In the last centithread on the subject. :)
>>
>> At least I and Peter disagreed.  It's not about the heavy added use of
>> resources, it's more about it being really easy to misconfigure.
>>
>>> But I'm still not sure why UEFI/Linux currently sees our PCI bus as
>>> being non-coherent when in fact it is and we have no passthrough issues
>>> currently.  Are all PCI controllers always non-coherent for some reason
>>> and therefore we model it as such too?
>>
>> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
>>
>> As an optimization, OS drivers can mark them as cacheable or
>> write-combining or something like that, but in general it's a safe
>> default to leave them uncached---one would think.
>>
> ok, I guess this series makes sense then, assuming it works, and
> assuming we don't kill performance by going to RAM all the time when we
> don't have to...

The idea Paolo and I had discussed in the past is:
- Remove the kludge from UEFI, and map all MMIO BARs as uncached by
  default. This should be a theoretically correct approach, and for
  assigned devices, correct in practice too.

- At an appropriate place in the firmware (specifically, right before
  this line: [1]), when PCI devices have been enumerated, but their
  particular drivers (especially VGA) have not been connected yet,
  check the subsystem id / vendor id / etc for each, and if we can tell
  it's virtual, then set the attributes for all of its MMIO bars to
  writeback.
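
The two steps above could be sketched roughly as follows. This is only an
illustration of the decision logic, not edk2 code: the struct, the enum and
the function names are hypothetical, and the IDs are assumptions (0x1af4 as
the virtio vendor ID and QEMU's default subsystem vendor ID, 0x1b36 as the
vendor ID of other QEMU-emulated devices). An emulated device that keeps a
real vendor ID is only caught here through its subsystem vendor ID, which is
the part called a hack earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Memory attribute the firmware would apply to a device's MMIO BARs. */
enum bar_attr { BAR_ATTR_UNCACHED, BAR_ATTR_WRITEBACK };

/* Hypothetical, trimmed-down view of the IDs in a PCI config header. */
struct pci_ids {
    uint16_t vendor_id;
    uint16_t device_id;
    uint16_t subsys_vendor_id;
    uint16_t subsys_id;
};

/* Assumed IDs, see the note above. */
#define VENDOR_REDHAT_QUMRANET 0x1af4
#define VENDOR_REDHAT          0x1b36

/* Guess whether the device is emulated by the VMM rather than assigned. */
bool pci_dev_looks_virtual(const struct pci_ids *ids)
{
    assert(ids != NULL);
    return ids->vendor_id == VENDOR_REDHAT_QUMRANET ||
           ids->vendor_id == VENDOR_REDHAT ||
           ids->subsys_vendor_id == VENDOR_REDHAT_QUMRANET;
}

/* Uncached by default; writeback only when the device looks virtual. */
enum bar_attr choose_bar_attr(const struct pci_ids *ids)
{
    return pci_dev_looks_virtual(ids) ? BAR_ATTR_WRITEBACK
                                      : BAR_ATTR_UNCACHED;
}
```

Keeping uncached as the fallback means a misdetected assigned device stays
on the theoretically correct mapping.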

It doesn't seem hard to implement, I've just been shying away from
actually coding it up because I'd like to see it make a difference for
an actual assigned device. That is, reproducing the current (statically
kludged) behavior wouldn't be hard, but I prefer not to write a new
patch until I can test it both ways. UC is broken and WB works for
virtual devices, fine; now let's see if the exact reverse holds for
assigned devices in reality.

... Testing of which will require someone to send a PCI card (NIC or
GPU) -- with an AArch64 UEFI driver oprom on it -- to my place, so that
I can plug it into my Mustang. ;)

Thanks
Laszlo

[1]
https://github.com/tianocore/edk2/blob/99d9ade85aad554a0fa08fff8586b0fd40570ac3/ArmPlatformPkg/ArmVirtualizationPkg/Library/PlatformIntelBdsLib/IntelBdsPlatform.c#L366

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 10:37     ` Peter Maydell
@ 2015-05-14 13:03       ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-14 13:03 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, QEMU Developers,
	Alexander Graf, Paolo Bonzini, Laszlo Ersek, kvmarm,
	Christoffer Dall

On Thu, May 14, 2015 at 11:37:46AM +0100, Peter Maydell wrote:
> On 14 May 2015 at 11:31, Andrew Jones <drjones@redhat.com> wrote:
> > Forgot to (4): switch from setting userspace's mapping to
> > device memory to normal, non-cacheable. Using device memory
> > caused a problem that Alex Graf found, and Peter Maydell suggested
> > using normal, non-cacheable instead.
> 
> Did you check that non-cacheable is definitely the correct
> kind of Normal memory attribute we want? (ie not write-through).

I was concerned that write-through wouldn't be sufficient. If the
guest writes to its non-cached memory, and QEMU needs to see what
it wrote, then won't write-through fail to work? Unless we somehow
invalidate the cache first?
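
A toy model of the concern, as a sketch only: one word of RAM plus one
host-side cache line, under the assumption that the hardware does not keep
the guest's non-cacheable alias and the host's cacheable alias coherent.
All names here are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One word of RAM plus one host cache line that may hold a copy of it. */
struct toy_mem {
    uint32_t ram;
    bool     host_line_valid;
    uint32_t host_line;
};

/* Guest store through a non-cacheable mapping: updates RAM only. */
void guest_write_nc(struct toy_mem *m, uint32_t v)
{
    assert(m != NULL);
    m->ram = v;
}

/* Host load through a cacheable mapping (write-back or write-through):
 * a hit returns the cached copy without looking at RAM. */
uint32_t host_read_cacheable(struct toy_mem *m)
{
    if (!m->host_line_valid) {
        m->host_line = m->ram;
        m->host_line_valid = true;
    }
    return m->host_line;
}

/* Host load through a non-cacheable mapping: always reads RAM. */
uint32_t host_read_nc(const struct toy_mem *m)
{
    return m->ram;
}

/* Host drops its cached copy. */
void host_invalidate(struct toy_mem *m)
{
    m->host_line_valid = false;
}

/* What a cacheable host mapping sees after the guest overwrites 1 with 2. */
uint32_t host_cacheable_view(bool invalidate_first)
{
    struct toy_mem m = { .ram = 1, .host_line_valid = false, .host_line = 0 };
    (void)host_read_cacheable(&m);  /* host caches the old value */
    guest_write_nc(&m, 2);          /* guest updates RAM behind the cache */
    if (invalidate_first)
        host_invalidate(&m);
    return host_read_cacheable(&m);
}

/* What a non-cacheable host mapping sees in the same scenario. */
uint32_t host_nc_view(void)
{
    struct toy_mem m = { .ram = 1, .host_line_valid = false, .host_line = 0 };
    (void)host_read_cacheable(&m);
    guest_write_nc(&m, 2);
    return host_read_nc(&m);
}
```

In this model the cacheable host read returns the stale value unless the
line is invalidated first, while the non-cacheable read always sees the
guest's write. Write-through only changes how host stores reach RAM, not
how host loads are served.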

drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 13:03       ` Andrew Jones
@ 2015-05-14 13:11         ` Peter Maydell
  -1 siblings, 0 replies; 102+ messages in thread
From: Peter Maydell @ 2015-05-14 13:11 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, QEMU Developers,
	Alexander Graf, Paolo Bonzini, Laszlo Ersek, kvmarm,
	Christoffer Dall

On 14 May 2015 at 14:03, Andrew Jones <drjones@redhat.com> wrote:
> On Thu, May 14, 2015 at 11:37:46AM +0100, Peter Maydell wrote:
>> On 14 May 2015 at 11:31, Andrew Jones <drjones@redhat.com> wrote:
>> > Forgot to (4): switch from setting userspace's mapping to
>> > device memory to normal, non-cacheable. Using device memory
>> > caused a problem that Alex Graf found, and Peter Maydell suggested
>> > using normal, non-cacheable instead.
>>
>> Did you check that non-cacheable is definitely the correct
>> kind of Normal memory attribute we want? (ie not write-through).
>
> I was concerned that write-through wouldn't be sufficient. If the
> guest writes to its non-cached memory, and QEMU needs to see what
> it wrote, then won't write-through fail to work? Unless we somehow
> invalidate the cache first?

Well, I meant more that the correct mapping for userspace is
the same as the guest, whatever that is, and so somebody needs
to look at what the guest actually does rather than merely
hoping NormalNC is OK. (For instance, do we need to provide
support for QEMU to map both NC and writethrough?)

-- PMM

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 13:00                         ` Andrew Jones
@ 2015-05-14 13:32                           ` Laszlo Ersek
  -1 siblings, 0 replies; 102+ messages in thread
From: Laszlo Ersek @ 2015-05-14 13:32 UTC (permalink / raw)
  To: Andrew Jones, Peter Maydell
  Cc: Michael S. Tsirkin, Marc Zyngier, Catalin Marinas,
	Ard Biesheuvel, QEMU Developers, Alexander Graf, Paolo Bonzini,
	Jérémy Fanguède, kvmarm, Christoffer Dall,
	Mario Smarduch

On 05/14/15 15:00, Andrew Jones wrote:
> On Thu, May 14, 2015 at 01:38:11PM +0100, Peter Maydell wrote:
>> On 14 May 2015 at 13:28, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
>>>
>>> As an optimization, OS drivers can mark them as cacheable or
>>> write-combining or something like that, but in general it's a safe
>>> default to leave them uncached---one would think.
>>
>> Isn't this handled by the OS mapping them in the 'prefetchable'
>> MMIO window rather than the 'non-prefetchable' one? (QEMU's
>> generic-PCIe device doesn't yet support the prefetchable window.)
> 
> I was thinking (with my limited PCI knowledge) the same thing, and
> was planning on experimenting with that.

This could be supported in UEFI as well, with the following steps:
- The DTB that QEMU provides to UEFI should advertise such a
  prefetchable window.
- The driver in UEFI that parses the DTB should understand that DTB
  node (well, record type), and store the appropriate base & size into
  some new dynamic PCDs (= basically, firmware-wide global variables;
  PCD = platform configuration database).
- The entry point of the host bridge driver would call
  gDS->AddMemorySpace() twice, separately for the two different windows,
  with their appropriate caching attributes.
- The host bridge driver needs to be extended so that TypePMem32
  requests are not rejected (like now); they should be handled
  similarly to TypeMem32. Except, the gDS->AllocateMemorySpace() call
  should allocate from the prefetchable range (determined by the new
  PCDs above).
- QEMU's emulated devices should then expose their BARs as prefetchable
  (so that the above branch would be taken in the host bridge driver).
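
The branch in the steps above can be sketched in plain C, outside
of edk2. This is only an illustration, not the host bridge driver's
code: it decodes the prefetchable bit of a memory BAR (bit 3, per the
PCI BAR layout) and carves the request out of one of two windows. The
names classify_bar(), pick_window() and window_alloc(), and the struct
window type, are hypothetical; the real driver would go through
gDS->AllocateMemorySpace() instead of a bump allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Resource types, loosely named after the TypeMem32/TypePMem32 requests. */
enum bar_type { BAR_IO, BAR_MEM32, BAR_PMEM32, BAR_MEM64, BAR_PMEM64 };

/* Hypothetical MMIO window: [base, base + size), 'next' is the bump pointer. */
struct window {
	uint64_t base;
	uint64_t size;
	uint64_t next;
};

/* Classify the raw low dword read from a PCI Base Address Register. */
static enum bar_type classify_bar(uint32_t bar)
{
	bool prefetchable;

	if (bar & 0x1)			/* bit 0 set: I/O space BAR */
		return BAR_IO;

	prefetchable = (bar & 0x8) != 0;	/* bit 3: prefetchable */

	if (((bar >> 1) & 0x3) == 0x2)	/* bits 2:1 == 10b: 64-bit BAR */
		return prefetchable ? BAR_PMEM64 : BAR_MEM64;

	return prefetchable ? BAR_PMEM32 : BAR_MEM32;
}

/* Pick the window a BAR should be placed in; NULL for I/O BARs. */
static struct window *pick_window(uint32_t bar, struct window *mem,
				  struct window *pmem)
{
	enum bar_type t = classify_bar(bar);

	if (t == BAR_IO)
		return NULL;
	return (t == BAR_PMEM32 || t == BAR_PMEM64) ? pmem : mem;
}

/* Allocate a naturally aligned, power-of-two sized range from a window. */
static bool window_alloc(struct window *w, uint64_t size, uint64_t *addr)
{
	uint64_t start, offset;

	assert(size != 0 && (size & (size - 1)) == 0);

	start = (w->next + size - 1) & ~(size - 1);
	if (start < w->next)		/* alignment wrapped around */
		return false;

	offset = start - w->base;
	if (offset > w->size || size > w->size - offset)
		return false;		/* does not fit in the window */

	*addr = start;
	w->next = start + size;
	return true;
}
```

With two windows set up this way, a BAR whose low dword has bit 3 set
would be served from the prefetchable window, and everything else from
the non-prefetchable one, which is the split the two AddMemorySpace()
calls would describe.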

(Of course, if QEMU intends to emulate PCI devices somewhat
realistically, then QEMU should claim "non-prefetchable" for BARs that
would not be prefetchable on physical hardware either, and then the
hypervisor should accommodate the firmware's UC mapping and say "hey I
know better, we're virtual in fact", and override the attribute (-> use
WB instead of UC). With which we'd be back to square one...)

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
  2015-05-14 10:55     ` Christoffer Dall
@ 2015-05-14 13:32       ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-14 13:32 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	agraf, qemu-devel, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
> On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> > When S1 and S2 memory attributes combine with respect to caching policy,
> > non-cacheable types take precedence. If a guest maps a region as
> > device memory, which KVM userspace is using to emulate the device
> > using normal, cacheable memory, then we lose coherency. With
> > KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> > regions are likely to be problematic. With this patch, as pages
> > of these types of regions are faulted into the guest, not only do
> > we flush the page's dcache, but we also change userspace's
> > mapping to NC in order to maintain coherency.
> > 
> > What if the guest doesn't do what we expect? While we can't
> > force a guest to use cacheable memory, we can take advantage of
> > the non-cacheable precedence, and force it to use non-cacheable.
> > So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> > KVM_MEM_UNCACHED regions to force them to NC.
> > 
> > We now have both guest and userspace on the same page (pun intended)
> 
> I'd like to revisit the overall approach here.  Is doing non-cached
> accesses in both the guest and host really the right thing to do here?

I think so, but all ideas/approaches are still on the table. This is
still an RFC.
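
To make the precedence rule from the commit message concrete, here is
a minimal sketch in C. It is a deliberate simplification: the
architecture combines inner and outer attributes separately and
distinguishes several Device subtypes, while this model collapses
everything into four types ordered from most to least restrictive.
The names combine_s1_s2() and views_agree() are made up for the
illustration, and "agree" here only means that both sides either use
the cache or bypass it.

```c
#include <assert.h>
#include <stdbool.h>

/* Ordered from most restrictive (0) to least restrictive. */
enum mem_type {
	MT_DEVICE = 0,
	MT_NORMAL_NC,
	MT_NORMAL_WT,
	MT_NORMAL_WB,
};

/*
 * Combine the guest's stage 1 attribute with the stage 2 attribute.
 * The less cacheable of the two wins, so stage 2 can force a guest
 * mapping to non-cacheable but can never make it more cacheable.
 */
static enum mem_type combine_s1_s2(enum mem_type s1, enum mem_type s2)
{
	assert(s1 <= MT_NORMAL_WB && s2 <= MT_NORMAL_WB);
	return s1 < s2 ? s1 : s2;
}

static bool is_cacheable(enum mem_type t)
{
	return t >= MT_NORMAL_WT;
}

/*
 * Simplified coherency check: the guest's effective type and the
 * userspace mapping must both go through the cache or both bypass it.
 */
static bool views_agree(enum mem_type guest_effective, enum mem_type user)
{
	return is_cacheable(guest_effective) == is_cacheable(user);
}
```

In this model the original problem is combine_s1_s2(MT_DEVICE,
MT_NORMAL_WB) against a write-back userspace mapping, which does not
agree. Forcing stage 2 to MT_NORMAL_NC and remapping userspace as
MT_NORMAL_NC agrees whatever the guest picks at stage 1.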

> 
> The semantics of the device becomes that it is cache coherent (because
> QEMU is), and I think Marc argued that Linux/UEFI should simply be
> adapted to handle whatever emulated devices we have as coherent.  I also
> remember someone arguing that would be wrong (Peter?).

I'm not really for quirking all devices in all guest types (AAVMF, Linux,
other bootloaders, other OSes). Windows is unlikely to apply any quirks.

> 
> Finally, does this address all cache coherency issues with emulated
> devices?  Some VOS guys had seen things still not working with this
> approach, unsure why...  I'd like to avoid us merging this only to merge
> a more complete solution in a few weeks which reverts this solution...

I'm not sure (this is still an RFT too :-) We definitely would need to
scatter some more memory_region_set_uncached() calls around QEMU first.

> 
> More comments/questions below:
> 
> > 
> > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > ---
> >  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
> >  arch/arm/include/asm/pgtable-3level.h |  1 +
> >  arch/arm/include/asm/pgtable.h        |  1 +
> >  arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
> >  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
> >  arch/arm64/include/asm/memory.h       |  1 +
> >  arch/arm64/include/asm/pgtable.h      |  1 +
> >  7 files changed, 36 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> > index 405aa18833073..e8034a80b12e5 100644
> > --- a/arch/arm/include/asm/kvm_mmu.h
> > +++ b/arch/arm/include/asm/kvm_mmu.h
> > @@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> >  	while (size) {
> >  		void *va = kmap_atomic_pfn(pfn);
> >  
> > -		if (need_flush)
> > +		if (need_flush) {
> >  			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> > +			if (ipa_uncached)
> > +				set_memory_nc((unsigned long)va, 1);
> 
> nit: consider moving this outside the need_flush
> 
> > +		}
> >  
> >  		if (icache_is_pipt())
> >  			__cpuc_coherent_user_range((unsigned long)va,
> > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> > index a745a2a53853c..39b3f7a40e663 100644
> > --- a/arch/arm/include/asm/pgtable-3level.h
> > +++ b/arch/arm/include/asm/pgtable-3level.h
> > @@ -121,6 +121,7 @@
> >   * 2nd stage PTE definitions for LPAE.
> >   */
> >  #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
> > +#define L_PTE_S2_MT_NORMAL_NC		(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
> >  #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
> >  #define L_PTE_S2_MT_WRITEBACK		(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
> >  #define L_PTE_S2_MT_DEV_SHARED		(_AT(pteval_t, 0x1) << 2) /* device */
> > diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> > index f40354198bad4..ae13ca8b0a23d 100644
> > --- a/arch/arm/include/asm/pgtable.h
> > +++ b/arch/arm/include/asm/pgtable.h
> > @@ -100,6 +100,7 @@ extern pgprot_t		pgprot_s2_device;
> >  #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
> >  #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
> >  #define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
> > +#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
> >  #define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
> >  
> >  #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
> > diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> > index bc1665acd73e7..6b3bd8061bd2a 100644
> > --- a/arch/arm/kvm/mmu.c
> > +++ b/arch/arm/kvm/mmu.c
> > @@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	struct vm_area_struct *vma;
> >  	pfn_t pfn;
> >  	pgprot_t mem_type = PAGE_S2;
> > -	bool fault_ipa_uncached;
> > +	bool fault_ipa_uncached = false;
> >  	bool logging_active = memslot_is_logging(memslot);
> >  	unsigned long flags = 0;
> >  
> > @@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  			writable = false;
> >  	}
> >  
> > +	if (memslot->flags & KVM_MEM_UNCACHED) {
> > +		mem_type = PAGE_S2_NORMAL_NC;
> > +		fault_ipa_uncached = true;
> > +	}
> > +
> >  	spin_lock(&kvm->mmu_lock);
> >  	if (mmu_notifier_retry(kvm, mmu_seq))
> >  		goto out_unlock;
> > @@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	if (!hugetlb && !force_pte)
> >  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> >  
> > -	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
> > -
> >  	if (hugetlb) {
> >  		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
> >  		new_pmd = pmd_mkhuge(new_pmd);
> > @@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> >  			     unsigned long start,
> >  			     unsigned long end,
> >  			     int (*handler)(struct kvm *kvm,
> > +					    struct kvm_memory_slot *slot,
> >  					    gpa_t gpa, void *data),
> >  			     void *data)
> >  {
> > @@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> >  
> >  		for (; gfn < gfn_end; ++gfn) {
> >  			gpa_t gpa = gfn << PAGE_SHIFT;
> > -			ret |= handler(kvm, gpa, data);
> > +			ret |= handler(kvm, memslot, gpa, data);
> >  		}
> >  	}
> >  
> >  	return ret;
> >  }
> >  
> > -static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > +static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +				 gpa_t gpa, void *data)
> 
> Maybe we should consider a pointer to a struct with the relevant data to
> pass around to the handler by now, which would allow us to get rid of
> the void * cast as well.  Not sure if it's worth it.
> 
> >  {
> >  	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
> >  	return 0;
> > @@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
> >  	return 0;
> >  }
> >  
> > -static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > +static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +				gpa_t gpa, void *data)
> >  {
> > -	pte_t *pte = (pte_t *)data;
> > +	pte_t pte = *((pte_t *)data);
> > +
> > +	if (slot->flags & KVM_MEM_UNCACHED)
> > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
> > +	else
> > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> >  
> >  	/*
> >  	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
> > @@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> >  	 * therefore stage2_set_pte() never needs to clear out a huge PMD
> >  	 * through this calling path.
> >  	 */
> > -	stage2_set_pte(kvm, NULL, gpa, pte, 0);
> > +	stage2_set_pte(kvm, NULL, gpa, &pte, 0);
> 
> this is making me feel like we should have a separate patch that changes
> stage2_set_pte from taking a pointer to just taking a value for the new
> pte.
> 
> >  	return 0;
> >  }
> >  
> > @@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> >  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> >  {
> >  	unsigned long end = hva + PAGE_SIZE;
> > -	pte_t stage2_pte;
> >  
> >  	if (!kvm->arch.pgd)
> >  		return;
> >  
> >  	trace_kvm_set_spte_hva(hva);
> > -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> > -	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> > +	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
> 
> hooking in here will make sure you catch changes to the page used for
> the mapping, but wouldn't that also mean that the userspace mapping
> would have been changed, and where are you updating this?
> 
> Also, is this called if the userspace mapping is zapped without doing
> anything about the underlying page?  (how do we then catch when the
> userspace pte is populated again, and is this even possible?)

I was hoping that I only needed to worry about getting the S2 attributes
right here, and then, since the page will need to be refaulted into
the guest anyway, that the userspace part would get taken care of at
that point (in user_mem_abort). But, to be honest, I forgot to dig into
this deep enough to know if my hope will work or not.
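
For reference, the stage 2 half of that hope boils down to picking the
memory attribute from the memslot flag. A standalone sketch, not kernel
code: the attribute values 0xf and 0x5 are the MT_S2_NORMAL and
MT_S2_NORMAL_NC values from the patch, while SKETCH_MEM_UNCACHED, struct
sketch_memslot and both helpers are hypothetical stand-ins (the real
flag value comes from patch 1/3). The decoder assumes the usual stage 2
MemAttr layout, with bits 3:2 describing the outer and bits 1:0 the
inner cacheability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MT_S2_NORMAL		0xf	/* from the patch: normal write-back */
#define MT_S2_NORMAL_NC		0x5	/* from the patch: normal non-cacheable */

/* Hypothetical flag bit, standing in for KVM_MEM_UNCACHED. */
#define SKETCH_MEM_UNCACHED	(1u << 2)

/* Hypothetical, stripped-down memslot: only the flags matter here. */
struct sketch_memslot {
	uint32_t flags;
};

/* Choose the stage 2 memory attribute for a page in this slot. */
static uint8_t sketch_s2_memattr(const struct sketch_memslot *slot)
{
	if (slot->flags & SKETCH_MEM_UNCACHED)
		return MT_S2_NORMAL_NC;
	return MT_S2_NORMAL;
}

/*
 * Decode a 4-bit stage 2 MemAttr value. Outer field (bits 3:2):
 * 0 = device, 1 = non-cacheable, 2 = write-through, 3 = write-back.
 * Inner field (bits 1:0) uses the same 1/2/3 meanings for normal memory.
 */
static bool s2_memattr_is_cacheable(uint8_t attr)
{
	uint8_t outer = (attr >> 2) & 0x3;
	uint8_t inner = attr & 0x3;

	return outer >= 0x2 && inner >= 0x2;
}
```

Whether the userspace half gets fixed up on the refault is a separate
question that this sketch does not answer.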

> 
> >  }
> >  
> > -static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > +static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +			       gpa_t gpa, void *data)
> >  {
> >  	pmd_t *pmd;
> >  	pte_t *pte;
> > @@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> >  	return 0;
> >  }
> >  
> > -static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > +static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +				    gpa_t gpa, void *data)
> >  {
> >  	pmd_t *pmd;
> >  	pte_t *pte;
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index 61505676d0853..af5f0f0eccef9 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> >  {
> >  	void *va = page_address(pfn_to_page(pfn));
> >  
> > -	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
> > +	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
> >  		kvm_flush_dcache_to_poc(va, size);
> > +		if (ipa_uncached)
> > +			set_memory_nc((unsigned long)va, size/PAGE_SIZE);
> 
> are you not setting the kernel mapping of the page to non-cached here,
> which doesn't affect your userspace mappings at all?

Oh crap. I shouldn't have tried to use change_memory_common... I
completely overlooked the fact I'm now using the wrong mm...

> 
> (this would explain why things still break with this series).

yeah, I wonder why it works so well?

> 
> > +	}
> >  
> >  	if (!icache_is_aliasing()) {		/* PIPT */
> >  		flush_icache_range((unsigned long)va,
> > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> > index f800d45ea2265..800730f7aa7d9 100644
> > --- a/arch/arm64/include/asm/memory.h
> > +++ b/arch/arm64/include/asm/memory.h
> > @@ -105,6 +105,7 @@
> >   * Memory types for Stage-2 translation
> >   */
> >  #define MT_S2_NORMAL		0xf
> > +#define MT_S2_NORMAL_NC		0x5
> >  #define MT_S2_DEVICE_nGnRE	0x1
> >  
> >  #ifndef __ASSEMBLY__
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 56283f8a675c5..a254925ce1b6b 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -78,6 +78,7 @@ extern void __pgd_error(const char *file, int line, unsigned long val);
> >  #define PAGE_HYP_DEVICE		__pgprot(PROT_DEVICE_nGnRE | PTE_HYP)
> >  
> >  #define PAGE_S2			__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)
> > +#define PAGE_S2_NORMAL_NC	__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL_NC) | PTE_S2_RDONLY)
> >  #define PAGE_S2_DEVICE		__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_UXN)
> >  
> >  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_TYPE_MASK) | PTE_PROT_NONE | PTE_PXN | PTE_UXN)
> > -- 
> > 2.1.0
> > 
> 
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
@ 2015-05-14 13:32       ` Andrew Jones
  0 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-14 13:32 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: ard.biesheuvel, marc.zyngier, catalin.marinas, qemu-devel,
	pbonzini, lersek, kvmarm

On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
> On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> > When S1 and S2 memory attributes combine wrt to caching policy,
> > non-cacheable types take precedence. If a guest maps a region as
> > device memory, which KVM userspace is using to emulate the device
> > using normal, cacheable memory, then we lose coherency. With
> > KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> > regions are likely to be problematic. With this patch, as pages
> > of these types of regions are faulted into the guest, not only do
> > we flush the page's dcache, but we also change userspace's
> > mapping to NC in order to maintain coherency.
> > 
> > What if the guest doesn't do what we expect? While we can't
> > force a guest to use cacheable memory, we can take advantage of
> > the non-cacheable precedence, and force it to use non-cacheable.
> > So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> > KVM_MEM_UNCACHED regions to force them to NC.
> > 
> > We now have both guest and userspace on the same page (pun intended)
> 
> I'd like to revisit the overall approach here.  Is doing non-cached
> accesses in both the guest and host really the right thing to do here?

I think so, but all ideas/approaches are still on the table. This is
still an RFC.

> 
> The semantics of the device becomes that it is cache coherent (because
> QEMU is), and I think Marc argued that Linux/UEFI should simply be
> adapted to handle whatever emulated devices we have as coherent.  I also
> remember someone arguing that would be wrong (Peter?).

I'm not really for quirking all devices in all guest types (AAVMF, Linux,
other bootloaders, other OSes). Windows is unlikely to apply any quirks.

> 
> Finally, does this address all cache coherency issues with emulated
> devices?  Some VOS guys had seen things still not working with this
> approach, unsure why...  I'd like to avoid us merging this only to merge
> a more complete solution in a few weeks which reverts this solution...

I'm not sure (this is still an RFT too :-) We definitely would need to
scatter some more memory_region_set_uncached() calls around QEMU first.

> 
> More comments/questions below:
> 
> > 
> > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > ---
> >  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
> >  arch/arm/include/asm/pgtable-3level.h |  1 +
> >  arch/arm/include/asm/pgtable.h        |  1 +
> >  arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
> >  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
> >  arch/arm64/include/asm/memory.h       |  1 +
> >  arch/arm64/include/asm/pgtable.h      |  1 +
> >  7 files changed, 36 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> > index 405aa18833073..e8034a80b12e5 100644
> > --- a/arch/arm/include/asm/kvm_mmu.h
> > +++ b/arch/arm/include/asm/kvm_mmu.h
> > @@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> >  	while (size) {
> >  		void *va = kmap_atomic_pfn(pfn);
> >  
> > -		if (need_flush)
> > +		if (need_flush) {
> >  			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> > +			if (ipa_uncached)
> > +				set_memory_nc((unsigned long)va, 1);
> 
> nit: consider moving this outside the need_flush
> 
> > +		}
> >  
> >  		if (icache_is_pipt())
> >  			__cpuc_coherent_user_range((unsigned long)va,
> > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> > index a745a2a53853c..39b3f7a40e663 100644
> > --- a/arch/arm/include/asm/pgtable-3level.h
> > +++ b/arch/arm/include/asm/pgtable-3level.h
> > @@ -121,6 +121,7 @@
> >   * 2nd stage PTE definitions for LPAE.
> >   */
> >  #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
> > +#define L_PTE_S2_MT_NORMAL_NC		(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
> >  #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
> >  #define L_PTE_S2_MT_WRITEBACK		(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
> >  #define L_PTE_S2_MT_DEV_SHARED		(_AT(pteval_t, 0x1) << 2) /* device */
> > diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> > index f40354198bad4..ae13ca8b0a23d 100644
> > --- a/arch/arm/include/asm/pgtable.h
> > +++ b/arch/arm/include/asm/pgtable.h
> > @@ -100,6 +100,7 @@ extern pgprot_t		pgprot_s2_device;
> >  #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
> >  #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
> >  #define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
> > +#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
> >  #define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
> >  
> >  #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
> > diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> > index bc1665acd73e7..6b3bd8061bd2a 100644
> > --- a/arch/arm/kvm/mmu.c
> > +++ b/arch/arm/kvm/mmu.c
> > @@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	struct vm_area_struct *vma;
> >  	pfn_t pfn;
> >  	pgprot_t mem_type = PAGE_S2;
> > -	bool fault_ipa_uncached;
> > +	bool fault_ipa_uncached = false;
> >  	bool logging_active = memslot_is_logging(memslot);
> >  	unsigned long flags = 0;
> >  
> > @@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  			writable = false;
> >  	}
> >  
> > +	if (memslot->flags & KVM_MEM_UNCACHED) {
> > +		mem_type = PAGE_S2_NORMAL_NC;
> > +		fault_ipa_uncached = true;
> > +	}
> > +
> >  	spin_lock(&kvm->mmu_lock);
> >  	if (mmu_notifier_retry(kvm, mmu_seq))
> >  		goto out_unlock;
> > @@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	if (!hugetlb && !force_pte)
> >  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> >  
> > -	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
> > -
> >  	if (hugetlb) {
> >  		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
> >  		new_pmd = pmd_mkhuge(new_pmd);
> > @@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> >  			     unsigned long start,
> >  			     unsigned long end,
> >  			     int (*handler)(struct kvm *kvm,
> > +					    struct kvm_memory_slot *slot,
> >  					    gpa_t gpa, void *data),
> >  			     void *data)
> >  {
> > @@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> >  
> >  		for (; gfn < gfn_end; ++gfn) {
> >  			gpa_t gpa = gfn << PAGE_SHIFT;
> > -			ret |= handler(kvm, gpa, data);
> > +			ret |= handler(kvm, memslot, gpa, data);
> >  		}
> >  	}
> >  
> >  	return ret;
> >  }
> >  
> > -static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > +static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +				 gpa_t gpa, void *data)
> 
> Maybe we should consider a pointer to a struct with the relevant data to
> pass around to the handler by now, which would allow us to get rid of
> the void * cast as well.  Not sure if it's worth it.
> 
> >  {
> >  	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
> >  	return 0;
> > @@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
> >  	return 0;
> >  }
> >  
> > -static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > +static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +				gpa_t gpa, void *data)
> >  {
> > -	pte_t *pte = (pte_t *)data;
> > +	pte_t pte = *((pte_t *)data);
> > +
> > +	if (slot->flags & KVM_MEM_UNCACHED)
> > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
> > +	else
> > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> >  
> >  	/*
> >  	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
> > @@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> >  	 * therefore stage2_set_pte() never needs to clear out a huge PMD
> >  	 * through this calling path.
> >  	 */
> > -	stage2_set_pte(kvm, NULL, gpa, pte, 0);
> > +	stage2_set_pte(kvm, NULL, gpa, &pte, 0);
> 
> this is making me feel like we should have a separate patch that changes
> stage2_set_pte from taking a pointer to just taking a value for the new
> pte.
> 
> >  	return 0;
> >  }
> >  
> > @@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> >  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> >  {
> >  	unsigned long end = hva + PAGE_SIZE;
> > -	pte_t stage2_pte;
> >  
> >  	if (!kvm->arch.pgd)
> >  		return;
> >  
> >  	trace_kvm_set_spte_hva(hva);
> > -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> > -	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> > +	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
> 
> hooking in here will make sure you catch changes to the page used for
> the mapping, but wouldn't that also mean that the userspace mapping
> would have been change, and where are you updating this?
> 
> Also, is this called if the userspace mapping is zapped without doing
> anything about the underlying page?  (how do we then catch when the
> userspace pte is populated again, and is this even possible?)

I was hoping that I only needed to worry about getting the S2 attributes
right here, and then, since the page will need to be refaulted into
the guest anyway, that the userspace part would get taken care of at
that point (in user_mem_abort). But, to be honest, I forgot to dig into
this deep enough to know if my hope will work or not.

> 
> >  }
> >  
> > -static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > +static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +			       gpa_t gpa, void *data)
> >  {
> >  	pmd_t *pmd;
> >  	pte_t *pte;
> > @@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> >  	return 0;
> >  }
> >  
> > -static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > +static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > +				    gpa_t gpa, void *data)
> >  {
> >  	pmd_t *pmd;
> >  	pte_t *pte;
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index 61505676d0853..af5f0f0eccef9 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> >  {
> >  	void *va = page_address(pfn_to_page(pfn));
> >  
> > -	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
> > +	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
> >  		kvm_flush_dcache_to_poc(va, size);
> > +		if (ipa_uncached)
> > +			set_memory_nc((unsigned long)va, size/PAGE_SIZE);
> 
> are you not setting the kernel mapping of the page to non-cached here,
> which doesn't affect your userspace mappings at all?

Oh crap. I shouldn't have tried to use change_memory_common... I
completely overlooked the fact I'm now using the wrong mm...

> 
> (this would explain why things still break with this series).

yeah, I wonder why it works so well?

> 
> > +	}
> >  
> >  	if (!icache_is_aliasing()) {		/* PIPT */
> >  		flush_icache_range((unsigned long)va,
> > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> > index f800d45ea2265..800730f7aa7d9 100644
> > --- a/arch/arm64/include/asm/memory.h
> > +++ b/arch/arm64/include/asm/memory.h
> > @@ -105,6 +105,7 @@
> >   * Memory types for Stage-2 translation
> >   */
> >  #define MT_S2_NORMAL		0xf
> > +#define MT_S2_NORMAL_NC		0x5
> >  #define MT_S2_DEVICE_nGnRE	0x1
> >  
> >  #ifndef __ASSEMBLY__
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 56283f8a675c5..a254925ce1b6b 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -78,6 +78,7 @@ extern void __pgd_error(const char *file, int line, unsigned long val);
> >  #define PAGE_HYP_DEVICE		__pgprot(PROT_DEVICE_nGnRE | PTE_HYP)
> >  
> >  #define PAGE_S2			__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)
> > +#define PAGE_S2_NORMAL_NC	__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL_NC) | PTE_S2_RDONLY)
> >  #define PAGE_S2_DEVICE		__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_UXN)
> >  
> >  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_TYPE_MASK) | PTE_PROT_NONE | PTE_PXN | PTE_UXN)
> > -- 
> > 2.1.0
> > 
> 
> 
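
For readers following the memory.h hunk quoted above, here is a minimal
sketch of how the three stage-2 memory type values (0xf, 0x5, 0x1) break
down under the ARMv8 stage-2 MemAttr[3:0] encoding. The helper name
s2_memattr_describe is hypothetical and is not part of the patch or the
kernel; only the three #define values come from the quoted diff.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Values copied from the quoted arch/arm64/include/asm/memory.h hunk. */
#define MT_S2_NORMAL        0xf
#define MT_S2_NORMAL_NC     0x5
#define MT_S2_DEVICE_nGnRE  0x1

/*
 * Hypothetical helper (not kernel code): describe a 4-bit stage-2
 * MemAttr value. Bits [3:2] select Device (00) or the outer
 * cacheability of Normal memory; bits [1:0] select the Device subtype
 * or the inner cacheability.
 */
static const char *s2_memattr_describe(unsigned int attr, char *buf, size_t len)
{
	static const char *const device[] = { "nGnRnE", "nGnRE", "nGRE", "GRE" };
	static const char *const cache[] = { "reserved", "NC", "WT", "WB" };
	unsigned int hi, lo;

	assert(attr <= 0xf);
	hi = (attr >> 2) & 0x3;
	lo = attr & 0x3;

	if (hi == 0)
		snprintf(buf, len, "Device-%s", device[lo]);
	else
		snprintf(buf, len, "Normal outer-%s inner-%s", cache[hi], cache[lo]);
	return buf;
}
```

Called with a caller-provided buffer, the helper shows that the new
MT_S2_NORMAL_NC value selects Normal memory that is non-cacheable at
both the inner and outer level, which is distinct from the existing
Device-nGnRE type.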

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 13:11         ` Peter Maydell
@ 2015-05-14 13:33           ` Laszlo Ersek
  -1 siblings, 0 replies; 102+ messages in thread
From: Laszlo Ersek @ 2015-05-14 13:33 UTC (permalink / raw)
  To: Peter Maydell, Andrew Jones
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, QEMU Developers,
	Alexander Graf, Paolo Bonzini, kvmarm, Christoffer Dall

On 05/14/15 15:11, Peter Maydell wrote:
> On 14 May 2015 at 14:03, Andrew Jones <drjones@redhat.com> wrote:
>> On Thu, May 14, 2015 at 11:37:46AM +0100, Peter Maydell wrote:
>>> On 14 May 2015 at 11:31, Andrew Jones <drjones@redhat.com> wrote:
>>>> Forgot to (4): switch from setting userspace's mapping to
>>>> device memory to normal, non-cacheable. Using device memory
>>>> caused a problem that Alex Graf found, and Peter Maydell suggested
>>>> using normal, non-cacheable instead.
>>>
>>> Did you check that non-cacheable is definitely the correct
>>> kind of Normal memory attribute we want? (ie not write-through).
>>
>> I was concerned that write-through wouldn't be sufficient. If the
>> guest writes to its non-cached memory, and QEMU needs to see what
>> it wrote, then won't write-through fail to work? Unless we
>> somehow invalidate the cache first?
> 
> Well, I meant more that the correct mapping for userspace is
> the same as the guest, whatever that is, and so somebody needs
> to look at what the guest actually does

I think Ard explored this earlier, by tracking the MAIRs and stuff.
Can't recall what the findings were though. (Or I could be simply
confused, sorry.)

Laszlo

> rather than merely
> hoping NormalNC is OK. (For instance, do we need to provide
> support for QEMU to map both NC and writethrough?)
> 
> -- PMM
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 13:11         ` Peter Maydell
@ 2015-05-14 13:36           ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-14 13:36 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, Alexander Graf,
	QEMU Developers, Paolo Bonzini, Laszlo Ersek, kvmarm,
	Christoffer Dall

On Thu, May 14, 2015 at 02:11:59PM +0100, Peter Maydell wrote:
> On 14 May 2015 at 14:03, Andrew Jones <drjones@redhat.com> wrote:
> > On Thu, May 14, 2015 at 11:37:46AM +0100, Peter Maydell wrote:
> >> On 14 May 2015 at 11:31, Andrew Jones <drjones@redhat.com> wrote:
> >> > Forgot to (4): switch from setting userspace's mapping to
> >> > device memory to normal, non-cacheable. Using device memory
> >> > caused a problem that Alex Graf found, and Peter Maydell suggested
> >> > using normal, non-cacheable instead.
> >>
> >> Did you check that non-cacheable is definitely the correct
> >> kind of Normal memory attribute we want? (ie not write-through).
> >
> > I was concerned that write-through wouldn't be sufficient. If the
> > guest writes to its non-cached memory, and QEMU needs to see what
> > it wrote, then won't write-through fail to work? Unless we
> > somehow invalidate the cache first?
> 
> Well, I meant more that the correct mapping for userspace is
> the same as the guest, whatever that is, and so somebody needs
> to look at what the guest actually does rather than merely
> hoping NormalNC is OK. (For instance, do we need to provide
> support for QEMU to map both NC and writethrough?)
>

Ah, we assume the guest is mapping it as device memory, and in
this version of the series, I ensure that it is at least NC with
the S2 attributes. I don't think we can look at what some guests
do with some devices to come up with anything beyond (poor?)
heuristics. I prefer that we force both the guest and QEMU to NC
(or guest chooses Device and QEMU is forced to NC) to make sure
we get it right.
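
The "at least NC" rule can be illustrated with a minimal sketch of how
the guest's stage-1 memory type combines with the stage-2 type set by
KVM. This is a simplified model, assuming the architecture's
"most restrictive wins" combining rule; it ignores Device subtypes,
the inner/outer split and shareability. The enum and function names
are hypothetical and do not appear in the series.

```c
#include <assert.h>

/*
 * Simplified memory types, ordered from most to least restrictive.
 * This ordering is what lets the combining rule be a plain minimum.
 */
enum mem_type {
	MEM_DEVICE = 0,
	MEM_NORMAL_NC = 1,
	MEM_NORMAL_WT = 2,
	MEM_NORMAL_WB = 3,
};

/*
 * Hypothetical model of stage-1/stage-2 combining: the resulting
 * type is whichever of the two is more restrictive.
 */
static enum mem_type combine_s1_s2(enum mem_type s1, enum mem_type s2)
{
	assert(s1 <= MEM_NORMAL_WB && s2 <= MEM_NORMAL_WB);
	return s1 < s2 ? s1 : s2;
}
```

With stage 2 set to MEM_NORMAL_NC, a guest that maps the region as
write-back ends up with non-cacheable accesses, while a guest that
maps it as Device keeps Device; those are the two cases described
above.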

drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-14 11:05     ` Christoffer Dall
@ 2015-05-14 13:46       ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-14 13:46 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> > Provide a method to change normal, cacheable memory to non-cacheable.
> > KVM will make use of this to keep emulated device memory regions
> > coherent with the guest.
> > 
> > Signed-off-by: Andrew Jones <drjones@redhat.com>
> 
> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> 
> But you obviously need Russell and Will/Catalin to ack/merge this.

I guess this patch is going to go away in the next round. You've
pointed out that I screwed stuff up royally with my overeagerness
to reuse code. I need to reimplement change_memory_common as a
version that takes an mm, which is more or less what I did in the
last version of this series, back when I was pinning pages.

drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 13:32                           ` Laszlo Ersek
@ 2015-05-14 13:48                             ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2015-05-14 13:48 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Peter Maydell, Andrew Jones, Ard Biesheuvel, Marc Zyngier,
	Catalin Marinas, QEMU Developers, Alexander Graf, Paolo Bonzini,
	Jérémy Fanguède, kvmarm, Christoffer Dall,
	Mario Smarduch

On Thu, May 14, 2015 at 03:32:10PM +0200, Laszlo Ersek wrote:
> On 05/14/15 15:00, Andrew Jones wrote:
> > On Thu, May 14, 2015 at 01:38:11PM +0100, Peter Maydell wrote:
> >> On 14 May 2015 at 13:28, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
> >>>
> >>> As an optimization, OS drivers can mark them as cacheable or
> >>> write-combining or something like that, but in general it's a safe
> >>> default to leave them uncached---one would think.
> >>
> >> Isn't this handled by the OS mapping them in the 'prefetchable'
> >> MMIO window rather than the 'non-prefetchable' one? (QEMU's
> >> generic-PCIe device doesn't yet support the prefetchable window.)
> > 
> > I was thinking (with my limited PCI knowledge) the same thing, and
> > was planning on experimenting with that.
> 
> This could be supported in UEFI as well, with the following steps:
> - the DTB that QEMU provides UEFI with should advertise such a
>   prefetchable window.
> - The driver in UEFI that parses the DTB should understand that DTB
>   node (well, record type), and store the appropriate base & size into
>   some new dynamic PCDs (= basically, firmware wide global variables;
>   PCD = platform configuration database)
> - The entry point of the host bridge driver would call
>   gDS->AddMemorySpace() twice, separately for the two different windows,
>   with their appropriate caching attributes.
> - The host bridge driver needs to be extended so that TypePMem32
>   requests are not rejected (like now); they should be handled
>   similarly to TypeMem32. Except, the gDS->AllocateMemorySpace() call
>   should allocate from the prefetchable range (determined by the new
>   PCDs above).
> - QEMU's emulated devices should then expose their BARs as prefetchable
>   (so that the above branch would be taken in the host bridge driver).
> 
> (Of course, if QEMU intends to emulate PCI devices somewhat
> realistically, then QEMU should claim "non-prefetchable" for BARs that
> would not be prefetchable on physical hardware either, and then the
> hypervisor should accommodate the firmware's UC mapping and say "hey I
> know better, we're virtual in fact", and override the attribute (-> use
> WB instead of UC). With which we'd be back to square one...)
> 
> Thanks
> Laszlo

Prefetchable is unrelated to BAR caching or drivers; it's a way to tell
host bridges they can do limited tweaks to downstream transactions in a
specific range.

Really, non-prefetchable BARs are mostly those where a read has
side effects, which is best avoided. This does not mean it's OK to
reorder transactions or cache them.
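
For reference, a minimal sketch of where the prefetchable bit lives in
a memory BAR, following the low-bit layout of a PCI base address
register. The macro and function names are hypothetical, chosen for
this illustration; the bit positions are the standard ones. Note that
nothing in these bits describes CPU cacheability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bits of a PCI base address register. */
#define BAR_SPACE_IO       0x1u /* bit 0: 1 = I/O space, 0 = memory space */
#define BAR_MEM_TYPE_MASK  0x6u /* bits 2:1: 00 = 32-bit, 10 = 64-bit */
#define BAR_MEM_TYPE_64    0x4u
#define BAR_MEM_PREFETCH   0x8u /* bit 3: prefetchable (memory BARs only) */

static bool bar_is_memory(uint32_t bar)
{
	return (bar & BAR_SPACE_IO) == 0;
}

/* Bit 3 only has the "prefetchable" meaning for memory BARs. */
static bool bar_is_prefetchable(uint32_t bar)
{
	return bar_is_memory(bar) && (bar & BAR_MEM_PREFETCH) != 0;
}

static bool bar_is_64bit(uint32_t bar)
{
	return bar_is_memory(bar) &&
	       (bar & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}
```

A raw BAR value of 0xfe00000c, for example, decodes as a 64-bit
prefetchable memory BAR, and that is all the register says: whether
the CPU maps the range cached or uncached is decided elsewhere.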

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 13:48                             ` Michael S. Tsirkin
@ 2015-05-14 14:19                               ` Laszlo Ersek
  -1 siblings, 0 replies; 102+ messages in thread
From: Laszlo Ersek @ 2015-05-14 14:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Andrew Jones, Ard Biesheuvel, Marc Zyngier,
	Catalin Marinas, QEMU Developers, Alexander Graf, Paolo Bonzini,
	Jérémy Fanguède, kvmarm, Christoffer Dall,
	Mario Smarduch

On 05/14/15 15:48, Michael S. Tsirkin wrote:
> On Thu, May 14, 2015 at 03:32:10PM +0200, Laszlo Ersek wrote:
>> On 05/14/15 15:00, Andrew Jones wrote:
>>> On Thu, May 14, 2015 at 01:38:11PM +0100, Peter Maydell wrote:
>>>> On 14 May 2015 at 13:28, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
>>>>>
>>>>> As an optimization, OS drivers can mark them as cacheable or
>>>>> write-combining or something like that, but in general it's a safe
>>>>> default to leave them uncached---one would think.
>>>>
>>>> Isn't this handled by the OS mapping them in the 'prefetchable'
>>>> MMIO window rather than the 'non-prefetchable' one? (QEMU's
>>>> generic-PCIe device doesn't yet support the prefetchable window.)
>>>
>>> I was thinking (with my limited PCI knowledge) the same thing, and
>>> was planning on experimenting with that.
>>
>> This could be supported in UEFI as well, with the following steps:
>> - the DTB that QEMU provides UEFI with should advertise such a
>>   prefetchable window.
>> - The driver in UEFI that parses the DTB should understand that DTB
>>   node (well, record type), and store the appropriate base & size into
>>   some new dynamic PCDs (= basically, firmware wide global variables;
>>   PCD = platform configuration database)
>> - The entry point of the host bridge driver would call
>>   gDS->AddMemorySpace() twice, separately for the two different windows,
>>   with their appropriate caching attributes.
>> - The host bridge driver needs to be extended so that TypePMem32
>>   requests are not rejected (like now); they should be handled
>>   similarly to TypeMem32. Except, the gDS->AllocateMemorySpace() call
>>   should allocate from the prefetchable range (determined by the new
>>   PCDs above).
>> - QEMU's emulated devices should then expose their BARs as prefetchable
>>   (so that the above branch would be taken in the host bridge driver).
>>
>> (Of course, if QEMU intends to emulate PCI devices somewhat
>> realistically, then QEMU should claim "non-prefetchable" for BARs that
>> would not be prefetchable on physical hardware either, and then the
>> hypervisor should accommodate the firmware's UC mapping and say "hey I
>> know better, we're virtual in fact", and override the attribute (-> use
>> WB instead of UC). With which we'd be back to square one...)
>>
>> Thanks
>> Laszlo
> 
> Prefetchable is unrelated to BAR caching or drivers; it's a way to tell
> host bridges they can do limited tweaks to downstream transactions in a
> specific range.
> 
> Really, non-prefetchable BARs are mostly those where a read has
> side effects, which is best avoided. This does not mean it's OK to
> reorder transactions or cache them.

I believe I understood that (although certainly not in the depth that
you do), because when the idea had come up first (i.e. equating cacheable
with prefetchable, or at least "repurposing" the latter for the former)
I had tried to read up on prefetchable (just on the web; no time for
reading the PCI spec. ... I peeked now, it also mentions "write merging"
for bridges.) The way I perceived it, the idea was to give the guest a
hint about caching with the prefetchable bit / DTB entry. Sorry if I was
mistaken.

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 14:19                               ` Laszlo Ersek
@ 2015-05-14 14:41                                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2015-05-14 14:41 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Peter Maydell, Andrew Jones, Ard Biesheuvel, Marc Zyngier,
	Catalin Marinas, QEMU Developers, Alexander Graf, Paolo Bonzini,
	Jérémy Fanguède, kvmarm, Christoffer Dall,
	Mario Smarduch

On Thu, May 14, 2015 at 04:19:23PM +0200, Laszlo Ersek wrote:
> On 05/14/15 15:48, Michael S. Tsirkin wrote:
> > On Thu, May 14, 2015 at 03:32:10PM +0200, Laszlo Ersek wrote:
> >> On 05/14/15 15:00, Andrew Jones wrote:
> >>> On Thu, May 14, 2015 at 01:38:11PM +0100, Peter Maydell wrote:
> >>>> On 14 May 2015 at 13:28, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>>> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
> >>>>>
> >>>>> As an optimization, OS drivers can mark them as cacheable or
> >>>>> write-combining or something like that, but in general it's a safe
> >>>>> default to leave them uncached---one would think.
> >>>>
> >>>> Isn't this handled by the OS mapping them in the 'prefetchable'
> >>>> MMIO window rather than the 'non-prefetchable' one? (QEMU's
> >>>> generic-PCIe device doesn't yet support the prefetchable window.)
> >>>
> >>> I was thinking (with my limited PCI knowledge) the same thing, and
> >>> was planning on experimenting with that.
> >>
> >> This could be supported in UEFI as well, with the following steps:
> >> - the DTB that QEMU provides UEFI with should advertise such a
> >>   prefetchable window.
> >> - The driver in UEFI that parses the DTB should understand that DTB
> >>   node (well, record type), and store the appropriate base & size into
> >>   some new dynamic PCDs (= basically, firmware wide global variables;
> >>   PCD = platform configuration database)
> >> - The entry point of the host bridge driver would call
> >>   gDS->AddMemorySpace() twice, separately for the two different windows,
> >>   with their appropriate caching attributes.
> >> - The host bridge driver needs to be extended so that TypePMem32
> >>   requests are not rejected (like now); they should be handled
> >>   similarly to TypeMem32. Except, the gDS->AllocateMemorySpace() call
> >>   should allocate from the prefetchable range (determined by the new
> >>   PCDs above).
> >> - QEMU's emulated devices should then expose their BARs as prefetchable
> >>   (so that the above branch would be taken in the host bridge driver).
> >>
> >> (Of course, if QEMU intends to emulate PCI devices somewhat
> >> realistically, then QEMU should claim "non-prefetchable" for BARs that
> >> would not be prefetchable on physical hardware either, and then the
> >> hypervisor should accommodate the firmware's UC mapping and say "hey I
> >> know better, we're virtual in fact", and override the attribute (-> use
> >> WB instead of UC). With which we'd be back to square one...)
> >>
> >> Thanks
> >> Laszlo
> > 
> > Prefetchable is unrelated to BAR caching or drivers; it's a way to tell
> > host bridges they can do limited tweaks to downstream transactions in a
> > specific range.
> > 
> > Really, non-prefetchable BARs are mostly those where a read has
> > side effects, which is best avoided. This does not mean it's OK to
> > reorder transactions or cache them.
> 
> I believe I understood that (although certainly not in the depth that
> you do), because when the idea had come up first (i.e. equating cacheable
> with prefetchable, or at least "repurposing" the latter for the former)
> I had tried to read up on prefetchable (just on the web; no time for
> reading the PCI spec. ... I peeked now, it also mentions "write merging"
> for bridges.)

Read up on what it is if you like; it is much weaker than WC, not to
mention cacheable.

> The way I perceived it, the idea was to give the guest a
> hint about caching with the prefetchable bit / DTB entry. Sorry if I was
> mistaken.
> 
> Thanks
> Laszlo

And what I am saying is that using the prefetchable bit would be a PV
solution: on real devices it is not a hint about caching and can't be
used as such.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
@ 2015-05-14 14:41                                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 102+ messages in thread
From: Michael S. Tsirkin @ 2015-05-14 14:41 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Ard Biesheuvel, Marc Zyngier, Catalin Marinas, QEMU Developers,
	Paolo Bonzini, kvmarm

On Thu, May 14, 2015 at 04:19:23PM +0200, Laszlo Ersek wrote:
> On 05/14/15 15:48, Michael S. Tsirkin wrote:
> > On Thu, May 14, 2015 at 03:32:10PM +0200, Laszlo Ersek wrote:
> >> On 05/14/15 15:00, Andrew Jones wrote:
> >>> On Thu, May 14, 2015 at 01:38:11PM +0100, Peter Maydell wrote:
> >>>> On 14 May 2015 at 13:28, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>>>> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
> >>>>>
> >>>>> As an optimization, OS drivers can mark them as cacheable or
> >>>>> write-combining or something like that, but in general it's a safe
> >>>>> default to leave them uncached---one would think.
> >>>>
> >>>> Isn't this handled by the OS mapping them in the 'prefetchable'
> >>>> MMIO window rather than the 'non-prefetchable' one? (QEMU's
> >>>> generic-PCIe device doesn't yet support the prefetchable window.)
> >>>
> >>> I was thinking (with my limited PCI knowledge) the same thing, and
> >>> was planning on experimenting with that.
> >>
> >> This could be supported in UEFI as well, with the following steps:
> >> - the DTB that QEMU provides UEFI with should advertise such a
> >>   prefetchable window.
> >> - The driver in UEFI that parses the DTB should understand that DTB
> >>   node (well, record type), and store the appropriate base & size into
> >>   some new dynamic PCDs (= basically, firmware wide global variables;
> >>   PCD = platform configuration database)
> >> - The entry point of the host bridge driver would call
> >>   gDS->AddMemorySpace() twice, separately for the two different windows,
> >>   with their appropriate caching attributes.
> >> - The host bridge driver needs to be extended so that TypePMem32
> >>   requests are not rejected (like now); they should be handled
> >>   similarly to TypeMem32. Except, the gDS->AllocateMemorySpace() call
> >>   should allocate from the prefetchable range (determined by the new
> >>   PCDs above).
> >> - QEMU's emulated devices should then expose their BARs as prefetchable
> >>   (so that the above branch would be taken in the host bridge driver).
> >>
> >> (Of course, if QEMU intends to emulate PCI devices somewhat
> >> realistically, then QEMU should claim "non-prefetchable" for BARs that
> >> would not be prefetchable on physical hardware either, and then the
> >> hypervisor should accommodate the firmware's UC mapping and say "hey I
> >> know better, we're virtual in fact", and override the attribute (-> use
> >> WB instead of UC). With which we'd be back to square one...)
> >>
> >> Thanks
> >> Laszlo
> > 
> > Prefetchable is unrelated to BAR caching or drivers; it's a way to tell
> > host bridges they can do limited tweaks to downstream transactions in a
> > specific range.
> > 
> > Really, non-prefetchable BARs are mostly those where a read has
> > side-effects, which is best avoided. This does not mean it's ok to
> > reorder transactions or cache them.
> 
> I believe I understood that (although certainly not in the depth that
> you do), because when the idea had come up first (ie. equating cacheable
> with prefetchable, or at least "repurposing" the latter for the former)
> I had tried to read up on prefetchable (just on the web; no time for
> reading the PCI spec. ... I peeked now, it also mentions "write merging"
> for bridges.)

Read up on what it is if you like; it is much weaker than WC, not to
mention cacheable.

> The way I perceived it, the idea was to give the guest a
> hint about caching with the prefetchable bit / DTB entry. Sorry if I was
> mistaken.
> 
> Thanks
> Laszlo

And what I am saying is that the prefetchable bit would be a PV solution -
on real devices it is not a hint about caching and can't be used as
such.

-- 
MST

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 14:41                                 ` Michael S. Tsirkin
@ 2015-05-15  9:00                                   ` Ard Biesheuvel
  -1 siblings, 0 replies; 102+ messages in thread
From: Ard Biesheuvel @ 2015-05-15  9:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Maydell, Andrew Jones, Marc Zyngier, Catalin Marinas,
	QEMU Developers, Alexander Graf, Paolo Bonzini,
	Jérémy Fanguède, Laszlo Ersek, kvmarm,
	Christoffer Dall, Mario Smarduch

On 14 May 2015 at 16:41, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, May 14, 2015 at 04:19:23PM +0200, Laszlo Ersek wrote:
>> On 05/14/15 15:48, Michael S. Tsirkin wrote:
>> > On Thu, May 14, 2015 at 03:32:10PM +0200, Laszlo Ersek wrote:
>> >> On 05/14/15 15:00, Andrew Jones wrote:
>> >>> On Thu, May 14, 2015 at 01:38:11PM +0100, Peter Maydell wrote:
>> >>>> On 14 May 2015 at 13:28, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> >>>>> Well, PCI BARs are generally MMIO resources, and hence should not be cached.
>> >>>>>
>> >>>>> As an optimization, OS drivers can mark them as cacheable or
>> >>>>> write-combining or something like that, but in general it's a safe
>> >>>>> default to leave them uncached---one would think.
>> >>>>
>> >>>> Isn't this handled by the OS mapping them in the 'prefetchable'
>> >>>> MMIO window rather than the 'non-prefetchable' one? (QEMU's
>> >>>> generic-PCIe device doesn't yet support the prefetchable window.)
>> >>>
>> >>> I was thinking (with my limited PCI knowledge) the same thing, and
>> >>> was planning on experimenting with that.
>> >>
>> >> This could be supported in UEFI as well, with the following steps:
>> >> - the DTB that QEMU provides UEFI with should advertise such a
>> >>   prefetchable window.
>> >> - The driver in UEFI that parses the DTB should understand that DTB
>> >>   node (well, record type), and store the appropriate base & size into
>> >>   some new dynamic PCDs (= basically, firmware wide global variables;
>> >>   PCD = platform configuration database)
>> >> - The entry point of the host bridge driver would call
>> >>   gDS->AddMemorySpace() twice, separately for the two different windows,
>> >>   with their appropriate caching attributes.
>> >> - The host bridge driver needs to be extended so that TypePMem32
>> >>   requests are not rejected (like now); they should be handled
>> >>   similarly to TypeMem32. Except, the gDS->AllocateMemorySpace() call
>> >>   should allocate from the prefetchable range (determined by the new
>> >>   PCDs above).
>> >> - QEMU's emulated devices should then expose their BARs as prefetchable
>> >>   (so that the above branch would be taken in the host bridge driver).
>> >>
>> >> (Of course, if QEMU intends to emulate PCI devices somewhat
>> >> realistically, then QEMU should claim "non-prefetchable" for BARs that
>> >> would not be prefetchable on physical hardware either, and then the
>> >> hypervisor should accommodate the firmware's UC mapping and say "hey I
>> >> know better, we're virtual in fact", and override the attribute (-> use
>> >> WB instead of UC). With which we'd be back to square one...)
>> >>
>> >> Thanks
>> >> Laszlo
>> >
>> > Prefetchable is unrelated to BAR caching or drivers; it's a way to tell
>> > host bridges they can do limited tweaks to downstream transactions in a
>> > specific range.
>> >
>> > Really, non-prefetchable BARs are mostly those where a read has
>> > side-effects, which is best avoided. This does not mean it's ok to
>> > reorder transactions or cache them.
>>
>> I believe I understood that (although certainly not in the depth that
>> you do), because when the idea had come up first (ie. equating cacheable
>> with prefetchable, or at least "repurposing" the latter for the former)
>> I had tried to read up on prefetchable (just on the web; no time for
>> reading the PCI spec. ... I peeked now, it also mentions "write merging"
>> for bridges.)
>
> Read up on what it is if you like; it is much weaker than WC, not to
> mention cacheable.
>
>> The way I perceived it, the idea was to give the guest a
>> hint about caching with the prefetchable bit / DTB entry. Sorry if I was
>> mistaken.
>>
>> Thanks
>> Laszlo
>
> And what I am saying is that the prefetchable bit would be a PV solution -
> on real devices it is not a hint about caching and can't be used as
> such.
>

On a general note, may I point out that while this discussion now
focuses heavily on PCI and its metadata that could potentially
describe the cached/uncached nature of a region, there are other
emulated devices that are affected as well. Most notably, there is the
emulated NOR flash which is backed by a read-only memslot while in
array mode, but treated as a device by the guest and hence mapped
uncached. Since the NOR flash contains the executable image of the
firmware (in case of UEFI), it must be backed by actual host RAM or
the CPU won't be able to fetch instructions from it (since instruction
fetches cannot be emulated like ordinary loads and stores). On the
other hand, since the guest treats it as a ROM, it is totally
oblivious of any caching concerns that may exist.
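
To make the read-only memslot concrete, here is a minimal sketch of how a
VMM might back the flash with host RAM while it is in array mode. It uses
the real KVM_SET_USER_MEMORY_REGION ioctl and KVM_MEM_READONLY flag from
<linux/kvm.h>; the helper names flash_array_mode_slot() and
flash_enter_array_mode(), and the choice of arguments, are assumptions
made up for illustration and are not QEMU's code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: describe a read-only memslot that backs the flash
 * with ordinary host memory. The guest can then read and execute from it
 * directly, without each access being emulated.
 */
static struct kvm_userspace_memory_region
flash_array_mode_slot(uint32_t slot, uint64_t guest_pa, uint64_t size,
                      void *host_va)
{
    struct kvm_userspace_memory_region r;

    assert(size > 0);                 /* a size of 0 would delete the slot */
    memset(&r, 0, sizeof(r));
    r.slot = slot;
    r.flags = KVM_MEM_READONLY;       /* guest writes are not applied to RAM */
    r.guest_phys_addr = guest_pa;
    r.memory_size = size;
    r.userspace_addr = (uint64_t)(uintptr_t)host_va;
    return r;
}

/* Hypothetical helper: install the slot on an already-created VM fd. */
static int flash_enter_array_mode(int vm_fd,
                                  const struct kvm_userspace_memory_region *r)
{
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, r);
}
```

With KVM_MEM_READONLY set, guest reads and instruction fetches are served
from the host pages, while guest writes exit to userspace as KVM_EXIT_MMIO.
Nothing in this interface tells KVM how the guest maps the region, which is
the gap described above.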

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-14 13:46       ` Andrew Jones
@ 2015-05-15 14:51         ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-15 14:51 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Thu, May 14, 2015 at 03:46:44PM +0200, Andrew Jones wrote:
> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> > On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> > > Provide a method to change normal, cacheable memory to non-cacheable.
> > > KVM will make use of this to keep emulated device memory regions
> > > coherent with the guest.
> > > 
> > > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > 
> > Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> > 
> > But you obviously need Russell and Will/Catalin to ack/merge this.
> 
> I guess this patch is going to go away in the next round. You've
> pointed out that I screwed stuff up royally with my over eagerness
> to reuse code. I need to reimplement change_memory_common, but a
> version that takes an mm, which is more or less what I did in the
> last version of this series, back when I was pinning pages.
> 
Yeah, I just read this one before looking at the others because it was a
simple one...

-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
  2015-05-14 13:32       ` Andrew Jones
@ 2015-05-15 15:02         ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-15 15:02 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	agraf, qemu-devel, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
> On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
> > On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> > > When S1 and S2 memory attributes combine wrt caching policy,
> > > non-cacheable types take precedence. If a guest maps a region as
> > > device memory, which KVM userspace is using to emulate the device
> > > using normal, cacheable memory, then we lose coherency. With
> > > KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> > > regions are likely to be problematic. With this patch, as pages
> > > of these types of regions are faulted into the guest, not only do
> > > we flush the page's dcache, but we also change userspace's
> > > mapping to NC in order to maintain coherency.
> > > 
> > > What if the guest doesn't do what we expect? While we can't
> > > force a guest to use cacheable memory, we can take advantage of
> > > the non-cacheable precedence, and force it to use non-cacheable.
> > > So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> > > KVM_MEM_UNCACHED regions to force them to NC.
> > > 
> > > We now have both guest and userspace on the same page (pun intended)
> > 
> > I'd like to revisit the overall approach here.  Is doing non-cached
> > accesses in both the guest and host really the right thing to do here?
> 
> I think so, but all ideas/approaches are still on the table. This is
> still an RFC.
> 
> > 
> > The semantics of the device becomes that it is cache coherent (because
> > QEMU is), and I think Marc argued that Linux/UEFI should simply be
> > adapted to handle whatever emulated devices we have as coherent.  I also
> > remember someone arguing that would be wrong (Peter?).
> 
> I'm not really for quirking all devices in all guest types (AAVMF, Linux,
> other bootloaders, other OSes). Windows is unlikely to apply any quirks.
> 

Well, my point was that if we're emulating a platform with coherent IO
memory for PCI devices, that is something the guest should work with
as such. But as Paolo explained, it should always be safe for a guest to
assume non-coherent, so that doesn't work.
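
For reference, the combining rule from the commit message quoted above
("non-cacheable types take precedence") can be modelled in a few lines.
This is a simplified sketch, not kernel code: it assumes the more
restrictive of the two attributes always wins and collapses all Device
types into one. The enum and function names are hypothetical.

```c
#include <assert.h>

/* Memory types ordered from most restrictive to least restrictive. */
enum mem_type {
    MT_DEVICE,      /* any Device type */
    MT_NORMAL_NC,   /* Normal, non-cacheable */
    MT_NORMAL_WT,   /* Normal, write-through */
    MT_NORMAL_WB,   /* Normal, write-back */
    MT_COUNT
};

/* Effective type seen by the guest: the stricter of stage 1 and stage 2. */
static enum mem_type combine_s1_s2(enum mem_type s1, enum mem_type s2)
{
    assert(s1 < MT_COUNT && s2 < MT_COUNT);
    return s1 < s2 ? s1 : s2;
}
```

In this model a guest stage-1 Device mapping over a write-back stage-2
mapping yields Device, while userspace keeps accessing the same page
through a write-back mapping, hence the lost coherency. Forcing stage 2 to
MT_NORMAL_NC makes the result non-cacheable at best, whatever the guest
chooses at stage 1.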

> > 
> > Finally, does this address all cache coherency issues with emulated
> > devices?  Some VOS guys had seen things still not working with this
> > approach, unsure why...  I'd like to avoid us merging this only to merge
> > a more complete solution in a few weeks which reverts this solution...
> 
> I'm not sure (this is still an RFT too :-) We definitely would need to
> scatter some more memory_region_set_uncached() calls around QEMU first.
> 

It would be good if you could sync with the VOS guys and make sure your
patch set addresses their issues with the appropriate
memory_region_set_uncached() calls added to QEMU, and if it does not, give
some vague idea why that falls outside the scope of this patch set.  After
all, adding a USB controller to a VM is not that esoteric a use case,
is it?

> > 
> > More comments/questions below:
> > 
> > > 
> > > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > > ---
> > >  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
> > >  arch/arm/include/asm/pgtable-3level.h |  1 +
> > >  arch/arm/include/asm/pgtable.h        |  1 +
> > >  arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
> > >  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
> > >  arch/arm64/include/asm/memory.h       |  1 +
> > >  arch/arm64/include/asm/pgtable.h      |  1 +
> > >  7 files changed, 36 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> > > index 405aa18833073..e8034a80b12e5 100644
> > > --- a/arch/arm/include/asm/kvm_mmu.h
> > > +++ b/arch/arm/include/asm/kvm_mmu.h
> > > @@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> > >  	while (size) {
> > >  		void *va = kmap_atomic_pfn(pfn);
> > >  
> > > -		if (need_flush)
> > > +		if (need_flush) {
> > >  			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> > > +			if (ipa_uncached)
> > > +				set_memory_nc((unsigned long)va, 1);
> > 
> > nit: consider moving this outside the need_flush
> > 
> > > +		}
> > >  
> > >  		if (icache_is_pipt())
> > >  			__cpuc_coherent_user_range((unsigned long)va,
> > > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> > > index a745a2a53853c..39b3f7a40e663 100644
> > > --- a/arch/arm/include/asm/pgtable-3level.h
> > > +++ b/arch/arm/include/asm/pgtable-3level.h
> > > @@ -121,6 +121,7 @@
> > >   * 2nd stage PTE definitions for LPAE.
> > >   */
> > >  #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
> > > +#define L_PTE_S2_MT_NORMAL_NC		(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
> > >  #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
> > >  #define L_PTE_S2_MT_WRITEBACK		(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
> > >  #define L_PTE_S2_MT_DEV_SHARED		(_AT(pteval_t, 0x1) << 2) /* device */
> > > diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> > > index f40354198bad4..ae13ca8b0a23d 100644
> > > --- a/arch/arm/include/asm/pgtable.h
> > > +++ b/arch/arm/include/asm/pgtable.h
> > > @@ -100,6 +100,7 @@ extern pgprot_t		pgprot_s2_device;
> > >  #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
> > >  #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
> > >  #define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
> > > +#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
> > >  #define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
> > >  
> > >  #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
> > > diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> > > index bc1665acd73e7..6b3bd8061bd2a 100644
> > > --- a/arch/arm/kvm/mmu.c
> > > +++ b/arch/arm/kvm/mmu.c
> > > @@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  	struct vm_area_struct *vma;
> > >  	pfn_t pfn;
> > >  	pgprot_t mem_type = PAGE_S2;
> > > -	bool fault_ipa_uncached;
> > > +	bool fault_ipa_uncached = false;
> > >  	bool logging_active = memslot_is_logging(memslot);
> > >  	unsigned long flags = 0;
> > >  
> > > @@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  			writable = false;
> > >  	}
> > >  
> > > +	if (memslot->flags & KVM_MEM_UNCACHED) {
> > > +		mem_type = PAGE_S2_NORMAL_NC;
> > > +		fault_ipa_uncached = true;
> > > +	}
> > > +
> > >  	spin_lock(&kvm->mmu_lock);
> > >  	if (mmu_notifier_retry(kvm, mmu_seq))
> > >  		goto out_unlock;
> > > @@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  	if (!hugetlb && !force_pte)
> > >  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> > >  
> > > -	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
> > > -
> > >  	if (hugetlb) {
> > >  		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
> > >  		new_pmd = pmd_mkhuge(new_pmd);
> > > @@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> > >  			     unsigned long start,
> > >  			     unsigned long end,
> > >  			     int (*handler)(struct kvm *kvm,
> > > +					    struct kvm_memory_slot *slot,
> > >  					    gpa_t gpa, void *data),
> > >  			     void *data)
> > >  {
> > > @@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> > >  
> > >  		for (; gfn < gfn_end; ++gfn) {
> > >  			gpa_t gpa = gfn << PAGE_SHIFT;
> > > -			ret |= handler(kvm, gpa, data);
> > > +			ret |= handler(kvm, memslot, gpa, data);
> > >  		}
> > >  	}
> > >  
> > >  	return ret;
> > >  }
> > >  
> > > -static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				 gpa_t gpa, void *data)
> > 
> > Maybe we should consider a pointer to a struct with the relevant data to
> > pass around to the handler by now, which would allow us to get rid of
> > the void * cast as well.  Not sure if it's worth it.
> > 
> > >  {
> > >  	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
> > >  	return 0;
> > > @@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
> > >  	return 0;
> > >  }
> > >  
> > > -static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				gpa_t gpa, void *data)
> > >  {
> > > -	pte_t *pte = (pte_t *)data;
> > > +	pte_t pte = *((pte_t *)data);
> > > +
> > > +	if (slot->flags & KVM_MEM_UNCACHED)
> > > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
> > > +	else
> > > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> > >  
> > >  	/*
> > >  	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
> > > @@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  	 * therefore stage2_set_pte() never needs to clear out a huge PMD
> > >  	 * through this calling path.
> > >  	 */
> > > -	stage2_set_pte(kvm, NULL, gpa, pte, 0);
> > > +	stage2_set_pte(kvm, NULL, gpa, &pte, 0);
> > 
> > this is making me feel like we should have a separate patch that changes
> > stage2_set_pte from taking a pointer to just taking a value for the new
> > pte.
> > 
> > >  	return 0;
> > >  }
> > >  
> > > @@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> > >  {
> > >  	unsigned long end = hva + PAGE_SIZE;
> > > -	pte_t stage2_pte;
> > >  
> > >  	if (!kvm->arch.pgd)
> > >  		return;
> > >  
> > >  	trace_kvm_set_spte_hva(hva);
> > > -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> > > -	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> > > +	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
> > 
> > hooking in here will make sure you catch changes to the page used for
> > the mapping, but wouldn't that also mean that the userspace mapping
> > > would have been changed, and where are you updating this?
> > 
> > Also, is this called if the userspace mapping is zapped without doing
> > anything about the underlying page?  (how do we then catch when the
> > userspace pte is populated again, and is this even possible?)
> 
> I was hoping that I only needed to worry about getting the S2 attributes
> right here, and then, since the page will need to be refaulted into
> the guest anyway, that the userspace part would get taken care of at
> that point (in user_mem_abort).

user_mem_abort handles stage-2 page table faults, which have (almost)
nothing to do with the user space page tables.

I think it's entirely possible to have a stage-2 mapping to a page,
which is no longer mapped to userspace at all.  Or do we pin the
userspace PTEs (not just keep a reference on the physical page) during
fault handling?

> But, to be honest, I forgot to dig into
> this deep enough to know if my hope will work or not.
> 

You really need to work this out so you feel confident with the overall
scheme here, then I can try to see if I can break it under review, but I
think the author of this patch must know how it is *supposed* to work ;)

> > 
> > >  }
> > >  
> > > -static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +			       gpa_t gpa, void *data)
> > >  {
> > >  	pmd_t *pmd;
> > >  	pte_t *pte;
> > > @@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  	return 0;
> > >  }
> > >  
> > > -static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				    gpa_t gpa, void *data)
> > >  {
> > >  	pmd_t *pmd;
> > >  	pte_t *pte;
> > > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > > index 61505676d0853..af5f0f0eccef9 100644
> > > --- a/arch/arm64/include/asm/kvm_mmu.h
> > > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > @@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> > >  {
> > >  	void *va = page_address(pfn_to_page(pfn));
> > >  
> > > -	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
> > > +	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
> > >  		kvm_flush_dcache_to_poc(va, size);
> > > +		if (ipa_uncached)
> > > +			set_memory_nc((unsigned long)va, size/PAGE_SIZE);
> > 
> > are you not setting the kernel mapping of the page to non-cached here,
> > which doesn't affect your userspace mappings at all?
> 
> Oh crap. I shouldn't have tried to use change_memory_common... I
> completely overlooked the fact I'm now using the wrong mm...
> 

and the wrong va...
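
A toy model of the problem, not kernel code: the same physical page is
reachable through the kernel's own mapping and through the userspace
(QEMU) mapping, and each mapping carries its own attributes. All names
below are hypothetical; toy_set_nc() merely stands in for an attribute
change that is applied to one mapping only.

```c
#include <assert.h>
#include <stdbool.h>

enum attr { ATTR_WB, ATTR_NC };

/* One page-table entry: which page it maps and with what attribute. */
struct mapping {
    unsigned long pfn;
    enum attr attr;
};

/* Stand-in for an attribute change; it touches only the entry it is given. */
static void toy_set_nc(struct mapping *m)
{
    m->attr = ATTR_NC;
}

/* Two aliases of one page are only coherent if their attributes match. */
static bool aliases_agree(const struct mapping *a, const struct mapping *b)
{
    assert(a->pfn == b->pfn);
    return a->attr == b->attr;
}
```

Changing the kernel alias leaves the userspace alias write-back, so in
this model the two still disagree after the call; the change has to be
made on the userspace mapping, in the process's mm, to have the intended
effect.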

> > 
> > (this would explain why things still break with this series).
> 
> yeah, I wonder why it works so well?
> 

My guess would be luck, or that the change slowed things down or reordered
operations enough to make the problem less likely.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
@ 2015-05-15 15:02         ` Christoffer Dall
  0 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-15 15:02 UTC (permalink / raw)
  To: Andrew Jones
  Cc: ard.biesheuvel, marc.zyngier, catalin.marinas, qemu-devel,
	pbonzini, lersek, kvmarm

On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
> On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
> > On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> > > When S1 and S2 memory attributes combine wrt caching policy,
> > > non-cacheable types take precedence. If a guest maps a region as
> > > device memory, which KVM userspace is using to emulate the device
> > > using normal, cacheable memory, then we lose coherency. With
> > > KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> > > regions are likely to be problematic. With this patch, as pages
> > > of these types of regions are faulted into the guest, not only do
> > > we flush the page's dcache, but we also change userspace's
> > > mapping to NC in order to maintain coherency.
> > > 
> > > What if the guest doesn't do what we expect? While we can't
> > > force a guest to use cacheable memory, we can take advantage of
> > > the non-cacheable precedence, and force it to use non-cacheable.
> > > So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> > > KVM_MEM_UNCACHED regions to force them to NC.
> > > 
> > > We now have both guest and userspace on the same page (pun intended)
> > 
> > I'd like to revisit the overall approach here.  Is doing non-cached
> > accesses in both the guest and host really the right thing to do here?
> 
> I think so, but all ideas/approaches are still on the table. This is
> still an RFC.
> 
> > 
> > The semantics of the device becomes that it is cache coherent (because
> > QEMU is), and I think Marc argued that Linux/UEFI should simply be
> > adapted to handle whatever emulated devices we have as coherent.  I also
> > remember someone arguing that would be wrong (Peter?).
> 
> I'm not really for quirking all devices in all guest types (AAVMF, Linux,
> other bootloaders, other OSes). Windows is unlikely to apply any quirks.
> 

Well, my point was that if we're emulating a platform with coherent IO
memory for PCI devices, that is something the guest should work with
as such. But as Paolo explained, it should always be safe for a guest to
assume non-coherent, so that doesn't work.

> > 
> > Finally, does this address all cache coherency issues with emulated
> > devices?  Some VOS guys had seen things still not working with this
> > approach, unsure why...  I'd like to avoid us merging this only to merge
> > a more complete solution in a few weeks which reverts this solution...
> 
> I'm not sure (this is still an RFT too :-) We definitely would need to
> scatter some more memory_region_set_uncached() calls around QEMU first.
> 

It would be good if you could sync with the VOS guys and make sure your
patch set addresses their issues with the appropriate
memory_region_set_uncached() calls added to QEMU, and if it does not, give
some vague idea why that falls outside the scope of this patch set.  After
all, adding a USB controller to a VM is not that esoteric a use case,
is it?

> > 
> > More comments/questions below:
> > 
> > > 
> > > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > > ---
> > >  arch/arm/include/asm/kvm_mmu.h        |  5 ++++-
> > >  arch/arm/include/asm/pgtable-3level.h |  1 +
> > >  arch/arm/include/asm/pgtable.h        |  1 +
> > >  arch/arm/kvm/mmu.c                    | 37 +++++++++++++++++++++++------------
> > >  arch/arm64/include/asm/kvm_mmu.h      |  5 ++++-
> > >  arch/arm64/include/asm/memory.h       |  1 +
> > >  arch/arm64/include/asm/pgtable.h      |  1 +
> > >  7 files changed, 36 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> > > index 405aa18833073..e8034a80b12e5 100644
> > > --- a/arch/arm/include/asm/kvm_mmu.h
> > > +++ b/arch/arm/include/asm/kvm_mmu.h
> > > @@ -214,8 +214,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> > >  	while (size) {
> > >  		void *va = kmap_atomic_pfn(pfn);
> > >  
> > > -		if (need_flush)
> > > +		if (need_flush) {
> > >  			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> > > +			if (ipa_uncached)
> > > +				set_memory_nc((unsigned long)va, 1);
> > 
> > nit: consider moving this outside the need_flush
> > 
> > > +		}
> > >  
> > >  		if (icache_is_pipt())
> > >  			__cpuc_coherent_user_range((unsigned long)va,
> > > diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
> > > index a745a2a53853c..39b3f7a40e663 100644
> > > --- a/arch/arm/include/asm/pgtable-3level.h
> > > +++ b/arch/arm/include/asm/pgtable-3level.h
> > > @@ -121,6 +121,7 @@
> > >   * 2nd stage PTE definitions for LPAE.
> > >   */
> > >  #define L_PTE_S2_MT_UNCACHED		(_AT(pteval_t, 0x0) << 2) /* strongly ordered */
> > > +#define L_PTE_S2_MT_NORMAL_NC		(_AT(pteval_t, 0x5) << 2) /* normal non-cacheable */
> > >  #define L_PTE_S2_MT_WRITETHROUGH	(_AT(pteval_t, 0xa) << 2) /* normal inner write-through */
> > >  #define L_PTE_S2_MT_WRITEBACK		(_AT(pteval_t, 0xf) << 2) /* normal inner write-back */
> > >  #define L_PTE_S2_MT_DEV_SHARED		(_AT(pteval_t, 0x1) << 2) /* device */
> > > diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> > > index f40354198bad4..ae13ca8b0a23d 100644
> > > --- a/arch/arm/include/asm/pgtable.h
> > > +++ b/arch/arm/include/asm/pgtable.h
> > > @@ -100,6 +100,7 @@ extern pgprot_t		pgprot_s2_device;
> > >  #define PAGE_HYP		_MOD_PROT(pgprot_kernel, L_PTE_HYP)
> > >  #define PAGE_HYP_DEVICE		_MOD_PROT(pgprot_hyp_device, L_PTE_HYP)
> > >  #define PAGE_S2			_MOD_PROT(pgprot_s2, L_PTE_S2_RDONLY)
> > > +#define PAGE_S2_NORMAL_NC	__pgprot((pgprot_val(PAGE_S2) & ~L_PTE_S2_MT_MASK) | L_PTE_S2_MT_NORMAL_NC)
> > >  #define PAGE_S2_DEVICE		_MOD_PROT(pgprot_s2_device, L_PTE_S2_RDONLY)
> > >  
> > >  #define __PAGE_NONE		__pgprot(_L_PTE_DEFAULT | L_PTE_RDONLY | L_PTE_XN | L_PTE_NONE)
> > > diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> > > index bc1665acd73e7..6b3bd8061bd2a 100644
> > > --- a/arch/arm/kvm/mmu.c
> > > +++ b/arch/arm/kvm/mmu.c
> > > @@ -1220,7 +1220,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  	struct vm_area_struct *vma;
> > >  	pfn_t pfn;
> > >  	pgprot_t mem_type = PAGE_S2;
> > > -	bool fault_ipa_uncached;
> > > +	bool fault_ipa_uncached = false;
> > >  	bool logging_active = memslot_is_logging(memslot);
> > >  	unsigned long flags = 0;
> > >  
> > > @@ -1300,6 +1300,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  			writable = false;
> > >  	}
> > >  
> > > +	if (memslot->flags & KVM_MEM_UNCACHED) {
> > > +		mem_type = PAGE_S2_NORMAL_NC;
> > > +		fault_ipa_uncached = true;
> > > +	}
> > > +
> > >  	spin_lock(&kvm->mmu_lock);
> > >  	if (mmu_notifier_retry(kvm, mmu_seq))
> > >  		goto out_unlock;
> > > @@ -1307,8 +1312,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > >  	if (!hugetlb && !force_pte)
> > >  		hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
> > >  
> > > -	fault_ipa_uncached = memslot->flags & KVM_MEM_UNCACHED;
> > > -
> > >  	if (hugetlb) {
> > >  		pmd_t new_pmd = pfn_pmd(pfn, mem_type);
> > >  		new_pmd = pmd_mkhuge(new_pmd);
> > > @@ -1462,6 +1465,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> > >  			     unsigned long start,
> > >  			     unsigned long end,
> > >  			     int (*handler)(struct kvm *kvm,
> > > +					    struct kvm_memory_slot *slot,
> > >  					    gpa_t gpa, void *data),
> > >  			     void *data)
> > >  {
> > > @@ -1491,14 +1495,15 @@ static int handle_hva_to_gpa(struct kvm *kvm,
> > >  
> > >  		for (; gfn < gfn_end; ++gfn) {
> > >  			gpa_t gpa = gfn << PAGE_SHIFT;
> > > -			ret |= handler(kvm, gpa, data);
> > > +			ret |= handler(kvm, memslot, gpa, data);
> > >  		}
> > >  	}
> > >  
> > >  	return ret;
> > >  }
> > >  
> > > -static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_unmap_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				 gpa_t gpa, void *data)
> > 
> > Maybe we should consider a pointer to a struct with the relevant data to
> > pass around to the handler by now, which would allow us to get rid of
> > the void * cast as well.  Not sure if it's worth it.
> > 
> > >  {
> > >  	unmap_stage2_range(kvm, gpa, PAGE_SIZE);
> > >  	return 0;
> > > @@ -1527,9 +1532,15 @@ int kvm_unmap_hva_range(struct kvm *kvm,
> > >  	return 0;
> > >  }
> > >  
> > > -static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_set_spte_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				gpa_t gpa, void *data)
> > >  {
> > > -	pte_t *pte = (pte_t *)data;
> > > +	pte_t pte = *((pte_t *)data);
> > > +
> > > +	if (slot->flags & KVM_MEM_UNCACHED)
> > > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2_NORMAL_NC);
> > > +	else
> > > +		pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> > >  
> > >  	/*
> > >  	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
> > > @@ -1538,7 +1549,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  	 * therefore stage2_set_pte() never needs to clear out a huge PMD
> > >  	 * through this calling path.
> > >  	 */
> > > -	stage2_set_pte(kvm, NULL, gpa, pte, 0);
> > > +	stage2_set_pte(kvm, NULL, gpa, &pte, 0);
> > 
> > this is making me feel like we should have a separate patch that changes
> > stage2_set_pte from taking a pointer to just taking a value for the new
> > pte.
> > 
> > >  	return 0;
> > >  }
> > >  
> > > @@ -1546,17 +1557,16 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> > >  {
> > >  	unsigned long end = hva + PAGE_SIZE;
> > > -	pte_t stage2_pte;
> > >  
> > >  	if (!kvm->arch.pgd)
> > >  		return;
> > >  
> > >  	trace_kvm_set_spte_hva(hva);
> > > -	stage2_pte = pfn_pte(pte_pfn(pte), PAGE_S2);
> > > -	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &stage2_pte);
> > > +	handle_hva_to_gpa(kvm, hva, end, &kvm_set_spte_handler, &pte);
> > 
> > > hooking in here will make sure you catch changes to the page used for
> > > the mapping, but wouldn't that also mean that the userspace mapping
> > > would have been changed, and where are you updating this?
> > 
> > Also, is this called if the userspace mapping is zapped without doing
> > anything about the underlying page?  (how do we then catch when the
> > userspace pte is populated again, and is this even possible?)
> 
> I was hoping that I only needed to worry about getting the S2 attributes
> right here, and then, since the page will need to be refaulted into
> the guest anyway, that the userspace part would get taken care of at
> that point (in user_mem_abort).

user_mem_abort handles stage-2 page table faults, which have (almost)
nothing to do with the userspace page tables.

I think it's entirely possible to have a stage-2 mapping to a page,
which is no longer mapped to userspace at all.  Or do we pin the
userspace PTEs (not just keep a reference on the physical page) during
fault handling?

> But, to be honest, I forgot to dig into
> this deep enough to know if my hope will work or not.
> 

You really need to work this out so you feel confident with the overall
scheme here, then I can try to see if I can break it under review, but I
think the author of this patch must know how it is *supposed* to work ;)

> > 
> > >  }
> > >  
> > > -static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +			       gpa_t gpa, void *data)
> > >  {
> > >  	pmd_t *pmd;
> > >  	pte_t *pte;
> > > @@ -1586,7 +1596,8 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > >  	return 0;
> > >  }
> > >  
> > > -static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
> > > +static int kvm_test_age_hva_handler(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > +				    gpa_t gpa, void *data)
> > >  {
> > >  	pmd_t *pmd;
> > >  	pte_t *pte;
> > > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > > index 61505676d0853..af5f0f0eccef9 100644
> > > --- a/arch/arm64/include/asm/kvm_mmu.h
> > > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > @@ -236,8 +236,11 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
> > >  {
> > >  	void *va = page_address(pfn_to_page(pfn));
> > >  
> > > -	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
> > > +	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached) {
> > >  		kvm_flush_dcache_to_poc(va, size);
> > > +		if (ipa_uncached)
> > > +			set_memory_nc((unsigned long)va, size/PAGE_SIZE);
> > 
> > are you not setting the kernel mapping of the page to non-cached here,
> > which doesn't affect your userspace mappings at all?
> 
> Oh crap. I shouldn't have tried to use change_memory_common... I
> completely overlooked the fact I'm now using the wrong mm...
> 

and the wrong va...
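
The underlying point (memory attributes belong to a mapping, not to the
page behind it) can be shown with a small userspace analogy. This is only
a sketch, not kernel code: it maps the same page of a temporary file
twice, changes the protection of one mapping with mprotect(), and checks
that the other mapping is unaffected. The function name
attrs_are_per_mapping() is made up for illustration. In the same way,
changing the kernel's linear-map attributes for a page leaves QEMU's own
mapping of that page untouched.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the same page of a temporary file twice, tighten the protection
 * of the first mapping only, then write through the second one.
 * Returns 0 if the second mapping was unaffected, -1 on any failure.
 * (Hypothetical helper, for illustration only.)
 */
static int attrs_are_per_mapping(void)
{
	long page = sysconf(_SC_PAGESIZE);
	FILE *f = tmpfile();
	char *a, *b;
	int ret = -1;

	if (!f)
		return -1;
	if (page <= 0 || ftruncate(fileno(f), page) != 0)
		goto out_close;

	a = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fileno(f), 0);
	if (a == MAP_FAILED)
		goto out_close;
	b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fileno(f), 0);
	if (b == MAP_FAILED)
		goto out_unmap_a;

	/* Change the attributes of mapping 'a' only. */
	if (mprotect(a, page, PROT_READ) != 0)
		goto out_unmap_b;

	/* 'b' is still writable, and 'a' sees the data: one page, two mappings. */
	strcpy(b, "hello");
	if (strcmp(a, "hello") == 0)
		ret = 0;

out_unmap_b:
	munmap(b, page);
out_unmap_a:
	munmap(a, page);
out_close:
	fclose(f);
	return ret;
}
```

Calling attrs_are_per_mapping() should return 0 on a typical Linux
system, since mprotect() only touches the page table entries of the
range it is given.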

> > 
> > (this would explain why things still break with this series).
> 
> yeah, I wonder why it works so well?
> 

My guess would be luck, or that the patch slowed things down or reordered
operations enough to make the problem less likely.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 0/3] KVM: Introduce KVM_MEM_UNCACHED
  2015-05-14 13:36           ` Andrew Jones
@ 2015-05-15 15:09             ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-15 15:09 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Peter Maydell, Ard Biesheuvel, Marc Zyngier, Catalin Marinas,
	Alexander Graf, QEMU Developers, Paolo Bonzini, Laszlo Ersek,
	kvmarm

On Thu, May 14, 2015 at 03:36:37PM +0200, Andrew Jones wrote:
> On Thu, May 14, 2015 at 02:11:59PM +0100, Peter Maydell wrote:
> > On 14 May 2015 at 14:03, Andrew Jones <drjones@redhat.com> wrote:
> > > On Thu, May 14, 2015 at 11:37:46AM +0100, Peter Maydell wrote:
> > >> On 14 May 2015 at 11:31, Andrew Jones <drjones@redhat.com> wrote:
> > >> > Forgot to (4): switch from setting userspace's mapping to
> > >> > device memory to normal, non-cacheable. Using device memory
> > >> > caused a problem that Alex Graf found, and Peter Maydell suggested
> > >> > using normal, non-cacheable instead.
> > >>
> > >> Did you check that non-cacheable is definitely the correct
> > >> kind of Normal memory attribute we want? (ie not write-through).
> > >
> > > I was concerned that write-through wouldn't be sufficient. If the
> > > guest writes to its non-cached memory, and QEMU needs to see what
> > > > it wrote, then won't write-through fail to work? Unless we somehow
> > > > invalidate the cache first?
> > 
> > Well, I meant more that the correct mapping for userspace is
> > the same as the guest, whatever that is, and so somebody needs
> > to look at what the guest actually does rather than merely
> > hoping NormalNC is OK. (For instance, do we need to provide
> > support for QEMU to map both NC and writethrough?)
> >
> 
> Ah, we assume the guest is mapping it as device memory, and in
> this version of the series, I ensure that it is at least NC with
> the S2 attributes. I don't think we can look at what some guests
> do with some devices to come up with anything beyond (poor?)
> heuristics. I prefer that we force both the guest and QEMU to NC
> (or guest chooses Device and QEMU is forced to NC) to make sure
> we get it right.
> 
But picking up on Peter's feedback, I think it would be good if the
series clearly stated something like:

1) We assume that the guest may use device type memory for the accesses.
2) We cannot use device memory for the userspace mapping because
   userspace may be doing unaligned accesses to it.
3) Normal non-cacheable bridges these worlds because of x, y, and z.

I assume x, y, and z would include a fairly involved discussion of the
interesting aspects of how you can configure memory accesses on ARM ...
:)
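
As one possible starting point for that discussion, the precedence rule
the series relies on (when S1 and S2 attributes combine, the less
cacheable type wins) can be written down as a toy model. This is a
sketch under simplifying assumptions: the enum values below are
illustrative orderings, not hardware encodings, and the helper names
combine_s1_s2() and bypasses_cache() are made up here.

```c
#include <assert.h>

/*
 * Toy model of how stage-1 (guest) and stage-2 (hypervisor) memory
 * types combine on ARM. Types are ordered from least to most
 * restrictive; the values are illustrative, not hardware encodings.
 */
enum mem_type {
	MT_NORMAL_WB = 0,	/* Normal, write-back cacheable */
	MT_NORMAL_WT,		/* Normal, write-through cacheable */
	MT_NORMAL_NC,		/* Normal, non-cacheable */
	MT_DEVICE,		/* Device */
};

/* The more restrictive of the two stages wins. */
static enum mem_type combine_s1_s2(enum mem_type s1, enum mem_type s2)
{
	return s1 > s2 ? s1 : s2;
}

/* Returns 1 if accesses of this combined type are non-cacheable. */
static int bypasses_cache(enum mem_type t)
{
	return t >= MT_NORMAL_NC;
}
```

With S2 forced to MT_NORMAL_NC, a guest that maps the region as
write-back ends up Normal-NC, and a guest that maps it as Device stays
Device. Either way the guest's accesses are non-cacheable, which is what
lets a Normal-NC userspace mapping stay coherent with it.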

-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
  2015-05-15 15:02         ` Christoffer Dall
@ 2015-05-15 17:04           ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-15 17:04 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Fri, May 15, 2015 at 08:02:59AM -0700, Christoffer Dall wrote:
> On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
> > On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
> > > On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> > > > When S1 and S2 memory attributes combine wrt to caching policy,
> > > > non-cacheable types take precedence. If a guest maps a region as
> > > > device memory, which KVM userspace is using to emulate the device
> > > > using normal, cacheable memory, then we lose coherency. With
> > > > KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> > > > regions are likely to be problematic. With this patch, as pages
> > > > of these types of regions are faulted into the guest, not only do
> > > > we flush the page's dcache, but we also change userspace's
> > > > mapping to NC in order to maintain coherency.
> > > > 
> > > > What if the guest doesn't do what we expect? While we can't
> > > > force a guest to use cacheable memory, we can take advantage of
> > > > the non-cacheable precedence, and force it to use non-cacheable.
> > > > So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> > > > KVM_MEM_UNCACHED regions to force them to NC.
> > > > 
> > > > We now have both guest and userspace on the same page (pun intended)
> > > 
> > > I'd like to revisit the overall approach here.  Is doing non-cached
> > > accesses in both the guest and host really the right thing to do here?
> > 
> > I think so, but all ideas/approaches are still on the table. This is
> > still an RFC.
> > 
> > > 
> > > The semantics of the device becomes that it is cache coherent (because
> > > QEMU is), and I think Marc argued that Linux/UEFI should simply be
> > > adapted to handle whatever emulated devices we have as coherent.  I also
> > > remember someone arguing that would be wrong (Peter?).
> > 
> > I'm not really for quirking all devices in all guest types (AAVMF, Linux,
> > other bootloaders, other OSes). Windows is unlikely to apply any quirks.
> > 
> 
> Well my point was that if we're emulating a platform with coherent IO
> memory for PCI devices that is something that the guest should work with
> as such, but as Paolo explained it should always be safe for a guest to
> assume non-coherent, so that doesn't work.
> 
> > > 
> > > Finally, does this address all cache coherency issues with emulated
> > > devices?  Some VOS guys had seen things still not working with this
> > > approach, unsure why...  I'd like to avoid us merging this only to merge
> > > a more complete solution in a few weeks which reverts this solution...
> > 
> > I'm not sure (this is still an RFT too :-) We definitely would need to
> > scatter some more memory_region_set_uncached() calls around QEMU first.
> > 
> 
> It would be good if you could sync with the VOS guys and make sure your
> patch set addresses their issues with the appropriate
> memory_region_set_uncached() added to QEMU, and if it does not, some
> vague idea why that falls outside of the scope of this patch set.  After
> all, adding a USB controller to a VM is not that an esoteric use case,
> is it?

I'll pull together a new version addressing all your comments, and also
put some more time into making sure it'll work...

Jeremy, can you give me the qemu command line you're using for your tests?
I'll do some experimenting with it.

Thanks,
drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
  2015-05-15 17:04           ` Andrew Jones
@ 2015-05-15 20:16             ` Jérémy Fanguède
  -1 siblings, 0 replies; 102+ messages in thread
From: Jérémy Fanguède @ 2015-05-15 20:16 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Peter Maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	agraf, QEMU Developers, Paolo Bonzini, lersek, kvmarm,
	Christoffer Dall, m.smarduch

On Fri, May 15, 2015 at 7:04 PM, Andrew Jones <drjones@redhat.com> wrote:
> On Fri, May 15, 2015 at 08:02:59AM -0700, Christoffer Dall wrote:
>> On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
>> > On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
>> > > On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
>> > > > When S1 and S2 memory attributes combine wrt to caching policy,
>> > > > non-cacheable types take precedence. If a guest maps a region as
>> > > > device memory, which KVM userspace is using to emulate the device
>> > > > using normal, cacheable memory, then we lose coherency. With
>> > > > KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
>> > > > regions are likely to be problematic. With this patch, as pages
>> > > > of these types of regions are faulted into the guest, not only do
>> > > > we flush the page's dcache, but we also change userspace's
>> > > > mapping to NC in order to maintain coherency.
>> > > >
>> > > > What if the guest doesn't do what we expect? While we can't
>> > > > force a guest to use cacheable memory, we can take advantage of
>> > > > the non-cacheable precedence, and force it to use non-cacheable.
>> > > > So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
>> > > > KVM_MEM_UNCACHED regions to force them to NC.
>> > > >
>> > > > We now have both guest and userspace on the same page (pun intended)
>> > >
>> > > I'd like to revisit the overall approach here.  Is doing non-cached
>> > > accesses in both the guest and host really the right thing to do here?
>> >
>> > I think so, but all ideas/approaches are still on the table. This is
>> > still an RFC.
>> >
>> > >
>> > > The semantics of the device becomes that it is cache coherent (because
>> > > QEMU is), and I think Marc argued that Linux/UEFI should simply be
>> > > adapted to handle whatever emulated devices we have as coherent.  I also
>> > > remember someone arguing that would be wrong (Peter?).
>> >
>> > I'm not really for quirking all devices in all guest types (AAVMF, Linux,
>> > other bootloaders, other OSes). Windows is unlikely to apply any quirks.
>> >
>>
>> Well my point was that if we're emulating a platform with coherent IO
>> memory for PCI devices that is something that the guest should work with
>> as such, but as Paolo explained it should always be safe for a guest to
>> assume non-coherent, so that doesn't work.
>>
>> > >
>> > > Finally, does this address all cache coherency issues with emulated
>> > > devices?  Some VOS guys had seen things still not working with this
>> > > approach, unsure why...  I'd like to avoid us merging this only to merge
>> > > a more complete solution in a few weeks which reverts this solution...
>> >
>> > I'm not sure (this is still an RFT too :-) We definitely would need to
>> > scatter some more memory_region_set_uncached() calls around QEMU first.
>> >
>>
>> It would be good if you could sync with the VOS guys and make sure your
>> patch set addresses their issues with the appropriate
>> memory_region_set_uncached() added to QEMU, and if it does not, some
>> vague idea why that falls outside of the scope of this patch set.  After
>> all, adding a USB controller to a VM is not that an esoteric use case,
>> is it?
>
> I'll pull together a new version addressing all your comments, and also
> put some more time into making sure it'll work...
>
> Jeremy, can you give me the qemu command line you're using for your tests?
> I'll do some experimenting with it.

Hi Andrew,

Here is the command line that I used:

./qemu-system-arm -m 512 -machine type=virt \
    -enable-kvm -cpu host \
    -nographic \
    -kernel zImage \
    -drive if=none,file=ubuntu.img,id=fs,format=raw \
    -device virtio-blk-device,drive=fs \
    -usb -device usb-ehci \
    -device usb-tablet \
    -append "console=ttyAMA0 root=/dev/vda rw"

I also ran into trouble with other devices, so the test can be
extended with some of them:
    -netdev type=user,id=net0 -device e1000,netdev=net0 \
    -device nec-usb-xhci,id=xhci \
    -device usb-kbd,bus=xhci.0 \
    -drive if=scsi,file=disk.img,format=raw \
    -device lsi53c895a \

I tried it with some arm32 boards, but also on arm64 (Juno board).


Jeremy

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
@ 2015-05-15 20:16             ` Jérémy Fanguède
  0 siblings, 0 replies; 102+ messages in thread
From: Jérémy Fanguède @ 2015-05-15 20:16 UTC (permalink / raw)
  To: Andrew Jones
  Cc: ard.biesheuvel, marc.zyngier, catalin.marinas, QEMU Developers,
	Paolo Bonzini, lersek, kvmarm

On Fri, May 15, 2015 at 7:04 PM, Andrew Jones <drjones@redhat.com> wrote:
> On Fri, May 15, 2015 at 08:02:59AM -0700, Christoffer Dall wrote:
>> On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
>> > On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
>> > > On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
>> > > > When S1 and S2 memory attributes combine wrt to caching policy,
>> > > > non-cacheable types take precedence. If a guest maps a region as
>> > > > device memory, which KVM userspace is using to emulate the device
>> > > > using normal, cacheable memory, then we lose coherency. With
>> > > > KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
>> > > > regions are likely to be problematic. With this patch, as pages
>> > > > of these types of regions are faulted into the guest, not only do
>> > > > we flush the page's dcache, but we also change userspace's
>> > > > mapping to NC in order to maintain coherency.
>> > > >
>> > > > What if the guest doesn't do what we expect? While we can't
>> > > > force a guest to use cacheable memory, we can take advantage of
>> > > > the non-cacheable precedence, and force it to use non-cacheable.
>> > > > So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
>> > > > KVM_MEM_UNCACHED regions to force them to NC.
>> > > >
>> > > > We now have both guest and userspace on the same page (pun intended)
>> > >
>> > > I'd like to revisit the overall approach here.  Is doing non-cached
>> > > accesses in both the guest and host really the right thing to do here?
>> >
>> > I think so, but all ideas/approaches are still on the table. This is
>> > still an RFC.
>> >
>> > >
>> > > The semantics of the device becomes that it is cache coherent (because
>> > > QEMU is), and I think Marc argued that Linux/UEFI should simply be
>> > > adapted to handle whatever emulated devices we have as coherent.  I also
>> > > remember someone arguing that would be wrong (Peter?).
>> >
>> > I'm not really for quirking all devices in all guest types (AAVMF, Linux,
>> > other bootloaders, other OSes). Windows is unlikely to apply any quirks.
>> >
>>
>> Well my point was that if we're emulating a platform with coherent IO
>> memory for PCI devices that is something that the guest should work with
>> as such, but as Paolo explained it should always be safe for a guest to
>> assume non-coherent, so that doesn't work.
>>
>> > >
>> > > Finally, does this address all cache coherency issues with emulated
>> > > devices?  Some VOS guys had seen things still not working with this
>> > > approach, unsure why...  I'd like to avoid us merging this only to merge
>> > > a more complete solution in a few weeks which reverts this solution...
>> >
>> > I'm not sure (this is still an RFT too :-) We definitely would need to
>> > scatter some more memory_region_set_uncached() calls around QEMU first.
>> >
>>
>> It would be good if you could sync with the VOS guys and make sure your
>> patch set addresses their issues with the appropriate
>> memory_region_set_uncached() added to QEMU, and if it does not, some
>> vague idea why that falls outside of the scope of this patch set.  After
>> all, adding a USB controller to a VM is not that esoteric a use case,
>> is it?
>
> I'll pull together a new version addressing all your comments, and also
> put some more time into making sure it'll work...
>
> Jeremy, can you give me the qemu command line you're using for your tests?
> I'll do some experimenting with it.

Hi Andrew,

Here is the command line that I used:

./qemu-system-arm -m 512 -machine type=virt \
    -enable-kvm -cpu host \
    -nographic \
    -kernel zImage \
    -drive if=none,file=ubuntu.img,id=fs,format=raw \
    -device virtio-blk-device,drive=fs \
    -usb -device usb-ehci \
    -device usb-tablet \
    -append "console=ttyAMA0 root=/dev/vda rw"

I also ran into trouble with other devices, so the test can be
extended with some of them:
    -netdev type=user,id=net0 -device e1000,netdev=net0 \
    -device nec-usb-xhci,id=xhci \
    -device usb-kbd,bus=xhci.0 \
    -drive if=scsi,file=disk.img,format=raw \
    -device lsi53c895a \

I tried it with some arm32 boards, but also on arm64 (Juno board).


Jeremy

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-14 13:46       ` Andrew Jones
@ 2015-05-18 15:53         ` Catalin Marinas
  -1 siblings, 0 replies; 102+ messages in thread
From: Catalin Marinas @ 2015-05-18 15:53 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, Marc Zyngier, qemu-devel, agraf,
	pbonzini, j.fanguede, lersek, kvmarm, Christoffer Dall,
	m.smarduch

On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> > On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> > > Provide a method to change normal, cacheable memory to non-cacheable.
> > > KVM will make use of this to keep emulated device memory regions
> > > coherent with the guest.
> > > 
> > > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > 
> > Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> > 
> > But you obviously need Russell and Will/Catalin to ack/merge this.
> 
> I guess this patch is going to go away in the next round. You've
> pointed out that I screwed stuff up royally with my over eagerness
> to reuse code. I need to reimplement change_memory_common, but a
> version that takes an mm, which is more or less what I did in the
> last version of this series, back when I was pinning pages.

I kept wondering what this patch and the next one are doing with
set_memory_nc(). Basically you were trying to set the cache attributes
for the kernel linear mapping or kmap address (in the 32-bit arm case,
which is removed anyway on kunmap).

What you need is changing the attributes of the user mapping as accessed
by Qemu but I don't think simply re-implementing change_memory_common()
would work, you probably need to pin the pages in memory as well.
Otherwise, the kernel may remove such pages and, when bringing them
back, would set the default cacheability attributes.

Another way would be to split the vma containing the non-cacheable
memory so that you get a single vma with the vm_page_prot as
Non-cacheable.

Yet another approach could be for KVM to mmap the necessary memory for
Qemu via a file_operations.mmap call (but that's only for ranges outside
the guest "RAM").

I didn't have time to follow these threads in detail, but just to
recap my understanding, we have two main use-cases:

1. Qemu handling guest I/O to device (e.g. PCIe BARs)
2. Qemu emulating device DMA

For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
this memory slot. The memory attributes in this case could be Device
because that's how the guest would normally map it. The
file_operations.mmap trick would work in this case but this means
expanding the KVM ABI beyond just an ioctl().

For (2), since Qemu is writing to the guest "RAM" (e.g. video
framebuffer allocated by the guest), I still think the simplest is to
tell the guest (via DT) that such device is cache coherent rather than
trying to remap the Qemu mapping as non-cacheable.
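
For case (1) above, the userspace side can be sketched roughly as follows.
This is a minimal illustration, not code from Qemu or from this series: the
helper names describe_slot() and alloc_backing() are made up for the example,
and only struct kvm_userspace_memory_region and the
KVM_SET_USER_MEMORY_REGION ioctl named in the comment come from
<linux/kvm.h>. The flags argument is where a flag such as the
KVM_MEM_UNCACHED proposed by this RFC would be passed; its value is
deliberately not defined here.

```c
#define _DEFAULT_SOURCE
#include <linux/kvm.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Fill in the memslot description that userspace would hand to KVM with
 * ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region). No ioctl is issued
 * here, so the sketch does not need /dev/kvm.
 * "flags" is where a proposed flag such as KVM_MEM_UNCACHED would go
 * (assumption: the caller supplies whatever value the series defines).
 */
static struct kvm_userspace_memory_region
describe_slot(uint32_t slot, uint64_t guest_phys, void *host, uint64_t size,
              uint32_t flags)
{
    struct kvm_userspace_memory_region r;

    memset(&r, 0, sizeof(r));
    r.slot = slot;
    r.flags = flags;
    r.guest_phys_addr = guest_phys;
    r.memory_size = size;
    r.userspace_addr = (uint64_t)(uintptr_t)host;
    return r;
}

/* Back a region with anonymous memory, as Qemu is guessed to do above. */
static void *alloc_backing(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

A caller would allocate the backing store with alloc_backing(), build the
region with describe_slot(), and then pass it to the ioctl on the VM file
descriptor. The userspace_addr field is the Qemu mapping whose attributes
the rest of this thread is trying to keep consistent with the guest's.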

-- 
Catalin

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-18 15:53         ` Catalin Marinas
@ 2015-05-19 10:03           ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-19 10:03 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: peter.maydell, ard.biesheuvel, Marc Zyngier, qemu-devel, agraf,
	pbonzini, j.fanguede, lersek, kvmarm, Christoffer Dall,
	m.smarduch


Hi Catalin,

Thanks for the feedback. Some comments to your comments below.

On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
> > On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> > > On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> > > > Provide a method to change normal, cacheable memory to non-cacheable.
> > > > KVM will make use of this to keep emulated device memory regions
> > > > coherent with the guest.
> > > > 
> > > > Signed-off-by: Andrew Jones <drjones@redhat.com>
> > > 
> > > Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> > > 
> > > But you obviously need Russell and Will/Catalin to ack/merge this.
> > 
> > I guess this patch is going to go away in the next round. You've
> > pointed out that I screwed stuff up royally with my over eagerness
> > to reuse code. I need to reimplement change_memory_common, but a
> > version that takes an mm, which is more or less what I did in the
> > last version of this series, back when I was pinning pages.
> 
> I kept wondering what this patch and the next one are doing with
> set_memory_nc(). Basically you were trying to set the cache attributes
> for the kernel linear mapping or kmap address (in the 32-bit arm case,
> which is removed anyway on kunmap).

Yeah, I was way too hasty with this version... Sorry for wasting people's
time with it.

> 
> What you need is changing the attributes of the user mapping as accessed
> by Qemu but I don't think simply re-implementing change_memory_common()
> would work, you probably need to pin the pages in memory as well.
> Otherwise, the kernel may remove such pages and, when bringing them
> back, would set the default cacheability attributes.

Yup, I read that code and saw it would inherit the memory attributes
from the vma. At the time I wasn't thinking about the userspace mapping
though, and thus had convinced myself that if "the" mapping goes away,
then we'll be invalidating the guest's mapping too, and thus we'd end up
faulting it in again when necessary, and at that time resetting the
attributes. If the guest never faulted it in again, then it wouldn't
have mattered what the attributes were anyway. Of course I was thinking
about the wrong mapping...

> 
> Another way would be to split the vma containing the non-cacheable
> memory so that you get a single vma with the vm_page_prot as
> Non-cacheable.

This sounds interesting. Actually, it even crossed my mind once when I
first saw that the vma would overwrite the attributes, but then, sigh,
I let my brain take a stupidity bath.

> 
> Yet another approach could be for KVM to mmap the necessary memory for
> Qemu via a file_operations.mmap call (but that's only for ranges outside
> the guest "RAM").

I guess I prefer the vma splitting, rather than this (the vma creating
with mmap), as it keeps the KVM interface from changing (as you point out
below). Well, unless there are other advantages to this that are worth
considering?

> 
> I didn't have time to follow these threads in details, but just to
> recap my understanding, we have two main use-cases:
> 
> 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> 2. Qemu emulating device DMA
> 
> For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> this memory slot. The memory attributes in this case could be Device
> because that's how the guest would normally map it. The
> file_operations.mmap trick would work in this case but this means
> expanding the KVM ABI beyond just an ioctl().
> 
> For (2), since Qemu is writing to the guest "RAM" (e.g. video
> framebuffer allocated by the guest), I still think the simplest is to
> tell the guest (via DT) that such device is cache coherent rather than
> trying to remap the Qemu mapping as non-cacheable.

If we need a solution for (1), then I'd prefer that it work and be
applied to (2) as well. Anyway, I'm still not 100% sure we can count on
all guest types (bootloaders, different OSes) to listen to us. They may
assume non-cacheable is typical and safe, and thus just do that always.
We can certainly change some of those bootloaders and OSes, but probably
not all of them.

Thanks,
drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-19 10:03           ` Andrew Jones
@ 2015-05-19 11:18             ` Catalin Marinas
  -1 siblings, 0 replies; 102+ messages in thread
From: Catalin Marinas @ 2015-05-19 11:18 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, Marc Zyngier, qemu-devel, agraf,
	pbonzini, j.fanguede, lersek, kvmarm, Christoffer Dall,
	m.smarduch

On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > Another way would be to split the vma containing the non-cacheable
> > memory so that you get a single vma with the vm_page_prot as
> > Non-cacheable.
> 
> This sounds interesting. Actually, it even crossed my mind once when I
> first saw that the vma would overwrite the attributes, but then, sigh,
> I let my brain take a stupidity bath.
> 
> > 
> > Yet another approach could be for KVM to mmap the necessary memory for
> > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > the guest "RAM").
> 
> I guess I prefer the vma splitting, rather than this (the vma creating
> with mmap), as it keeps the KVM interface from changing (as you point out
> below). Well, unless there are other advantages to this that are worth
> considering?

The advantage is that you don't need to deal with the mm internals in
the KVM code.

But you can probably add such code directly to mm/ and reuse some of the
existing code in there already as part of change_protection(),
mprotect_fixup(), sys_mprotect(). Actually, once you split the vma and
set the new protection (something similar to mprotect_fixup), it looks
to me like you can just call change_protection(vma->vm_page_prot).
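
The vma splitting that mprotect already performs can be observed from
userspace. The sketch below is only an analogy for the in-kernel work
discussed here, under the assumption of a Linux system with /proc mounted:
it changes the access protection of one page, whereas the proposal is to
change the cacheability attributes in vm_page_prot. The function names
count_vmas() and split_middle_page() are made up for the example.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the entries in /proc/self/maps that overlap [start, start + len). */
static int count_vmas(const void *start, size_t len)
{
    uintptr_t lo = (uintptr_t)start, hi = lo + len;
    unsigned long a, b;
    char line[512];
    int n = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "%lx-%lx", &a, &b) == 2 && a < hi && b > lo)
            n++;
    fclose(f);
    return n;
}

/*
 * Map three pages, change the protection of the middle one, and return
 * the number of vmas covering the mapping afterwards (-1 on error).
 */
static int split_middle_page(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int n;

    if (p == MAP_FAILED)
        return -1;
    /* The middle page now needs its own vma with a different protection. */
    if (mprotect(p + page, page, PROT_READ) != 0) {
        munmap(p, 3 * page);
        return -1;
    }
    n = count_vmas(p, 3 * page);
    munmap(p, 3 * page);
    return n;
}
```

On a typical Linux system split_middle_page() should report three vmas: the
two read-write pages on either side and the read-only page in the middle,
which can no longer be merged with its neighbours.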

> > I didn't have time to follow these threads in details, but just to
> > recap my understanding, we have two main use-cases:
> > 
> > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > 2. Qemu emulating device DMA
> > 
> > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > this memory slot. The memory attributes in this case could be Device
> > because that's how the guest would normally map it. The
> > file_operations.mmap trick would work in this case but this means
> > expanding the KVM ABI beyond just an ioctl().
> > 
> > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > framebuffer allocated by the guest), I still think the simplest is to
> > tell the guest (via DT) that such device is cache coherent rather than
> > trying to remap the Qemu mapping as non-cacheable.
> 
> If we need a solution for (1), then I'd prefer that it work and be
> applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> all guest types (bootloaders, different OSes) to listen to us. They may
> assume non-cacheable is typical and safe, and thus just do that always.
> We can certainly change some of those bootloaders and OSes, but probably
> not all of them.

That's fine by me. Once you get the vma splitting and attributes
changing done, I think you get the second one for free.

Do we want to differentiate between Device and Normal Non-cacheable
memory? Something like KVM_MEMSLOT_DEVICE?

Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
pgprot_noncached() returns Strongly Ordered memory. For Normal
Non-cacheable we used pgprot_writecombine (e.g. a video framebuffer).

Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to
ensure that guest and host access it coherently (which would mean
writecombine for ARM). That's similar naming to functions like
dma_alloc_coherent() that return cacheable or non-cacheable memory based
on what the device supports. Anyway, I'm not too bothered with the
naming.

-- 
Catalin

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-19 11:18             ` Catalin Marinas
@ 2015-05-19 11:38               ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-19 11:38 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: peter.maydell, ard.biesheuvel, Marc Zyngier, qemu-devel, agraf,
	pbonzini, j.fanguede, lersek, kvmarm, Christoffer Dall,
	m.smarduch

On Tue, May 19, 2015 at 12:18:54PM +0100, Catalin Marinas wrote:
> On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> > On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > > Another way would be to split the vma containing the non-cacheable
> > > memory so that you get a single vma with the vm_page_prot as
> > > Non-cacheable.
> > 
> > This sounds interesting. Actually, it even crossed my mind once when I
> > first saw that the vma would overwrite the attributes, but then, sigh,
> > I let my brain take a stupidity bath.
> > 
> > > 
> > > Yet another approach could be for KVM to mmap the necessary memory for
> > > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > > the guest "RAM").
> > 
> > I guess I prefer the vma splitting, rather than this (the vma creating
> > with mmap), as it keeps the KVM interface from changing (as you point out
> > below). Well, unless there are other advantages to this that are worth
> > considering?
> 
> The advantage is that you don't need to deal with the mm internals in
> the KVM code.
> 
> But you can probably add such code directly to mm/ and reuse some of the
> existing code in there already as part of change_protection(),
> mprotect_fixup(), sys_mprotect(). Actually, once you split the vma and
> set the new protection (something similar to mprotect_fixup), it looks
> to me like you can just call change_protection(vma->vm_page_prot).

I'll start playing around with this today.

> 
> > > I didn't have time to follow these threads in details, but just to
> > > recap my understanding, we have two main use-cases:
> > > 
> > > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > > 2. Qemu emulating device DMA
> > > 
> > > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > > this memory slot. The memory attributes in this case could be Device
> > > because that's how the guest would normally map it. The
> > > file_operations.mmap trick would work in this case but this means
> > > expanding the KVM ABI beyond just an ioctl().
> > > 
> > > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > > framebuffer allocated by the guest), I still think the simplest is to
> > > tell the guest (via DT) that such device is cache coherent rather than
> > > trying to remap the Qemu mapping as non-cacheable.
> > 
> > If we need a solution for (1), then I'd prefer that it work and be
> > applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> > all guest types (bootloaders, different OSes) to listen to us. They may
> > assume non-cacheable is typical and safe, and thus just do that always.
> > We can certainly change some of those bootloaders and OSes, but probably
> > not all of them.
> 
> That's fine by me. Once you get the vma splitting and attributes
> changing done, I think you get the second one for free.
> 
> Do we want to differentiate between Device and Normal Non-cacheable
> memory? Something like KVM_MEMSLOT_DEVICE?
> 
> Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
> pgprot_noncached() returns Strongly Ordered memory. For Normal
> Non-cacheable we used pgprot_writecombine (e.g. a video framebuffer).
> 
> Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to

Sounds good to me. I'll rename for the next round.

> ensure that guest and host access it coherently (which would mean
> writecombine for ARM). That's similar naming to functions like
> dma_alloc_coherent() that return cacheable or non-cacheable memory based
> on what the device supports. Anyway, I'm not too bothered with the
> naming.
>

Thanks for your help!

drew 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
@ 2015-05-19 11:38               ` Andrew Jones
  0 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-19 11:38 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: ard.biesheuvel, Marc Zyngier, qemu-devel, pbonzini, lersek, kvmarm

On Tue, May 19, 2015 at 12:18:54PM +0100, Catalin Marinas wrote:
> On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> > On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > > Another way would be to split the vma containing the non-cacheable
> > > memory so that you get a single vma with the vm_page_prot as
> > > Non-cacheable.
> > 
> > This sounds interesting. Actually, it even crossed my mind once when I
> > first saw that the vma would overwrite the attributes, but then, sigh,
> > I let my brain take a stupidity bath.
> > 
> > > 
> > > Yet another approach could be for KVM to mmap the necessary memory for
> > > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > > the guest "RAM").
> > 
> > I guess I prefer the vma splitting, rather than this (the vma creating
> > with mmap), as it keeps the KVM interface from changing (as you point out
> > below). Well, unless there are other advantages to this that are worth
> > considering?
> 
> The advantage is that you don't need to deal with the mm internals in
> the KVM code.
> 
> But you can probably add such code directly to mm/ and reuse some of the
> existing code in there already as part of change_protection(),
> mprotect_fixup(), sys_mprotect(). Actually, once you split the vma and
> set the new protection (something similar to mprotect_fixup), it looks
> to me like you can just call change_protection(vma->vm_page_prot).

I'll start playing around with this today.

> 
> > > I didn't have time to follow these threads in details, but just to
> > > recap my understanding, we have two main use-cases:
> > > 
> > > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > > 2. Qemu emulating device DMA
> > > 
> > > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > > this memory slot. The memory attributes in this case could be Device
> > > because that's how the guest would normally map it. The
> > > file_operations.mmap trick would work in this case but this means
> > > expanding the KVM ABI beyond just an ioctl().
> > > 
> > > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > > framebuffer allocated by the guest), I still think the simplest is to
> > > tell the guest (via DT) that such device is cache coherent rather than
> > > trying to remap the Qemu mapping as non-cacheable.
> > 
> > If we need a solution for (1), then I'd prefer that it work and be
> > applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> > all guest types (booloaders, different OSes) to listen to us. They may
> > assume non-cacheable is typical and safe, and thus just do that always.
> > We can certainly change some of those bootloaders and OSes, but probably
> > not all of them.
> 
> That's fine by me. Once you get the vma splitting and attributes
> changing done, I think you get the second one for free.
> 
> Do we want to differentiate between Device and Normal Non-cacheable
> memory? Something like KVM_MEMSLOT_DEVICE?
> 
> Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
> pgprot_noncached() returns Strongly Ordered memory. For Normal
> Non-cachable we used pgprot_writecombine (e.g. a video framebuffer).
> 
> Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to

Sounds good to me. I'll rename for the next round.

> ensure that guest and host access it coherently (which would mean
> writecombine for ARM). That's similar naming to functions like
> dma_alloc_coherent() that return cacheable or non-cacheable memory based
> on what the device supports. Anyway, I'm not too bothered with the
> naming.
>

Thanks for your help!

drew 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-19 11:18             ` Catalin Marinas
@ 2015-05-20 10:01               ` Christoffer Dall
  -1 siblings, 0 replies; 102+ messages in thread
From: Christoffer Dall @ 2015-05-20 10:01 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, Marc Zyngier,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Tue, May 19, 2015 at 12:18:54PM +0100, Catalin Marinas wrote:
> On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> > On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > > Another way would be to split the vma containing the non-cacheable
> > > memory so that you get a single vma with the vm_page_prot as
> > > Non-cacheable.
> > 
> > This sounds interesting. Actually, it even crossed my mind once when I
> > first saw that the vma would overwrite the attributes, but then, sigh,
> > I let my brain take a stupidity bath.
> > 
> > > 
> > > Yet another approach could be for KVM to mmap the necessary memory for
> > > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > > the guest "RAM").
> > 
> > I guess I prefer the vma splitting, rather than this (the vma creating
> > with mmap), as it keeps the KVM interface from changing (as you point out
> > below). Well, unless there are other advantages to this that are worth
> > considering?
> 
> The advantage is that you don't need to deal with the mm internals in
> the KVM code.
> 
> But you can probably add such code directly to mm/ and reuse some of the
> existing code in there already as part of change_protection(),
> mprotect_fixup(), sys_mprotect(). Actually, once you split the vma and
> set the new protection (something similar to mprotect_fixup), it looks
> to me like you can just call change_protection(vma->vm_page_prot).
> 
> > > I didn't have time to follow these threads in detail, but just to
> > > recap my understanding, we have two main use-cases:
> > > 
> > > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > > 2. Qemu emulating device DMA
> > > 
> > > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > > this memory slot. The memory attributes in this case could be Device
> > > because that's how the guest would normally map it. The
> > > file_operations.mmap trick would work in this case but this means
> > > expanding the KVM ABI beyond just an ioctl().
> > > 
> > > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > > framebuffer allocated by the guest), I still think the simplest is to
> > > tell the guest (via DT) that such device is cache coherent rather than
> > > trying to remap the Qemu mapping as non-cacheable.
> > 
> > If we need a solution for (1), then I'd prefer that it work and be
> > applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> > all guest types (bootloaders, different OSes) to listen to us. They may
> > assume non-cacheable is typical and safe, and thus just do that always.
> > We can certainly change some of those bootloaders and OSes, but probably
> > not all of them.
> 
> That's fine by me. Once you get the vma splitting and attributes
> changing done, I think you get the second one for free.
> 
> Do we want to differentiate between Device and Normal Non-cacheable
> memory? Something like KVM_MEMSLOT_DEVICE?
> 
> Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
> pgprot_noncached() returns Strongly Ordered memory. For Normal
> Non-cacheable we used pgprot_writecombine (e.g. a video framebuffer).
> 
> Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to
> ensure that guest and host access it coherently (which would mean
> writecombine for ARM). That's similar naming to functions like
> dma_alloc_coherent() that return cacheable or non-cacheable memory based
> on what the device supports. Anyway, I'm not too bothered with the
> naming.
> 
One thing to keep in mind for (2) is that QEMU is likely to do things
like calling regular memcpy() on the memory region, so mapping it as
device memory, which would fault on unaligned accesses, may be a problem.
Ideally there is a memory type for the user space mapping which allows
such behavior while at the same time letting us guarantee that the
mapping is coherent with the guest mapping through the S2 attributes.

-Christoffer

^ permalink raw reply	[flat|nested] 102+ messages in thread
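
As a userspace illustration of the vma splitting discussed above (not the
in-kernel change itself): calling mprotect() on part of a mapping makes the
kernel split the vma so that the sub-range carries its own vm_page_prot.
The sketch below assumes Linux, and split_middle_page() is a hypothetical
name. Userspace cannot request a non-cacheable memory type this way, so
PROT_READ merely stands in for the attribute change.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map three anonymous pages and change the protection of the middle one.
 * Hypothetical helper for illustration only.
 * Returns 0 on success, -1 on failure.
 */
int split_middle_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    size_t len = 3 * (size_t)page;
    unsigned char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Only the middle page changes; the kernel splits the vma around it. */
    int ret = mprotect(base + page, (size_t)page, PROT_READ);

    if (ret == 0) {
        /* The outer pages keep their original, writable protection. */
        base[0] = 1;
        base[2 * page] = 1;
    }

    munmap(base, len);
    return ret;
}
```

While the mapping is live, /proc/self/maps would show it as three separate
entries, which is the same split-then-change-protection sequence that
mprotect_fixup() performs inside the kernel.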

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-20 10:01               ` Christoffer Dall
@ 2015-05-20 11:24                 ` Catalin Marinas
  -1 siblings, 0 replies; 102+ messages in thread
From: Catalin Marinas @ 2015-05-20 11:24 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, Marc Zyngier,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	m.smarduch

On Wed, May 20, 2015 at 11:01:27AM +0100, Christoffer Dall wrote:
> On Tue, May 19, 2015 at 12:18:54PM +0100, Catalin Marinas wrote:
> > On Tue, May 19, 2015 at 11:03:22AM +0100, Andrew Jones wrote:
> > > On Mon, May 18, 2015 at 04:53:03PM +0100, Catalin Marinas wrote:
> > > > I didn't have time to follow these threads in detail, but just to
> > > > recap my understanding, we have two main use-cases:
> > > > 
> > > > 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> > > > 2. Qemu emulating device DMA
> > > > 
> > > > For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> > > > this memory slot. The memory attributes in this case could be Device
> > > > because that's how the guest would normally map it. The
> > > > file_operations.mmap trick would work in this case but this means
> > > > expanding the KVM ABI beyond just an ioctl().
> > > > 
> > > > For (2), since Qemu is writing to the guest "RAM" (e.g. video
> > > > framebuffer allocated by the guest), I still think the simplest is to
> > > > tell the guest (via DT) that such device is cache coherent rather than
> > > > trying to remap the Qemu mapping as non-cacheable.
> > > 
> > > If we need a solution for (1), then I'd prefer that it work and be
> > > applied to (2) as well. Anyway, I'm still not 100% sure we can count on
> > > all guest types (bootloaders, different OSes) to listen to us. They may
> > > assume non-cacheable is typical and safe, and thus just do that always.
> > > We can certainly change some of those bootloaders and OSes, but probably
> > > not all of them.
> > 
> > That's fine by me. Once you get the vma splitting and attributes
> > changing done, I think you get the second one for free.
> > 
> > Do we want to differentiate between Device and Normal Non-cacheable
> > memory? Something like KVM_MEMSLOT_DEVICE?
> > 
> > Nitpick: I'm not sure whether "uncached" is clear enough. In Linux,
> > pgprot_noncached() returns Strongly Ordered memory. For Normal
> > Non-cacheable we used pgprot_writecombine (e.g. a video framebuffer).
> > 
> > Maybe something like KVM_MEMSLOT_COHERENT meaning a request to KVM to
> > ensure that guest and host access it coherently (which would mean
> > writecombine for ARM). That's similar naming to functions like
> > dma_alloc_coherent() that return cacheable or non-cacheable memory based
> > on what the device supports. Anyway, I'm not too bothered with the
> > naming.
> > 
> One thing to keep in mind for (2) is that QEMU is likely to do things
> like calling regular memcpy() on the memory region, so mapping it as
> device memory, which would fault on unaligned accesses, may be a problem.
> Ideally there is a memory type for the user space mapping which allows
> such behavior while at the same time letting us guarantee that the
> mapping is coherent with the guest mapping through the S2 attributes.

I agree, for (2) we need Normal memory (either cacheable or
non-cacheable, though as far as I can see the latter is more likely, as we
can't guarantee that the guest honours "dma-coherent" device properties).

-- 
Catalin

^ permalink raw reply	[flat|nested] 102+ messages in thread
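
To make the memcpy() concern concrete: Device memory requires naturally
aligned accesses, whereas memcpy() is free to pick whatever access sizes
and alignments it likes. A copy routine that would be safe on a Device
mapping has to look roughly like the sketch below. copy_words32() is a
hypothetical helper, not an existing QEMU or kernel function, and it only
models the alignment discipline on ordinary memory.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes using only naturally aligned 32-bit accesses.
 * Returns 0 on success, or -1 if dst, src or len is not a multiple of 4.
 */
int copy_words32(void *dst, const void *src, size_t len)
{
    /* Reject anything that would need a sub-word or unaligned access. */
    if (((uintptr_t)dst | (uintptr_t)src | len) & 3u)
        return -1;

    /* volatile keeps the compiler from merging or splitting the accesses. */
    volatile uint32_t *d = dst;
    const volatile uint32_t *s = src;

    for (size_t i = 0; i < len / 4; i++)
        d[i] = s[i];

    return 0;
}
```

Existing device models that call memcpy() on guest RAM give no such
guarantee, which is why a Normal (cacheable or non-cacheable) mapping is
the practical choice for case (2).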

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
  2015-05-15 17:04           ` Andrew Jones
@ 2015-05-21  2:29             ` Mario Smarduch
  -1 siblings, 0 replies; 102+ messages in thread
From: Mario Smarduch @ 2015-05-21  2:29 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	agraf, qemu-devel, pbonzini, j.fanguede, lersek, kvmarm,
	Christoffer Dall

On 05/15/2015 10:04 AM, Andrew Jones wrote:
> On Fri, May 15, 2015 at 08:02:59AM -0700, Christoffer Dall wrote:
>> On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
>>> On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
>>>> On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
>>>>> When S1 and S2 memory attributes combine wrt caching policy,
>>>>> non-cacheable types take precedence. If a guest maps a region as
>>>>> device memory, which KVM userspace is using to emulate the device
>>>>> using normal, cacheable memory, then we lose coherency. With
>>>>> KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
>>>>> regions are likely to be problematic. With this patch, as pages
>>>>> of these types of regions are faulted into the guest, not only do
>>>>> we flush the page's dcache, but we also change userspace's
>>>>> mapping to NC in order to maintain coherency.
>>>>>
>>>>> What if the guest doesn't do what we expect? While we can't
>>>>> force a guest to use cacheable memory, we can take advantage of
>>>>> the non-cacheable precedence, and force it to use non-cacheable.
>>>>> So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
>>>>> KVM_MEM_UNCACHED regions to force them to NC.
>>>>>
>>>>> We now have both guest and userspace on the same page (pun intended)
>>>>
>>>> I'd like to revisit the overall approach here.  Is doing non-cached
>>>> accesses in both the guest and host really the right thing to do here?
>>>
>>> I think so, but all ideas/approaches are still on the table. This is
>>> still an RFC.
>>>
>>>>
>>>> The semantics of the device becomes that it is cache coherent (because
>>>> QEMU is), and I think Marc argued that Linux/UEFI should simply be
>>>> adapted to handle whatever emulated devices we have as coherent.  I also
>>>> remember someone arguing that would be wrong (Peter?).
>>>
>>> I'm not really for quirking all devices in all guest types (AAVMF, Linux,
>>> other bootloaders, other OSes). Windows is unlikely to apply any quirks.
>>>
>>
>> Well my point was that if we're emulating a platform with coherent IO
>> memory for PCI devices that is something that the guest should work with
>> as such, but as Paolo explained it should always be safe for a guest to
>> assume non-coherent, so that doesn't work.
>>
>>>>
>>>> Finally, does this address all cache coherency issues with emulated
>>>> devices?  Some VOS guys had seen things still not working with this
>>>> approach, unsure why...  I'd like to avoid us merging this only to merge
>>>> a more complete solution in a few weeks which reverts this solution...
>>>
>>> I'm not sure (this is still an RFT too :-) We definitely would need to
>>> scatter some more memory_region_set_uncached() calls around QEMU first.
>>>
>>
>> It would be good if you could sync with the VOS guys and make sure your
>> patch set addresses their issues with the appropriate
>> memory_region_set_uncached() added to QEMU, and if it does not, some
>> vague idea why that falls outside of the scope of this patch set.  After
>> all, adding a USB controller to a VM is not that esoteric a use case,
>> is it?
> 
> I'll pull together a new version addressing all your comments, and also
> put some more time into making sure it'll work...
> 
> Jeremy, can you give me the qemu command line you're using for your tests?
> I'll do some experimenting with it.
> 
> Thanks,
> drew
> 

Hi Drew,

I just recently looked at these patches and am a little confused.

Where or how are the QEMU page tables changed to
non-cacheable?

I noticed the logical pfn is changed to non-cacheable.

- Mario

^ permalink raw reply	[flat|nested] 102+ messages in thread
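
The combining rule from the quoted commit message ("non-cacheable types
take precedence") can be modelled as taking the more restrictive of the
stage 1 and stage 2 types. This is a toy sketch: the enum and function
names are hypothetical, and it only distinguishes three coarse types,
ignoring inner/outer and write-through variants.

```c
/* Coarse memory types, ordered from least to most restrictive. */
enum mem_type {
    TYPE_NORMAL_WB = 0,   /* Normal, cacheable */
    TYPE_NORMAL_NC = 1,   /* Normal, non-cacheable */
    TYPE_DEVICE    = 2,   /* Device */
};

/* The effective type is the more restrictive of stage 1 and stage 2. */
enum mem_type combine_s1_s2(enum mem_type s1, enum mem_type s2)
{
    return s1 > s2 ? s1 : s2;
}
```

With a cacheable S2 mapping, a guest S1 Device mapping yields Device
accesses while userspace still goes through the cache, which is the
coherency loss being described. Forcing S2 to Normal non-cacheable means
even a cacheable guest S1 mapping ends up non-cacheable, matching a
userspace mapping that has been remapped as NC.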

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
  2015-05-21  2:29             ` Mario Smarduch
@ 2015-05-21 16:50               ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-21 16:50 UTC (permalink / raw)
  To: Mario Smarduch
  Cc: peter.maydell, ard.biesheuvel, marc.zyngier, catalin.marinas,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	Christoffer Dall

On Wed, May 20, 2015 at 07:29:28PM -0700, Mario Smarduch wrote:
> On 05/15/2015 10:04 AM, Andrew Jones wrote:
> > On Fri, May 15, 2015 at 08:02:59AM -0700, Christoffer Dall wrote:
> >> On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
> >>> On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
> >>>> On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> >>>>> When S1 and S2 memory attributes combine wrt caching policy,
> >>>>> non-cacheable types take precedence. If a guest maps a region as
> >>>>> device memory, which KVM userspace is using to emulate the device
> >>>>> using normal, cacheable memory, then we lose coherency. With
> >>>>> KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> >>>>> regions are likely to be problematic. With this patch, as pages
> >>>>> of these types of regions are faulted into the guest, not only do
> >>>>> we flush the page's dcache, but we also change userspace's
> >>>>> mapping to NC in order to maintain coherency.
> >>>>>
> >>>>> What if the guest doesn't do what we expect? While we can't
> >>>>> force a guest to use cacheable memory, we can take advantage of
> >>>>> the non-cacheable precedence, and force it to use non-cacheable.
> >>>>> So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> >>>>> KVM_MEM_UNCACHED regions to force them to NC.
> >>>>>
> >>>>> We now have both guest and userspace on the same page (pun intended)
> >>>>
> >>>> I'd like to revisit the overall approach here.  Is doing non-cached
> >>>> accesses in both the guest and host really the right thing to do here?
> >>>
> >>> I think so, but all ideas/approaches are still on the table. This is
> >>> still an RFC.
> >>>
> >>>>
> >>>> The semantics of the device becomes that it is cache coherent (because
> >>>> QEMU is), and I think Marc argued that Linux/UEFI should simply be
> >>>> adapted to handle whatever emulated devices we have as coherent.  I also
> >>>> remember someone arguing that would be wrong (Peter?).
> >>>
> >>> I'm not really for quirking all devices in all guest types (AAVMF, Linux,
> >>> other bootloaders, other OSes). Windows is unlikely to apply any quirks.
> >>>
> >>
> >> Well my point was that if we're emulating a platform with coherent IO
> >> memory for PCI devices that is something that the guest should work with
> >> as such, but as Paolo explained it should always be safe for a guest to
> >> assume non-coherent, so that doesn't work.
> >>
> >>>>
> >>>> Finally, does this address all cache coherency issues with emulated
> >>>> devices?  Some VOS guys had seen things still not working with this
> >>>> approach, unsure why...  I'd like to avoid us merging this only to merge
> >>>> a more complete solution in a few weeks which reverts this solution...
> >>>
> >>> I'm not sure (this is still an RFT too :-) We definitely would need to
> >>> scatter some more memory_region_set_uncached() calls around QEMU first.
> >>>
> >>
> >> It would be good if you could sync with the VOS guys and make sure your
> >> patch set addresses their issues with the appropriate
> >> memory_region_set_uncached() added to QEMU, and if it does not, some
> >> vague idea why that falls outside of the scope of this patch set.  After
> >> all, adding a USB controller to a VM is not that esoteric a use case,
> >> is it?
> > 
> > I'll pull together a new version addressing all your comments, and also
> > put some more time into making sure it'll work...
> > 
> > Jeremy, can you give me the qemu command line you're using for your tests?
> > I'll do some experimenting with it.
> > 
> > Thanks,
> > drew
> > 
> 
> Hi Drew,
> 
> I just recently looked at these patches and am a little confused.
> 
> Where or how are the QEMU page tables changed to
> non-cacheable?
> 
> I noticed the logical pfn is changed to non-cacheable.

Right, this series is broken. Please stay tuned, I'll fix it the
next time around.

Thanks for reviewing.

drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 3/3] arm/arm64: KVM: implement 'uncached' mem coherency
@ 2015-05-21 16:50               ` Andrew Jones
  0 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-21 16:50 UTC (permalink / raw)
  To: Mario Smarduch
  Cc: ard.biesheuvel, marc.zyngier, catalin.marinas, qemu-devel,
	pbonzini, lersek, kvmarm

On Wed, May 20, 2015 at 07:29:28PM -0700, Mario Smarduch wrote:
> On 05/15/2015 10:04 AM, Andrew Jones wrote:
> > On Fri, May 15, 2015 at 08:02:59AM -0700, Christoffer Dall wrote:
> >> On Thu, May 14, 2015 at 03:32:13PM +0200, Andrew Jones wrote:
> >>> On Thu, May 14, 2015 at 12:55:49PM +0200, Christoffer Dall wrote:
> >>>> On Wed, May 13, 2015 at 01:31:54PM +0200, Andrew Jones wrote:
> >>>>> When S1 and S2 memory attributes combine wrt to caching policy,
> >>>>> non-cacheable types take precedence. If a guest maps a region as
> >>>>> device memory, which KVM userspace is using to emulate the device
> >>>>> using normal, cacheable memory, then we lose coherency. With
> >>>>> KVM_MEM_UNCACHED, KVM userspace can now hint to KVM which memory
> >>>>> regions are likely to be problematic. With this patch, as pages
> >>>>> of these types of regions are faulted into the guest, not only do
> >>>>> we flush the page's dcache, but we also change userspace's
> >>>>> mapping to NC in order to maintain coherency.
> >>>>>
> >>>>> What if the guest doesn't do what we expect? While we can't
> >>>>> force a guest to use cacheable memory, we can take advantage of
> >>>>> the non-cacheable precedence, and force it to use non-cacheable.
> >>>>> So, this patch also introduces PAGE_S2_NORMAL_NC, and uses it on
> >>>>> KVM_MEM_UNCACHED regions to force them to NC.
> >>>>>
> >>>>> We now have both guest and userspace on the same page (pun intended)
> >>>>
> >>>> I'd like to revisit the overall approach here.  Is doing non-cached
> >>>> accesses in both the guest and host really the right thing to do here?
> >>>
> >>> I think so, but all ideas/approaches are still on the table. This is
> >>> still an RFC.
> >>>
> >>>>
> >>>> The semantics of the device becomes that it is cache coherent (because
> >>>> QEMU is), and I think Marc argued that Linux/UEFI should simply be
> >>>> adapted to handle whatever emulated devices we have as coherent.  I also
> >>>> remember someone arguing that would be wrong (Peter?).
> >>>
> >>> I'm not really for quirking all devices in all guest types (AAVMF, Linux,
> >>> other bootloaders, other OSes). Windows is unlikely to apply any quirks.
> >>>
> >>
> >> Well my point was that if we're emulating a platform with coherent IO
> >> memory for PCI devices that is something that the guest should work with
> >> as such, but as Paolo explained it should always be safe for a guest to
> >> assume non-coherent, so that doesn't work.
> >>
> >>>>
> >>>> Finally, does this address all cache coherency issues with emulated
> >>>> devices?  Some VOS guys had seen things still not working with this
> >>>> approach, unsure why...  I'd like to avoid us merging this only to merge
> >>>> a more complete solution in a few weeks which reverts this solution...
> >>>
> >>> I'm not sure (this is still an RFT too :-) We definitely would need to
> >>> scatter some more memory_region_set_uncached() calls around QEMU first.
> >>>
> >>
> >> It would be good if you could sync with the VOS guys and make sure your
> >> patch set addresses their issues with the appropriate
> >> memory_region_set_uncached() added to QEMU, and if it does not, some
> >> vague idea why that falls outside of the scope of this patch set.  After
> >> all, adding a USB controller to a VM is not that an esoteric use case,
> >> is it?
> > 
> > I'll pull together a new version addressing all your comments, and also
> > put some more time into making sure it'll work...
> > 
> > Jeremy, can you give me the qemu command line you're using for your tests?
> > I'll do some experimenting with it.
> > 
> > Thanks,
> > drew
> > 
> 
> Hi Drew,
> 
> I just recently looked at these patches and am a little confused.
> 
> Where or how are the QEMU page tables changed to
> non-cacheable?
> 
> I noticed the logical pfn is changed to non-cacheable.

Right, this series is broken. Please stay tuned, I'll fix it the
next time around.

Thanks for reviewing.

drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-18 15:53         ` Catalin Marinas
@ 2015-05-23  1:08           ` Mario Smarduch
  -1 siblings, 0 replies; 102+ messages in thread
From: Mario Smarduch @ 2015-05-23  1:08 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: peter.maydell, Andrew Jones, ard.biesheuvel, Marc Zyngier,
	qemu-devel, agraf, pbonzini, j.fanguede, lersek, kvmarm,
	Christoffer Dall

On 05/18/2015 08:53 AM, Catalin Marinas wrote:
> On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
>> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
>>> On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
>>>> Provide a method to change normal, cacheable memory to non-cacheable.
>>>> KVM will make use of this to keep emulated device memory regions
>>>> coherent with the guest.
>>>>
>>>> Signed-off-by: Andrew Jones <drjones@redhat.com>
>>>
>>> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
>>>
>>> But you obviously need Russell and Will/Catalin to ack/merge this.
>>
>> I guess this patch is going to go away in the next round. You've
>> pointed out that I screwed stuff up royally with my over eagerness
>> to reuse code. I need to reimplement change_memory_common, but a
>> version that takes an mm, which is more or less what I did in the
>> last version of this series, back when I was pinning pages.
> 
> I kept wondering what this patch and the next one are doing with
> set_memory_nc(). Basically you were trying to set the cache attributes
> for the kernel linear mapping or kmap address (in the 32-bit arm case,
> which is removed anyway on kunmap).
> 
> What you need is changing the attributes of the user mapping as accessed
> by Qemu but I don't think simply re-implementing change_memory_common()
> would work, you probably need to pin the pages in memory as well.
> Otherwise, the kernel may remove such pages and, when bringing them
> back, would set the default cacheability attributes.
> 
> Another way would be to split the vma containing the non-cacheable
> memory so that you get a single vma with the vm_page_prot as
> Non-cacheable.
> 
> Yet another approach could be for KVM to mmap the necessary memory for
> Qemu via a file_operations.mmap call (but that's only for ranges outside
> the guest "RAM").

I think this option, with a basic loadable driver
that allocates non-cacheable/pinned pages for QEMU to mmap(),
may provide a reference point to build on. If that covers
all the cases, then perhaps move to a more generic solution. This
should be quicker to implement and test.

I wonder if the kernel mm will ever have a reason
to create a cacheable mapping even if the pages are pinned?
Like reading /dev/mem, although that's not a likely case here.

- Mario

> 
> I didn't have time to follow these threads in details, but just to
> recap my understanding, we have two main use-cases:
> 
> 1. Qemu handling guest I/O to device (e.g. PCIe BARs)
> 2. Qemu emulating device DMA
> 
> For (1), I guess Qemu uses an anonymous mmap() and then tells KVM about
> this memory slot. The memory attributes in this case could be Device
> because that's how the guest would normally map it. The
> file_operations.mmap trick would work in this case but this means
> expanding the KVM ABI beyond just an ioctl().
> 
> For (2), since Qemu is writing to the guest "RAM" (e.g. video
> framebuffer allocated by the guest), I still think the simplest is to
> tell the guest (via DT) that such device is cache coherent rather than
> trying to remap the Qemu mapping as non-cacheable.
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-23  1:08           ` Mario Smarduch
@ 2015-05-25 17:11             ` Andrew Jones
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Jones @ 2015-05-25 17:11 UTC (permalink / raw)
  To: Mario Smarduch
  Cc: peter.maydell, ard.biesheuvel, Marc Zyngier, Catalin Marinas,
	agraf, qemu-devel, pbonzini, j.fanguede, lersek, kvmarm,
	Christoffer Dall

On Fri, May 22, 2015 at 06:08:30PM -0700, Mario Smarduch wrote:
> On 05/18/2015 08:53 AM, Catalin Marinas wrote:
> > On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
> >> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
> >>> On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
> >>>> Provide a method to change normal, cacheable memory to non-cacheable.
> >>>> KVM will make use of this to keep emulated device memory regions
> >>>> coherent with the guest.
> >>>>
> >>>> Signed-off-by: Andrew Jones <drjones@redhat.com>
> >>>
> >>> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
> >>>
> >>> But you obviously need Russell and Will/Catalin to ack/merge this.
> >>
> >> I guess this patch is going to go away in the next round. You've
> >> pointed out that I screwed stuff up royally with my over eagerness
> >> to reuse code. I need to reimplement change_memory_common, but a
> >> version that takes an mm, which is more or less what I did in the
> >> last version of this series, back when I was pinning pages.
> > 
> > I kept wondering what this patch and the next one are doing with
> > set_memory_nc(). Basically you were trying to set the cache attributes
> > for the kernel linear mapping or kmap address (in the 32-bit arm case,
> > which is removed anyway on kunmap).
> > 
> > What you need is changing the attributes of the user mapping as accessed
> > by Qemu but I don't think simply re-implementing change_memory_common()
> > would work, you probably need to pin the pages in memory as well.
> > Otherwise, the kernel may remove such pages and, when bringing them
> > back, would set the default cacheability attributes.
> > 
> > Another way would be to split the vma containing the non-cacheable
> > memory so that you get a single vma with the vm_page_prot as
> > Non-cacheable.
> > 
> > Yet another approach could be for KVM to mmap the necessary memory for
> > Qemu via a file_operations.mmap call (but that's only for ranges outside
> > the guest "RAM").
> 
> I think this option, with a basic loadable driver
> that allocates non-cacheable/pinned pages for QEMU to mmap(),
> may provide a reference point to build on. If that covers
> all the cases, then perhaps move to a more generic solution. This
> should be quicker to implement and test.

I've pulled together a different approach for experimentation. I added a
new mmap/mprotect prot flag, like what was done for the powerpc SAO bit
(see commit b845f313d78e4e "mm: Allow architectures to define additional
protection bits" and commit ef3d3246a0d06 "powerpc/mm: Add Strong Access
Ordering support"). So far I've only tested with a simple test program that
forks and messes around with mapping shared memory in different ways. With
some cache flushing added to set_pte_at on memattr changes, it seems to
work pretty well.

Now I'll start experimenting with QEMU again to see if the "map QEMU's
memory as noncacheable" approach makes any sense at all. If it does,
then I'm not sure we want to do it with mprotect, but I could clean up
the patches and post them for discussion. The main problems I see with it
are the need to define a new PROT_ flag, and the fact that it might not
be a good idea for userspace to have this capability in the first place,
at least not for MAP_SHARED regions.

> 
> I wonder if kernel mm will ever have a reason
> to create a cacheable mapping even if the pages are pinned?
> Like reading /dev/mem although that's not a likely case here.

Actually, for a version with pinned pages, I think this one
http://www.spinics.net/lists/kvm-arm/msg14021.html
is a good start. There are some problems with it, but I can fix
them in order for it to be useful for experimenting with QEMU. It
suffers from the Device-nGnRnE issue Peter and Alex pointed out, and
I also see that I'm using the wrong address in there for the dcache
flush, because I'm using kvm_flush_dcache_pte(*ptep) in
set_page_uncached, which ends up using page_address(page). I should
just do __flush_dcache_area(addr, PAGE_SIZE) instead.

Thanks,
drew

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH v2 1/3] arm/arm64: pageattr: add set_memory_nc
  2015-05-25 17:11             ` Andrew Jones
@ 2015-05-27  1:08               ` Mario Smarduch
  -1 siblings, 0 replies; 102+ messages in thread
From: Mario Smarduch @ 2015-05-27  1:08 UTC (permalink / raw)
  To: Andrew Jones
  Cc: peter.maydell, ard.biesheuvel, Marc Zyngier, Catalin Marinas,
	agraf, qemu-devel, pbonzini, j.fanguede, lersek, kvmarm,
	Christoffer Dall

On 05/25/2015 10:11 AM, Andrew Jones wrote:
> On Fri, May 22, 2015 at 06:08:30PM -0700, Mario Smarduch wrote:
>> On 05/18/2015 08:53 AM, Catalin Marinas wrote:
>>> On Thu, May 14, 2015 at 02:46:44PM +0100, Andrew Jones wrote:
>>>> On Thu, May 14, 2015 at 01:05:09PM +0200, Christoffer Dall wrote:
>>>>> On Wed, May 13, 2015 at 01:31:52PM +0200, Andrew Jones wrote:
>>>>>> Provide a method to change normal, cacheable memory to non-cacheable.
>>>>>> KVM will make use of this to keep emulated device memory regions
>>>>>> coherent with the guest.
>>>>>>
>>>>>> Signed-off-by: Andrew Jones <drjones@redhat.com>
>>>>>
>>>>> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
>>>>>
>>>>> But you obviously need Russell and Will/Catalin to ack/merge this.
>>>>
>>>> I guess this patch is going to go away in the next round. You've
>>>> pointed out that I screwed stuff up royally with my over eagerness
>>>> to reuse code. I need to reimplement change_memory_common, but a
>>>> version that takes an mm, which is more or less what I did in the
>>>> last version of this series, back when I was pinning pages.
>>>
>>> I kept wondering what this patch and the next one are doing with
>>> set_memory_nc(). Basically you were trying to set the cache attributes
>>> for the kernel linear mapping or kmap address (in the 32-bit arm case,
>>> which is removed anyway on kunmap).
>>>
>>> What you need is changing the attributes of the user mapping as accessed
>>> by Qemu but I don't think simply re-implementing change_memory_common()
>>> would work, you probably need to pin the pages in memory as well.
>>> Otherwise, the kernel may remove such pages and, when bringing them
>>> back, would set the default cacheability attributes.
>>>
>>> Another way would be to split the vma containing the non-cacheable
>>> memory so that you get a single vma with the vm_page_prot as
>>> Non-cacheable.
>>>
>>> Yet another approach could be for KVM to mmap the necessary memory for
>>> Qemu via a file_operations.mmap call (but that's only for ranges outside
>>> the guest "RAM").
>>
>> I think this option, with a basic loadable driver
>> that allocates non-cacheable/pinned pages for QEMU to mmap(),
>> may provide a reference point to build on. If that covers
>> all the cases, then perhaps move to a more generic solution. This
>> should be quicker to implement and test.
> 
> I've pulled together a different approach for experimentation. I added a
> new mmap/mprotect prot flag, like what was done for the powerpc SAO bit
> (see commit b845f313d78e4e "mm: Allow architectures to define additional
> protection bits" and commit ef3d3246a0d06 "powerpc/mm: Add Strong Access
> Ordering support"). So far I've only tested with a simple test program that
> forks and messes around with mapping shared memory in different ways. With
> some cache flushing added to set_pte_at on memattr changes, it seems to
> work pretty well.

Thanks for the pointer, it's pretty deep into generic mmap_region()
code. Does SAO apply to regular memory or MMIO regions?

> 
> Now I'll start experimenting with QEMU again to see if the "map QEMU's
> memory as noncacheable" approach makes any sense at all. If it does,
> then I'm not sure we want to do it with mprotect, but I could clean up
> the patches and post them for discussion. The main problems I see with it
> are the need to define a new PROT_ flag, and the fact that it might not
> be a good idea for userspace to have this capability in the first place,
> at least not for MAP_SHARED regions.

That seems like a pretty significant incision into the generic kernel.
It appears the patches below do the same thing without adding
arch-specific vma protection extensions. You would need
to lock the region's pages, right? And also flush the TLBs
after locking. It appears to make the general mmap() interface
unique for this case.

> 
>>
>> I wonder if kernel mm will ever have a reason
>> to create a cacheable mapping even if the pages are pinned?
>> Like reading /dev/mem although that's not a likely case here.
> 
> Actually, for a version with pinned pages, I think this one
> http://www.spinics.net/lists/kvm-arm/msg14021.html
> is a good start. There are some problems with it, but I can fix
> them in order for it to be useful for experimenting with QEMU. It
> suffers from the Device-nGnRnE issue Peter and Alex pointed out, and
> I also see that I'm using the wrong address in there for the dcache
> flush, because I'm using kvm_flush_dcache_pte(*ptep) in
> set_page_uncached, which ends up using page_address(page). I should
> just do __flush_dcache_area(addr, PAGE_SIZE) instead.

Sorry, I missed this series. It appears clean, with no excessive flushing,
given the few additional modifications you pointed out. Instead of
flush_tlb_kernel_range(), maybe use flush_tlb_all(); I'm not sure
vaae1is flushes any other alias mappings to the same pfn.

Thanks,
- Mario

> 
> Thanks,
> drew
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread
