* [PATCH 0/4] arm/arm64: KVM: Random selection of MM related fixes
@ 2015-01-08 11:59 Marc Zyngier
  2015-01-08 11:59 ` [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify Marc Zyngier
                   ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Marc Zyngier @ 2015-01-08 11:59 UTC (permalink / raw)
  To: Christoffer Dall, kvm, kvmarm; +Cc: Steve Capper

This small series fixes a number of issues that Christoffer and I have
been trying to nail down for a while, having to do with the host dying
under load (swapping), and with the way we deal with caches in
general (and with set/way operations in particular):

- The first patch, courtesy of Steve Capper, fixes a long-standing
  buglet in one of the MMU notifiers.

- The second one changes the way we handle cache ops by set/way,
  basically turning them into VA ops for the whole memory. This allows
  platforms with system caches to boot a 32bit zImage, for example.

- The third one fixes a corner case that could happen if the guest
  used an uncached mapping (or had its caches off) while the host was
  swapping it out (and using a cache-coherent IO subsystem).

- Finally, the last one fixes the stability issue seen when the host is
  swapping, by using a kernel mapping for cache maintenance instead of
  the userspace one.

With these patches (and both the TLB invalidation fix I posted before
Xmas and the HCR fix posted yesterday), the APM platform seems much
more robust than it was. Fingers crossed.

Based on 3.19-rc3, tested on Juno, X-Gene and Cubietruck.

Also at git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git kvm-arm64/mm-fixes-3.19

Marc Zyngier (3):
  arm/arm64: KVM: Use set/way op trapping to track the state of the
    caches
  arm/arm64: KVM: Flush caches to memory on unmap
  arm/arm64: KVM: use kernel mapping to perform invalidation on page
    fault

Steve Capper (1):
  mm: Correct ordering of *_clear_flush_young_notify

 arch/arm/include/asm/kvm_host.h   |   3 --
 arch/arm/include/asm/kvm_mmu.h    |  68 +++++++++++++++++++++----
 arch/arm/kvm/arm.c                |  10 ----
 arch/arm/kvm/coproc.c             |  90 +++++++++++++++++++++------------
 arch/arm/kvm/mmu.c                |  58 +++++++++++++++-------
 arch/arm64/include/asm/kvm_host.h |   3 --
 arch/arm64/include/asm/kvm_mmu.h  |  31 ++++++++++--
 arch/arm64/kvm/sys_regs.c         | 102 ++++++++++++++++++++++----------------
 include/linux/mmu_notifier.h      |   8 +--
 9 files changed, 244 insertions(+), 129 deletions(-)

-- 
2.1.4



* [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify
  2015-01-08 11:59 [PATCH 0/4] arm/arm64: KVM: Random selection of MM related fixes Marc Zyngier
@ 2015-01-08 11:59 ` Marc Zyngier
  2015-01-08 13:12   ` Paolo Bonzini
  2015-01-08 19:00   ` Andrea Arcangeli
  2015-01-08 11:59 ` [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches Marc Zyngier
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 45+ messages in thread
From: Marc Zyngier @ 2015-01-08 11:59 UTC (permalink / raw)
  To: Christoffer Dall, kvm, kvmarm; +Cc: Steve Capper

From: Steve Capper <steve.capper@linaro.org>

ptep_clear_flush_young_notify and pmdp_clear_flush_young_notify both
call the notifiers *after* the pte/pmd has been made old.

This can cause problems with KVM that relies on being able to block
MMU notifiers when carrying out maintenance of second stage
descriptors.

This patch ensures that the MMU notifiers are called before ptes and
pmds are made old.
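
Illustrated as a plain C helper (a hedged sketch only -- the real change
is the macro diff below; the wrapper function is invented, the two callees
are the actual kernel interfaces):

static inline int clear_flush_young_notify_sketch(struct vm_area_struct *vma,
						  unsigned long address,
						  pte_t *ptep)
{
	int young;

	/* Tell the MMU notifier first, so KVM can serialize this against
	 * its own stage-2 maintenance... */
	young = mmu_notifier_clear_flush_young(vma->vm_mm, address,
					       address + PAGE_SIZE);

	/* ...and only then clear the hardware young bit. */
	young |= ptep_clear_flush_young(vma, address, ptep);

	return young;
}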

Signed-off-by: Steve Capper <steve.capper@linaro.org>
---
 include/linux/mmu_notifier.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 95243d2..c454c76 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -290,11 +290,11 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	int __young;							\
 	struct vm_area_struct *___vma = __vma;				\
 	unsigned long ___address = __address;				\
-	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
-	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+	__young = mmu_notifier_clear_flush_young(___vma->vm_mm,		\
 						  ___address,		\
 						  ___address +		\
 							PAGE_SIZE);	\
+	__young |= ptep_clear_flush_young(___vma, ___address, __ptep);	\
 	__young;							\
 })
 
@@ -303,11 +303,11 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	int __young;							\
 	struct vm_area_struct *___vma = __vma;				\
 	unsigned long ___address = __address;				\
-	__young = pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
-	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+	__young = mmu_notifier_clear_flush_young(___vma->vm_mm,		\
 						  ___address,		\
 						  ___address +		\
 							PMD_SIZE);	\
+	__young |= pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
 	__young;							\
 })
 
-- 
2.1.4



* [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
  2015-01-08 11:59 [PATCH 0/4] arm/arm64: KVM: Random selection of MM related fixes Marc Zyngier
  2015-01-08 11:59 ` [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify Marc Zyngier
@ 2015-01-08 11:59 ` Marc Zyngier
  2015-01-09 11:19   ` Christoffer Dall
  2015-01-08 11:59 ` [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap Marc Zyngier
  2015-01-08 11:59 ` [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault Marc Zyngier
  3 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-08 11:59 UTC (permalink / raw)
  To: Christoffer Dall, kvm, kvmarm; +Cc: Steve Capper

Trying to emulate the behaviour of set/way cache ops is fairly
pointless, as there are too many ways we can end up missing stuff.
Also, there are system caches out there that simply ignore
set/way operations.

So instead of trying to implement them, let's convert them to VA ops,
and use them as a way to re-enable the trapping of VM ops. That way,
we can detect the point when the MMU/caches are turned off, and do
a full VM flush (which is what the guest was trying to do anyway).

This allows a 32bit zImage to boot on the APM thingy, and will
probably help bootloaders in general.
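
Distilled into a hedged sketch (32bit flavour; the real handlers are
access_dcsw() and access_sctlr() in the diff below, and trapping_vm_regs()
is an invented helper standing for "HCR.TVM is set"):

static bool trapping_vm_regs(struct kvm_vcpu *vcpu)
{
	return vcpu->arch.hcr & HCR_TVM;	/* hcr_el2 on arm64 */
}

/* Any set/way cache op trapped from the guest. */
static void handle_set_way_op(struct kvm_vcpu *vcpu)
{
	/*
	 * First S/W op seen while the caches are still off: the guest is
	 * cleaning before turning the MMU on, so flush the whole VM now.
	 */
	if (!trapping_vm_regs(vcpu) && !vcpu_has_cache_enabled(vcpu))
		stage2_flush_vm(vcpu->kvm);

	/* Trap VM registers from now on to catch the MMU/cache toggle. */
	vcpu->arch.hcr |= HCR_TVM;
}

/* Trapped write to SCTLR. */
static void handle_sctlr_write(struct kvm_vcpu *vcpu, bool was_enabled)
{
	bool now_enabled = vcpu_has_cache_enabled(vcpu);

	/* Clean+invalidate by VA covers both the on->off and off->on cases. */
	if (now_enabled != was_enabled)
		stage2_flush_vm(vcpu->kvm);

	/* Caches are back on: stop trapping until the next S/W op. */
	if (now_enabled)
		vcpu->arch.hcr &= ~HCR_TVM;
}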

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_host.h   |   3 --
 arch/arm/kvm/arm.c                |  10 ----
 arch/arm/kvm/coproc.c             |  90 +++++++++++++++++++++------------
 arch/arm64/include/asm/kvm_host.h |   3 --
 arch/arm64/kvm/sys_regs.c         | 102 ++++++++++++++++++++++----------------
 5 files changed, 116 insertions(+), 92 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 254e065..04b4ea0 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
 	 * Anything that is not used directly from assembly code goes
 	 * here.
 	 */
-	/* dcache set/way operation pending */
-	int last_pcpu;
-	cpumask_t require_dcache_flush;
 
 	/* Don't run the guest on this vcpu */
 	bool pause;
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 2d6d910..0b0d58a 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vcpu->cpu = cpu;
 	vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
 
-	/*
-	 * Check whether this vcpu requires the cache to be flushed on
-	 * this physical CPU. This is a consequence of doing dcache
-	 * operations by set/way on this vcpu. We do it here to be in
-	 * a non-preemptible section.
-	 */
-	if (cpumask_test_and_clear_cpu(cpu, &vcpu->arch.require_dcache_flush))
-		flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
-
 	kvm_arm_set_running_vcpu(vcpu);
 }
 
@@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
 
 		vcpu->mode = OUTSIDE_GUEST_MODE;
-		vcpu->arch.last_pcpu = smp_processor_id();
 		kvm_guest_exit();
 		trace_kvm_exit(*vcpu_pc(vcpu));
 		/*
diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
index 7928dbd..3d352a5 100644
--- a/arch/arm/kvm/coproc.c
+++ b/arch/arm/kvm/coproc.c
@@ -189,43 +189,57 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
 	return true;
 }
 
-/* See note at ARM ARM B1.14.4 */
+/*
+ * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
+ *
+ * Main problems:
+ * - S/W ops are local to a CPU (not broadcast)
+ * - We have line migration behind our back (speculation)
+ * - System caches don't support S/W at all (damn!)
+ *
+ * In the face of the above, the best we can do is to try and convert
+ * S/W ops to VA ops. Because the guest is not allowed to infer the
+ * S/W to PA mapping, it can only use S/W to nuke the whole cache,
+ * which is a rather good thing for us.
+ *
+ * Also, it is only used when turning caches on/off ("The expected
+ * usage of the cache maintenance instructions that operate by set/way
+ * is associated with the cache maintenance instructions associated
+ * with the powerdown and powerup of caches, if this is required by
+ * the implementation.").
+ *
+ * We use the following policy:
+ *
+ * - If we trap a S/W operation, we enable VM trapping to detect
+ *   caches being turned on/off.
+ *
+ * - If the caches have already been turned off when doing the S/W op,
+ *   we nuke the whole VM cache.
+ *
+ * - We flush the cache on both caches being turned on and off.
+ *
+ * - Once the caches are enabled, we stop trapping VM ops.
+ */
 static bool access_dcsw(struct kvm_vcpu *vcpu,
 			const struct coproc_params *p,
 			const struct coproc_reg *r)
 {
-	unsigned long val;
-	int cpu;
-
 	if (!p->is_write)
 		return read_from_write_only(vcpu, p);
 
-	cpu = get_cpu();
-
-	cpumask_setall(&vcpu->arch.require_dcache_flush);
-	cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
-
-	/* If we were already preempted, take the long way around */
-	if (cpu != vcpu->arch.last_pcpu) {
-		flush_cache_all();
-		goto done;
-	}
-
-	val = *vcpu_reg(vcpu, p->Rt1);
-
-	switch (p->CRm) {
-	case 6:			/* Upgrade DCISW to DCCISW, as per HCR.SWIO */
-	case 14:		/* DCCISW */
-		asm volatile("mcr p15, 0, %0, c7, c14, 2" : : "r" (val));
-		break;
-
-	case 10:		/* DCCSW */
-		asm volatile("mcr p15, 0, %0, c7, c10, 2" : : "r" (val));
-		break;
-	}
+	/*
+	 * If this is the first time we do a S/W operation
+	 * (i.e. HCT_TVM not set) and the caches are already disabled,
+	 * flush the whole memory.
+	 *
+	 * Otherwise, don't do anything. Just wait for the MMU +
+	 * Caches to be turned off, and use that to actually clean the
+	 * caches.
+	 */
+	if (!(vcpu->arch.hcr & HCR_TVM) && !vcpu_has_cache_enabled(vcpu))
+		stage2_flush_vm(vcpu->kvm);
 
-done:
-	put_cpu();
+	vcpu->arch.hcr |= HCR_TVM;
 
 	return true;
 }
@@ -258,12 +272,24 @@ bool access_sctlr(struct kvm_vcpu *vcpu,
 		  const struct coproc_params *p,
 		  const struct coproc_reg *r)
 {
+	bool was_enabled = vcpu_has_cache_enabled(vcpu);
+	bool now_enabled;
+
 	access_vm_reg(vcpu, p, r);
 
-	if (vcpu_has_cache_enabled(vcpu)) {	/* MMU+Caches enabled? */
-		vcpu->arch.hcr &= ~HCR_TVM;
+	now_enabled = vcpu_has_cache_enabled(vcpu);
+
+	/*
+	 * If switching the MMU+caches on, need to invalidate the caches.
+	 * If switching it off, need to clean the caches.
+	 * Clean + invalidate does the trick always.
+	 */
+	if (now_enabled != was_enabled)
 		stage2_flush_vm(vcpu->kvm);
-	}
+
+	/* Caches are now on, stop trapping VM ops (until a S/W op) */
+	if (now_enabled)
+		vcpu->arch.hcr &= ~HCR_TVM;
 
 	return true;
 }
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 0b7dfdb..acd101a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -116,9 +116,6 @@ struct kvm_vcpu_arch {
 	 * Anything that is not used directly from assembly code goes
 	 * here.
 	 */
-	/* dcache set/way operation pending */
-	int last_pcpu;
-	cpumask_t require_dcache_flush;
 
 	/* Don't run the guest */
 	bool pause;
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 3d7c2df..5c6540b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -69,55 +69,57 @@ static u32 get_ccsidr(u32 csselr)
 	return ccsidr;
 }
 
-static void do_dc_cisw(u32 val)
-{
-	asm volatile("dc cisw, %x0" : : "r" (val));
-	dsb(ish);
-}
-
-static void do_dc_csw(u32 val)
-{
-	asm volatile("dc csw, %x0" : : "r" (val));
-	dsb(ish);
-}
-
-/* See note at ARM ARM B1.14.4 */
+/*
+ * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
+ *
+ * Main problems:
+ * - S/W ops are local to a CPU (not broadcast)
+ * - We have line migration behind our back (speculation)
+ * - System caches don't support S/W at all (damn!)
+ *
+ * In the face of the above, the best we can do is to try and convert
+ * S/W ops to VA ops. Because the guest is not allowed to infer the
+ * S/W to PA mapping, it can only use S/W to nuke the whole cache,
+ * which is a rather good thing for us.
+ *
+ * Also, it is only used when turning caches on/off ("The expected
+ * usage of the cache maintenance instructions that operate by set/way
+ * is associated with the cache maintenance instructions associated
+ * with the powerdown and powerup of caches, if this is required by
+ * the implementation.").
+ *
+ * We use the following policy:
+ *
+ * - If we trap a S/W operation, we enable VM trapping to detect
+ *   caches being turned on/off.
+ *
+ * - If the caches have already been turned off when doing the S/W op,
+ *   we nuke the whole VM cache.
+ *
+ * - We flush the cache on both caches being turned on and off.
+ *
+ * - Once the caches are enabled, we stop trapping VM ops.
+ */
 static bool access_dcsw(struct kvm_vcpu *vcpu,
 			const struct sys_reg_params *p,
 			const struct sys_reg_desc *r)
 {
-	unsigned long val;
-	int cpu;
-
 	if (!p->is_write)
 		return read_from_write_only(vcpu, p);
 
-	cpu = get_cpu();
-
-	cpumask_setall(&vcpu->arch.require_dcache_flush);
-	cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
-
-	/* If we were already preempted, take the long way around */
-	if (cpu != vcpu->arch.last_pcpu) {
-		flush_cache_all();
-		goto done;
-	}
-
-	val = *vcpu_reg(vcpu, p->Rt);
-
-	switch (p->CRm) {
-	case 6:			/* Upgrade DCISW to DCCISW, as per HCR.SWIO */
-	case 14:		/* DCCISW */
-		do_dc_cisw(val);
-		break;
-
-	case 10:		/* DCCSW */
-		do_dc_csw(val);
-		break;
-	}
+	/*
+	 * If this is the first time we do a S/W operation
+	 * (i.e. HCT_TVM not set) and the caches are already disabled,
+	 * flush the whole memory.
+	 *
+	 * Otherwise, don't do anything. Just wait for the MMU +
+	 * Caches to be turned off, and use that to actually clean the
+	 * caches.
+	 */
+	if (!(vcpu->arch.hcr_el2 & HCR_TVM) && !vcpu_has_cache_enabled(vcpu))
+		stage2_flush_vm(vcpu->kvm);
 
-done:
-	put_cpu();
+	vcpu->arch.hcr_el2 |= HCR_TVM;
 
 	return true;
 }
@@ -155,12 +157,24 @@ static bool access_sctlr(struct kvm_vcpu *vcpu,
 			 const struct sys_reg_params *p,
 			 const struct sys_reg_desc *r)
 {
+	bool was_enabled = vcpu_has_cache_enabled(vcpu);
+	bool now_enabled;
+
 	access_vm_reg(vcpu, p, r);
 
-	if (vcpu_has_cache_enabled(vcpu)) {	/* MMU+Caches enabled? */
-		vcpu->arch.hcr_el2 &= ~HCR_TVM;
+	now_enabled = vcpu_has_cache_enabled(vcpu);
+
+	/*
+	 * If switching the MMU+caches on, need to invalidate the caches.
+	 * If switching it off, need to clean the caches.
+	 * Clean + invalidate does the trick always.
+	 */
+	if (now_enabled != was_enabled)
 		stage2_flush_vm(vcpu->kvm);
-	}
+
+	/* Caches are now on, stop trapping VM ops (until a S/W op) */
+	if (now_enabled)
+		vcpu->arch.hcr_el2 &= ~HCR_TVM;
 
 	return true;
 }
-- 
2.1.4



* [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
  2015-01-08 11:59 [PATCH 0/4] arm/arm64: KVM: Random selection of MM related fixes Marc Zyngier
  2015-01-08 11:59 ` [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify Marc Zyngier
  2015-01-08 11:59 ` [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches Marc Zyngier
@ 2015-01-08 11:59 ` Marc Zyngier
  2015-01-09 12:30   ` Christoffer Dall
  2015-01-08 11:59 ` [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault Marc Zyngier
  3 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-08 11:59 UTC (permalink / raw)
  To: Christoffer Dall, kvm, kvmarm; +Cc: Steve Capper

Let's assume a guest has created an uncached mapping, and written
to that page. Let's also assume that the host uses a cache-coherent
IO subsystem. Let's finally assume that the host is under memory
pressure and starts to swap things out.

Before this "uncached" page is evicted, we need to make sure it
gets invalidated, or the IO subsystem is going to swap out the
cached view, losing the data that has been written there.
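
The core of the fix, restated as a hedged sketch (the real code is the
unmap_ptes() hunk below; stage2_unmap_pte() is an invented name used here
purely for illustration):

static void stage2_unmap_pte(struct kvm *kvm, phys_addr_t addr, pte_t *ptep)
{
	pte_t old_pte = *ptep;

	kvm_set_pte(ptep, __pte(0));
	kvm_tlb_flush_vmid_ipa(kvm, addr);

	/* Device mappings are never cacheable, nothing to flush there. */
	if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
		kvm_flush_dcache_pte(old_pte);	/* clean+invalidate to PoC */

	put_page(virt_to_page(ptep));
}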

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++
 arch/arm/kvm/mmu.c               | 46 +++++++++++++++++++++++++++-------------
 arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++
 3 files changed, 80 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 63e0ecc..7ceb836 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -44,6 +44,7 @@
 
 #ifndef __ASSEMBLY__
 
+#include <linux/highmem.h>
 #include <asm/cacheflush.h>
 #include <asm/pgalloc.h>
 
@@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
 
 #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))
 
+static inline void __kvm_flush_dcache_pte(pte_t pte)
+{
+	void *va = kmap_atomic(pte_page(pte));
+
+	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+
+	kunmap_atomic(va);
+}
+
+static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
+{
+	unsigned long size = PMD_SIZE;
+	pfn_t pfn = pmd_pfn(pmd);
+
+	while (size) {
+		void *va = kmap_atomic_pfn(pfn);
+
+		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+
+		pfn++;
+		size -= PAGE_SIZE;
+
+		kunmap_atomic(va);
+	}
+}
+
+static inline void __kvm_flush_dcache_pud(pud_t pud)
+{
+}
+
 void stage2_flush_vm(struct kvm *kvm);
 
 #endif	/* !__ASSEMBLY__ */
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 1dc9778..1f5b793 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
 }
 
+static void kvm_flush_dcache_pte(pte_t pte)
+{
+	__kvm_flush_dcache_pte(pte);
+}
+
+static void kvm_flush_dcache_pmd(pmd_t pmd)
+{
+	__kvm_flush_dcache_pmd(pmd);
+}
+
+static void kvm_flush_dcache_pud(pud_t pud)
+{
+	__kvm_flush_dcache_pud(pud);
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
 				  int min, int max)
 {
@@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
 	start_pte = pte = pte_offset_kernel(pmd, addr);
 	do {
 		if (!pte_none(*pte)) {
+			pte_t old_pte = *pte;
 			kvm_set_pte(pte, __pte(0));
-			put_page(virt_to_page(pte));
 			kvm_tlb_flush_vmid_ipa(kvm, addr);
+			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
+				kvm_flush_dcache_pte(old_pte);
+			put_page(virt_to_page(pte));
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
@@ -149,8 +167,10 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud,
 		next = kvm_pmd_addr_end(addr, end);
 		if (!pmd_none(*pmd)) {
 			if (kvm_pmd_huge(*pmd)) {
+				pmd_t old_pmd = *pmd;
 				pmd_clear(pmd);
 				kvm_tlb_flush_vmid_ipa(kvm, addr);
+				kvm_flush_dcache_pmd(old_pmd);
 				put_page(virt_to_page(pmd));
 			} else {
 				unmap_ptes(kvm, pmd, addr, next);
@@ -173,8 +193,10 @@ static void unmap_puds(struct kvm *kvm, pgd_t *pgd,
 		next = kvm_pud_addr_end(addr, end);
 		if (!pud_none(*pud)) {
 			if (pud_huge(*pud)) {
+				pud_t old_pud = *pud;
 				pud_clear(pud);
 				kvm_tlb_flush_vmid_ipa(kvm, addr);
+				kvm_flush_dcache_pud(old_pud);
 				put_page(virt_to_page(pud));
 			} else {
 				unmap_pmds(kvm, pud, addr, next);
@@ -209,10 +231,8 @@ static void stage2_flush_ptes(struct kvm *kvm, pmd_t *pmd,
 
 	pte = pte_offset_kernel(pmd, addr);
 	do {
-		if (!pte_none(*pte)) {
-			hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
-			kvm_flush_dcache_to_poc((void*)hva, PAGE_SIZE);
-		}
+		if (!pte_none(*pte))
+			kvm_flush_dcache_pte(*pte);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 }
 
@@ -226,12 +246,10 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
 	do {
 		next = kvm_pmd_addr_end(addr, end);
 		if (!pmd_none(*pmd)) {
-			if (kvm_pmd_huge(*pmd)) {
-				hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
-				kvm_flush_dcache_to_poc((void*)hva, PMD_SIZE);
-			} else {
+			if (kvm_pmd_huge(*pmd))
+				kvm_flush_dcache_pmd(*pmd);
+			else
 				stage2_flush_ptes(kvm, pmd, addr, next);
-			}
 		}
 	} while (pmd++, addr = next, addr != end);
 }
@@ -246,12 +264,10 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
 	do {
 		next = kvm_pud_addr_end(addr, end);
 		if (!pud_none(*pud)) {
-			if (pud_huge(*pud)) {
-				hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
-				kvm_flush_dcache_to_poc((void*)hva, PUD_SIZE);
-			} else {
+			if (pud_huge(*pud))
+				kvm_flush_dcache_pud(*pud);
+			else
 				stage2_flush_pmds(kvm, pud, addr, next);
-			}
 		}
 	} while (pud++, addr = next, addr != end);
 }
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 14a74f1..b7419f5 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -260,6 +260,24 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
 
 #define kvm_virt_to_phys(x)		__virt_to_phys((unsigned long)(x))
 
+static inline void __kvm_flush_dcache_pte(pte_t pte)
+{
+	struct page *page = pte_page(pte);
+	kvm_flush_dcache_to_poc(page_address(page), PAGE_SIZE);
+}
+
+static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
+{
+	struct page *page = pmd_page(pmd);
+	kvm_flush_dcache_to_poc(page_address(page), PMD_SIZE);
+}
+
+static inline void __kvm_flush_dcache_pud(pud_t pud)
+{
+	struct page *page = pud_page(pud);
+	kvm_flush_dcache_to_poc(page_address(page), PUD_SIZE);
+}
+
 void stage2_flush_vm(struct kvm *kvm);
 
 #endif /* __ASSEMBLY__ */
-- 
2.1.4



* [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-08 11:59 [PATCH 0/4] arm/arm64: KVM: Random selection of MM related fixes Marc Zyngier
                   ` (2 preceding siblings ...)
  2015-01-08 11:59 ` [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap Marc Zyngier
@ 2015-01-08 11:59 ` Marc Zyngier
  2015-01-08 12:30   ` Peter Maydell
  2015-01-09 12:51   ` Christoffer Dall
  3 siblings, 2 replies; 45+ messages in thread
From: Marc Zyngier @ 2015-01-08 11:59 UTC (permalink / raw)
  To: Christoffer Dall, kvm, kvmarm; +Cc: Steve Capper

When handling a fault in stage-2, we need to resync I$ and D$, just
to be sure we don't leave any old cache line behind.

That's very good, except that we do so using the *user* address.
Under heavy load (swapping like crazy), we may end up in a situation
where the page gets mapped in stage-2 while being unmapped from
userspace by another CPU.

At that point, the DC/IC instructions can generate a fault, which
we handle with kvm->mmu_lock held. The box quickly deadlocks, user
is unhappy.

Instead, perform this invalidation through the kernel mapping,
which is guaranteed to be present. The box is much happier, and so
am I.
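
The approach, as a hedged sketch (the real change is in
__coherent_cache_guest_page() below; the wrapper name here is invented):

static void clean_guest_pages_via_kernel_map(pfn_t pfn, unsigned long size)
{
	while (size) {
		/* A kernel mapping is always available, unlike the hva. */
		void *va = kmap_atomic_pfn(pfn);

		kvm_flush_dcache_to_poc(va, PAGE_SIZE);

		kunmap_atomic(va);
		pfn++;
		size -= PAGE_SIZE;
	}
}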

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h   | 37 ++++++++++++++++++++++++++++---------
 arch/arm/kvm/mmu.c               | 12 ++++++++----
 arch/arm64/include/asm/kvm_mmu.h | 13 ++++++++-----
 3 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 7ceb836..be6bc72 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -162,13 +162,10 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu->arch.cp15[c1_SCTLR] & 0b101) == 0b101;
 }
 
-static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
-					     unsigned long size,
-					     bool ipa_uncached)
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+				 	       unsigned long size,
+					       bool ipa_uncached)
 {
-	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
-		kvm_flush_dcache_to_poc((void *)hva, size);
-	
 	/*
 	 * If we are going to insert an instruction page and the icache is
 	 * either VIPT or PIPT, there is a potential problem where the host
@@ -180,10 +177,32 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
 	 *
 	 * VIVT caches are tagged using both the ASID and the VMID and doesn't
 	 * need any kind of flushing (DDI 0406C.b - Page B3-1392).
+	 *
+	 * We need to do this through a kernel mapping (using the
+	 * user-space mapping has proved to be the wrong
+	 * solution). For that, we need to kmap one page at a time,
+	 * and iterate over the range.
 	 */
-	if (icache_is_pipt()) {
-		__cpuc_coherent_user_range(hva, hva + size);
-	} else if (!icache_is_vivt_asid_tagged()) {
+
+	VM_BUG_ON(size & PAGE_MASK);
+
+	while (size) {
+		void *va = kmap_atomic_pfn(pfn);
+
+		if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
+			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+	
+		if (icache_is_pipt())
+			__cpuc_coherent_user_range((unsigned long)va,
+						   (unsigned long)va + PAGE_SIZE);
+
+		size -= PAGE_SIZE;
+		pfn++;
+
+		kunmap_atomic(va);
+	}
+
+	if (!icache_is_pipt() && !icache_is_vivt_asid_tagged()) {
 		/* any kind of VIPT cache */
 		__flush_icache_all();
 	}
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 1f5b793..4da6504 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -921,6 +921,12 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 	return !pfn_valid(pfn);
 }
 
+static void coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+				      unsigned long size, bool uncached)
+{
+	__coherent_cache_guest_page(vcpu, pfn, size, uncached);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
@@ -1010,8 +1016,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			kvm_set_s2pmd_writable(&new_pmd);
 			kvm_set_pfn_dirty(pfn);
 		}
-		coherent_cache_guest_page(vcpu, hva & PMD_MASK, PMD_SIZE,
-					  fault_ipa_uncached);
+		coherent_cache_guest_page(vcpu, pfn, PMD_SIZE, fault_ipa_uncached);
 		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
 	} else {
 		pte_t new_pte = pfn_pte(pfn, mem_type);
@@ -1019,8 +1024,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			kvm_set_s2pte_writable(&new_pte);
 			kvm_set_pfn_dirty(pfn);
 		}
-		coherent_cache_guest_page(vcpu, hva, PAGE_SIZE,
-					  fault_ipa_uncached);
+		coherent_cache_guest_page(vcpu, pfn, PAGE_SIZE, fault_ipa_uncached);
 		ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte,
 			pgprot_val(mem_type) == pgprot_val(PAGE_S2_DEVICE));
 	}
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index b7419f5..970e11e 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -243,15 +243,18 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)
 	return (vcpu_sys_reg(vcpu, SCTLR_EL1) & 0b101) == 0b101;
 }
 
-static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
-					     unsigned long size,
-					     bool ipa_uncached)
+static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu, pfn_t pfn,
+					       unsigned long size,
+					       bool ipa_uncached)
 {
+	void *va = page_address(pfn_to_page(pfn));
+
 	if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
-		kvm_flush_dcache_to_poc((void *)hva, size);
+		kvm_flush_dcache_to_poc(va, size);
 
 	if (!icache_is_aliasing()) {		/* PIPT */
-		flush_icache_range(hva, hva + size);
+		flush_icache_range((unsigned long)va,
+				   (unsigned long)va + size);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
-- 
2.1.4



* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-08 11:59 ` [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault Marc Zyngier
@ 2015-01-08 12:30   ` Peter Maydell
  2015-01-08 13:07     ` Marc Zyngier
  2015-01-09 12:51   ` Christoffer Dall
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2015-01-08 12:30 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Christoffer Dall, kvm-devel, kvmarm

On 8 January 2015 at 11:59, Marc Zyngier <marc.zyngier@arm.com> wrote:
> @@ -180,10 +177,32 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
>          *
>          * VIVT caches are tagged using both the ASID and the VMID and doesn't
>          * need any kind of flushing (DDI 0406C.b - Page B3-1392).
> +        *
> +        * We need to do this through a kernel mapping (using the
> +        * user-space mapping has proved to be the wrong
> +        * solution). For that, we need to kmap one page at a time,
> +        * and iterate over the range.
>          */
> -       if (icache_is_pipt()) {
> -               __cpuc_coherent_user_range(hva, hva + size);
> -       } else if (!icache_is_vivt_asid_tagged()) {
> +
> +       VM_BUG_ON(size & PAGE_MASK);
> +
> +       while (size) {
> +               void *va = kmap_atomic_pfn(pfn);
> +
> +               if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
> +                       kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> +
> +               if (icache_is_pipt())
> +                       __cpuc_coherent_user_range((unsigned long)va,
> +                                                  (unsigned long)va + PAGE_SIZE);
> +
> +               size -= PAGE_SIZE;
> +               pfn++;
> +
> +               kunmap_atomic(va);
> +       }

If (vcpu_has_cache_enabled(vcpu) && !ipa_uncached && !icache_is_pipt())
then we're going to run round this loop mapping and unmapping without
actually doing anything. Is it worth hoisting that check out of the
loop? (I think it's going to be the common case for a non-PIPT icache,
right?)

> +       if (!icache_is_pipt() && !icache_is_vivt_asid_tagged()) {
>                 /* any kind of VIPT cache */
>                 __flush_icache_all();
>         }

Can you remind me why it's OK not to flush the icache for an
ASID tagged VIVT icache? Making this page coherent might actually
be revealing a change in the instructions associated with the VA,
mightn't it?

thanks
-- PMM


* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-08 12:30   ` Peter Maydell
@ 2015-01-08 13:07     ` Marc Zyngier
  2015-01-08 13:16       ` Peter Maydell
  0 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-08 13:07 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Christoffer Dall, kvm-devel, kvmarm

On 08/01/15 12:30, Peter Maydell wrote:
> On 8 January 2015 at 11:59, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> @@ -180,10 +177,32 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
>>          *
>>          * VIVT caches are tagged using both the ASID and the VMID and doesn't
>>          * need any kind of flushing (DDI 0406C.b - Page B3-1392).
>> +        *
>> +        * We need to do this through a kernel mapping (using the
>> +        * user-space mapping has proved to be the wrong
>> +        * solution). For that, we need to kmap one page at a time,
>> +        * and iterate over the range.
>>          */
>> -       if (icache_is_pipt()) {
>> -               __cpuc_coherent_user_range(hva, hva + size);
>> -       } else if (!icache_is_vivt_asid_tagged()) {
>> +
>> +       VM_BUG_ON(size & PAGE_MASK);
>> +
>> +       while (size) {
>> +               void *va = kmap_atomic_pfn(pfn);
>> +
>> +               if (!vcpu_has_cache_enabled(vcpu) || ipa_uncached)
>> +                       kvm_flush_dcache_to_poc(va, PAGE_SIZE);
>> +
>> +               if (icache_is_pipt())
>> +                       __cpuc_coherent_user_range((unsigned long)va,
>> +                                                  (unsigned long)va + PAGE_SIZE);
>> +
>> +               size -= PAGE_SIZE;
>> +               pfn++;
>> +
>> +               kunmap_atomic(va);
>> +       }
> 
> If (vcpu_has_cache_enabled(vcpu) && !ipa_uncached && !icache_is_pipt())
> then we're going to run round this loop mapping and unmapping without
> actually doing anything. Is it worth hoisting that check out of the
> loop? (I think it's going to be the common case for a non-PIPT icache,
> right?)

Yup, that's a valid optimization.
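
Presumably something along these lines (untested sketch, with the two
predicates hoisted out of the loop; same names as in the patch):

	bool need_dcache = !vcpu_has_cache_enabled(vcpu) || ipa_uncached;
	bool need_icache = icache_is_pipt();

	while ((need_dcache || need_icache) && size) {
		void *va = kmap_atomic_pfn(pfn);

		if (need_dcache)
			kvm_flush_dcache_to_poc(va, PAGE_SIZE);

		if (need_icache)
			__cpuc_coherent_user_range((unsigned long)va,
						   (unsigned long)va + PAGE_SIZE);

		size -= PAGE_SIZE;
		pfn++;

		kunmap_atomic(va);
	}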

>> +       if (!icache_is_pipt() && !icache_is_vivt_asid_tagged()) {
>>                 /* any kind of VIPT cache */
>>                 __flush_icache_all();
>>         }
> 
> Can you remind me why it's OK not to flush the icache for an
> ASID tagged VIVT icache? Making this page coherent might actually
> be revealing a change in the instructions associated with the VA,
> mightn't it?

ASID cached VIVT icaches are also VMID tagged. It is thus impossible for
stale cache lines to come with a new page. And if by synchronizing the
caches you obtain a different instruction stream, it means you've
restored the wrong page.

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify
  2015-01-08 11:59 ` [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify Marc Zyngier
@ 2015-01-08 13:12   ` Paolo Bonzini
  2015-01-08 19:00   ` Andrea Arcangeli
  1 sibling, 0 replies; 45+ messages in thread
From: Paolo Bonzini @ 2015-01-08 13:12 UTC (permalink / raw)
  To: Marc Zyngier, Christoffer Dall, kvm, kvmarm
  Cc: Steve Capper, Andrea Arcangeli, Rik van Riel

Andrea, Rik, please review this patch.

Thanks in advance,

Paolo

On 08/01/2015 12:59, Marc Zyngier wrote:
> From: Steve Capper <steve.capper@linaro.org>
> 
> ptep_clear_flush_young_notify and pmdp_clear_flush_young_notify both
> call the notifiers *after* the pte/pmd has been made young.
> 
> This can cause problems with KVM that relies on being able to block
> MMU notifiers when carrying out maintenance of second stage
> descriptors.
> 
> This patch ensures that the MMU notifiers are called before ptes and
> pmds are made old.
> 
> Signed-off-by: Steve Capper <steve.capper@linaro.org>
> ---
>  include/linux/mmu_notifier.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 95243d2..c454c76 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -290,11 +290,11 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>  	int __young;							\
>  	struct vm_area_struct *___vma = __vma;				\
>  	unsigned long ___address = __address;				\
> -	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
> -	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
> +	__young = mmu_notifier_clear_flush_young(___vma->vm_mm,		\
>  						  ___address,		\
>  						  ___address +		\
>  							PAGE_SIZE);	\
> +	__young |= ptep_clear_flush_young(___vma, ___address, __ptep);	\
>  	__young;							\
>  })
>  
> @@ -303,11 +303,11 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>  	int __young;							\
>  	struct vm_area_struct *___vma = __vma;				\
>  	unsigned long ___address = __address;				\
> -	__young = pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
> -	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
> +	__young = mmu_notifier_clear_flush_young(___vma->vm_mm,		\
>  						  ___address,		\
>  						  ___address +		\
>  							PMD_SIZE);	\
> +	__young |= pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
>  	__young;							\
>  })
>  
> 


* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-08 13:07     ` Marc Zyngier
@ 2015-01-08 13:16       ` Peter Maydell
  2015-01-08 15:06         ` Marc Zyngier
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2015-01-08 13:16 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Christoffer Dall, kvm-devel, kvmarm

On 8 January 2015 at 13:07, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> Can you remind me why it's OK not to flush the icache for an
>> ASID tagged VIVT icache? Making this page coherent might actually
>> be revealing a change in the instructions associated with the VA,
>> mightn't it?
>
> ASID cached VIVT icaches are also VMID tagged. It is thus impossible for
> stale cache lines to come with a new page. And if by synchronizing the
> caches you obtain a different instruction stream, it means you've
> restored the wrong page.

...is that true even if the dirty data in the dcache comes from
the userspace process doing DMA or writing the initial boot
image or whatever?

-- PMM


* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-08 13:16       ` Peter Maydell
@ 2015-01-08 15:06         ` Marc Zyngier
  2015-01-08 15:21           ` Peter Maydell
  0 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-08 15:06 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Christoffer Dall, kvm-devel, kvmarm

On 08/01/15 13:16, Peter Maydell wrote:
> On 8 January 2015 at 13:07, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>> Can you remind me why it's OK not to flush the icache for an
>>> ASID tagged VIVT icache? Making this page coherent might actually
>>> be revealing a change in the instructions associated with the VA,
>>> mightn't it?
>>
>> ASID cached VIVT icaches are also VMID tagged. It is thus impossible for
>> stale cache lines to come with a new page. And if by synchronizing the
>> caches you obtain a different instruction stream, it means you've
>> restored the wrong page.
> 
> ...is that true even if the dirty data in the dcache comes from
> the userspace process doing DMA or writing the initial boot
> image or whatever?

We perform this on a page that is being brought in stage-2. Two cases:

- This is a page that is mapped for the first time: the icache should be
invalid for this page (the guest should have invalidated it in the first
place),

- This is a page that we bring back from swap: the page must match the
one that has been swapped out. If it has been DMA'ed in in the meantime,
then the guest surely has flushed its icache if it intends to branch to
it, hasn't it?

I have the feeling I'm missing something from your question...

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-08 15:06         ` Marc Zyngier
@ 2015-01-08 15:21           ` Peter Maydell
  2015-01-09 12:50             ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2015-01-08 15:21 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Christoffer Dall, kvm-devel, kvmarm

On 8 January 2015 at 15:06, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 08/01/15 13:16, Peter Maydell wrote:
>>> ASID cached VIVT icaches are also VMID tagged. It is thus impossible for
>>> stale cache lines to come with a new page. And if by synchronizing the
>>> caches you obtain a different instruction stream, it means you've
>>> restored the wrong page.
>>
>> ...is that true even if the dirty data in the dcache comes from
>> the userspace process doing DMA or writing the initial boot
>> image or whatever?
>
> We perform this on a page that is being brought in stage-2. Two cases:
>
> - This is a page is mapped for the first time: the icache should be
> invalid for this page (the guest should have invalidated it the first
> place),

If this is the first instruction in the guest (ie we've just
(warm) reset the VM and are running the kernel as loaded into the guest
by QEMU/kvmtool) then the guest can't have invalidated the icache,
and QEMU can't do the invalidate because it doesn't have the vaddr
and VMID of the guest.

> - This is a page that we bring back from swap: the page must match the
> one that has been swapped out. If it has been DMA'ed in in the meantime,
> then the guest surely has flushed its icache if it intends to branch to
> it, hasn't it?

I agree that for the DMA case the guest will have done the invalidate.

-- PMM


* Re: [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify
  2015-01-08 11:59 ` [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify Marc Zyngier
  2015-01-08 13:12   ` Paolo Bonzini
@ 2015-01-08 19:00   ` Andrea Arcangeli
  2015-01-12 10:15     ` Steve Capper
  1 sibling, 1 reply; 45+ messages in thread
From: Andrea Arcangeli @ 2015-01-08 19:00 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Christoffer Dall, kvm, kvmarm, Steve Capper

On Thu, Jan 08, 2015 at 11:59:06AM +0000, Marc Zyngier wrote:
> From: Steve Capper <steve.capper@linaro.org>
> 
> ptep_clear_flush_young_notify and pmdp_clear_flush_young_notify both
> call the notifiers *after* the pte/pmd has been made young.
> 

On x86 on EPT without hardware access bit (!shadow_accessed_mask),
we'll trigger a KVM page fault (gup_fast) which would mark the page
referenced to give it higher priority in the LRU (or set the accessed
bit if it's a THP).

If we drop the KVM shadow pagetable before clearing the accessed bit
in the linux pte, there's a window where we won't set the young bit
for THP. For non-THP it's less of an issue because gup_fast calls
mark_page_accessed which rolls the lrus and sets the referenced bit in
the struct page, so the effect of mark_page_accessed doesn't get
lost when the linux pte accessed bit is cleared.

We could also consider using mark_page_accessed in
follow_trans_huge_pmd to work around the problem.  I think setting the
young bit in gup_fast is correct and would be more similar to a real
CPU access (which is what gup_fast simulates anyway), so the patch below
is literally introducing a race condition, even if it's going to be
lost in the noise and isn't a problem.

> This can cause problems with KVM that relies on being able to block
> MMU notifiers when carrying out maintenance of second stage
> descriptors.
> 
> This patch ensures that the MMU notifiers are called before ptes and
> pmds are made old.

Unfortunately I don't understand why this is needed.

The only difference this can make to KVM is that without the patch,
kvm_age_rmapp is called while the linux pte is less likely to have the
accessed bit set (current behavior). It can still be set by hardware
through another CPU touching the memory before the mmu notifier is
invoked.

With the patch the linux pte is more likely to have the accessed bit
set as it's not cleared before calling the mmu notifier.

In both cases (at least in x86 where the accessed bit is always set in
hardware) the accessed bit may or may not be set. The pte can not
otherwise change as it's called with the PT lock.

So again it looks like a noop, and it introduces a mostly theoretical
race condition for the THP young bit in the linux pte with EPT and
!shadow_accessed_mask.

Clearly there must be some obscure arm detail I'm not aware of that
makes this helpful, but the description in the commit header isn't enough
to understand what's up with blocking mmu notifiers or such. Could you
elaborate?

Thanks!
Andrea

> 
> Signed-off-by: Steve Capper <steve.capper@linaro.org>
> ---
>  include/linux/mmu_notifier.h | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 95243d2..c454c76 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -290,11 +290,11 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>  	int __young;							\
>  	struct vm_area_struct *___vma = __vma;				\
>  	unsigned long ___address = __address;				\
> -	__young = ptep_clear_flush_young(___vma, ___address, __ptep);	\
> -	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
> +	__young = mmu_notifier_clear_flush_young(___vma->vm_mm,		\
>  						  ___address,		\
>  						  ___address +		\
>  							PAGE_SIZE);	\
> +	__young |= ptep_clear_flush_young(___vma, ___address, __ptep);	\
>  	__young;							\
>  })
>  
> @@ -303,11 +303,11 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
>  	int __young;							\
>  	struct vm_area_struct *___vma = __vma;				\
>  	unsigned long ___address = __address;				\
> -	__young = pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
> -	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
> +	__young = mmu_notifier_clear_flush_young(___vma->vm_mm,		\
>  						  ___address,		\
>  						  ___address +		\
>  							PMD_SIZE);	\
> +	__young |= pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
>  	__young;							\
>  })
>  
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
  2015-01-08 11:59 ` [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches Marc Zyngier
@ 2015-01-09 11:19   ` Christoffer Dall
  2015-01-09 11:38     ` Marc Zyngier
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-09 11:19 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvm, kvmarm, Steve Capper

On Thu, Jan 08, 2015 at 11:59:07AM +0000, Marc Zyngier wrote:
> Trying to emulate the behaviour of set/way cache ops is fairly
> pointless, as there are too many ways we can end-up missing stuff.
> Also, there is some system caches out there that simply ignore
> set/way operations.
> 
> So instead of trying to implement them, let's convert it to VA ops,
> and use them as a way to re-enable the trapping of VM ops. That way,
> we can detect the point when the MMU/caches are turned off, and do
> a full VM flush (which is what the guest was trying to do anyway).
> 
> This allows a 32bit zImage to boot on the APM thingy, and will
> probably help bootloaders in general.
> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
>  arch/arm/include/asm/kvm_host.h   |   3 --
>  arch/arm/kvm/arm.c                |  10 ----
>  arch/arm/kvm/coproc.c             |  90 +++++++++++++++++++++------------
>  arch/arm64/include/asm/kvm_host.h |   3 --
>  arch/arm64/kvm/sys_regs.c         | 102 ++++++++++++++++++++++----------------
>  5 files changed, 116 insertions(+), 92 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 254e065..04b4ea0 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
>  	 * Anything that is not used directly from assembly code goes
>  	 * here.
>  	 */
> -	/* dcache set/way operation pending */
> -	int last_pcpu;
> -	cpumask_t require_dcache_flush;
>  
>  	/* Don't run the guest on this vcpu */
>  	bool pause;
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 2d6d910..0b0d58a 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	vcpu->cpu = cpu;
>  	vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
>  
> -	/*
> -	 * Check whether this vcpu requires the cache to be flushed on
> -	 * this physical CPU. This is a consequence of doing dcache
> -	 * operations by set/way on this vcpu. We do it here to be in
> -	 * a non-preemptible section.
> -	 */
> -	if (cpumask_test_and_clear_cpu(cpu, &vcpu->arch.require_dcache_flush))
> -		flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
> -
>  	kvm_arm_set_running_vcpu(vcpu);
>  }
>  
> @@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  		ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
>  
>  		vcpu->mode = OUTSIDE_GUEST_MODE;
> -		vcpu->arch.last_pcpu = smp_processor_id();
>  		kvm_guest_exit();
>  		trace_kvm_exit(*vcpu_pc(vcpu));
>  		/*
> diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
> index 7928dbd..3d352a5 100644
> --- a/arch/arm/kvm/coproc.c
> +++ b/arch/arm/kvm/coproc.c
> @@ -189,43 +189,57 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
>  	return true;
>  }
>  
> -/* See note at ARM ARM B1.14.4 */
> +/*
> + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
> + *
> + * Main problems:
> + * - S/W ops are local to a CPU (not broadcast)
> + * - We have line migration behind our back (speculation)
> + * - System caches don't support S/W at all (damn!)
> + *
> + * In the face of the above, the best we can do is to try and convert
> + * S/W ops to VA ops. Because the guest is not allowed to infer the
> + * S/W to PA mapping, it can only use S/W to nuke the whole cache,
> + * which is a rather good thing for us.
> + *
> + * Also, it is only used when turning caches on/off ("The expected
> + * usage of the cache maintenance instructions that operate by set/way
> + * is associated with the cache maintenance instructions associated
> + * with the powerdown and powerup of caches, if this is required by
> + * the implementation.").
> + *
> + * We use the following policy:
> + *
> + * - If we trap a S/W operation, we enable VM trapping to detect
> + *   caches being turned on/off.
> + *
> + * - If the caches have already been turned off when doing the S/W op,
> + *   we nuke the whole VM cache.
> + *
> + * - We flush the cache on both caches being turned on and off.
> + *
> + * - Once the caches are enabled, we stop trapping VM ops.
> + */
>  static bool access_dcsw(struct kvm_vcpu *vcpu,
>  			const struct coproc_params *p,
>  			const struct coproc_reg *r)
>  {
> -	unsigned long val;
> -	int cpu;
> -
>  	if (!p->is_write)
>  		return read_from_write_only(vcpu, p);
>  
> -	cpu = get_cpu();
> -
> -	cpumask_setall(&vcpu->arch.require_dcache_flush);
> -	cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
> -
> -	/* If we were already preempted, take the long way around */
> -	if (cpu != vcpu->arch.last_pcpu) {
> -		flush_cache_all();
> -		goto done;
> -	}
> -
> -	val = *vcpu_reg(vcpu, p->Rt1);
> -
> -	switch (p->CRm) {
> -	case 6:			/* Upgrade DCISW to DCCISW, as per HCR.SWIO */
> -	case 14:		/* DCCISW */
> -		asm volatile("mcr p15, 0, %0, c7, c14, 2" : : "r" (val));
> -		break;
> -
> -	case 10:		/* DCCSW */
> -		asm volatile("mcr p15, 0, %0, c7, c10, 2" : : "r" (val));
> -		break;
> -	}
> +	/*
> +	 * If this is the first time we do a S/W operation
> +	 * (i.e. HCT_TVM not set) and the caches are already disabled,
> +	 * flush the whole memory.
> +	 *
> +	 * Otherwise, don't do anything. Just wait for the MMU +

this is misleading because you always enable trapping of the cache
registers, so probably say "don't flush anything".  We can fix this up
when applying though.

> +	 * Caches to be turned off, and use that to actually clean the
> +	 * caches.
> +	 */
> +	if (!(vcpu->arch.hcr & HCR_TVM) && !vcpu_has_cache_enabled(vcpu))
> +		stage2_flush_vm(vcpu->kvm);
>  
> -done:
> -	put_cpu();
> +	vcpu->arch.hcr |= HCR_TVM;
>  
>  	return true;
>  }
> @@ -258,12 +272,24 @@ bool access_sctlr(struct kvm_vcpu *vcpu,
>  		  const struct coproc_params *p,
>  		  const struct coproc_reg *r)
>  {
> +	bool was_enabled = vcpu_has_cache_enabled(vcpu);
> +	bool now_enabled;
> +
>  	access_vm_reg(vcpu, p, r);
>  
> -	if (vcpu_has_cache_enabled(vcpu)) {	/* MMU+Caches enabled? */
> -		vcpu->arch.hcr &= ~HCR_TVM;
> +	now_enabled = vcpu_has_cache_enabled(vcpu);
> +
> +	/*
> +	 * If switching the MMU+caches on, need to invalidate the caches.
> +	 * If switching it off, need to clean the caches.
> +	 * Clean + invalidate does the trick always.

couldn't a clean here then be writing bogus data to RAM on an initial
boot of the guest?

> +	 */
> +	if (now_enabled != was_enabled)
>  		stage2_flush_vm(vcpu->kvm);
> -	}
> +
> +	/* Caches are now on, stop trapping VM ops (until a S/W op) */
> +	if (now_enabled)
> +		vcpu->arch.hcr &= ~HCR_TVM;
>  
>  	return true;
>  }
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 0b7dfdb..acd101a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -116,9 +116,6 @@ struct kvm_vcpu_arch {
>  	 * Anything that is not used directly from assembly code goes
>  	 * here.
>  	 */
> -	/* dcache set/way operation pending */
> -	int last_pcpu;
> -	cpumask_t require_dcache_flush;
>  
>  	/* Don't run the guest */
>  	bool pause;
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 3d7c2df..5c6540b 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -69,55 +69,57 @@ static u32 get_ccsidr(u32 csselr)
>  	return ccsidr;
>  }
>  
> -static void do_dc_cisw(u32 val)
> -{
> -	asm volatile("dc cisw, %x0" : : "r" (val));
> -	dsb(ish);
> -}
> -
> -static void do_dc_csw(u32 val)
> -{
> -	asm volatile("dc csw, %x0" : : "r" (val));
> -	dsb(ish);
> -}
> -
> -/* See note at ARM ARM B1.14.4 */
> +/*
> + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
> + *
> + * Main problems:
> + * - S/W ops are local to a CPU (not broadcast)
> + * - We have line migration behind our back (speculation)
> + * - System caches don't support S/W at all (damn!)
> + *
> + * In the face of the above, the best we can do is to try and convert
> + * S/W ops to VA ops. Because the guest is not allowed to infer the
> + * S/W to PA mapping, it can only use S/W to nuke the whole cache,
> + * which is a rather good thing for us.
> + *
> + * Also, it is only used when turning caches on/off ("The expected
> + * usage of the cache maintenance instructions that operate by set/way
> + * is associated with the cache maintenance instructions associated
> + * with the powerdown and powerup of caches, if this is required by
> + * the implementation.").
> + *
> + * We use the following policy:
> + *
> + * - If we trap a S/W operation, we enable VM trapping to detect
> + *   caches being turned on/off.
> + *
> + * - If the caches have already been turned off when doing the S/W op,
> + *   we nuke the whole VM cache.
> + *
> + * - We flush the cache on both caches being turned on and off.
> + *
> + * - Once the caches are enabled, we stop trapping VM ops.
> + */

can't we factor this out into a handler in arch/arm/kvm/mmu.c and only
have this comment in one place?

>  static bool access_dcsw(struct kvm_vcpu *vcpu,
>  			const struct sys_reg_params *p,
>  			const struct sys_reg_desc *r)
>  {
> -	unsigned long val;
> -	int cpu;
> -
>  	if (!p->is_write)
>  		return read_from_write_only(vcpu, p);
>  
> -	cpu = get_cpu();
> -
> -	cpumask_setall(&vcpu->arch.require_dcache_flush);
> -	cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
> -
> -	/* If we were already preempted, take the long way around */
> -	if (cpu != vcpu->arch.last_pcpu) {
> -		flush_cache_all();
> -		goto done;
> -	}
> -
> -	val = *vcpu_reg(vcpu, p->Rt);
> -
> -	switch (p->CRm) {
> -	case 6:			/* Upgrade DCISW to DCCISW, as per HCR.SWIO */
> -	case 14:		/* DCCISW */
> -		do_dc_cisw(val);
> -		break;
> -
> -	case 10:		/* DCCSW */
> -		do_dc_csw(val);
> -		break;
> -	}
> +	/*
> +	 * If this is the first time we do a S/W operation
> +	 * (i.e. HCT_TVM not set) and the caches are already disabled,
> +	 * flush the whole memory.
> +	 *
> +	 * Otherwise, don't do anything. Just wait for the MMU +
> +	 * Caches to be turned off, and use that to actually clean the
> +	 * caches.
> +	 */
> +	if (!(vcpu->arch.hcr_el2 & HCR_TVM) && !vcpu_has_cache_enabled(vcpu))
> +		stage2_flush_vm(vcpu->kvm);
>  
> -done:
> -	put_cpu();
> +	vcpu->arch.hcr_el2 |= HCR_TVM;
>  
>  	return true;
>  }
> @@ -155,12 +157,24 @@ static bool access_sctlr(struct kvm_vcpu *vcpu,
>  			 const struct sys_reg_params *p,
>  			 const struct sys_reg_desc *r)
>  {
> +	bool was_enabled = vcpu_has_cache_enabled(vcpu);
> +	bool now_enabled;
> +
>  	access_vm_reg(vcpu, p, r);
>  
> -	if (vcpu_has_cache_enabled(vcpu)) {	/* MMU+Caches enabled? */
> -		vcpu->arch.hcr_el2 &= ~HCR_TVM;
> +	now_enabled = vcpu_has_cache_enabled(vcpu);
> +
> +	/*
> +	 * If switching the MMU+caches on, need to invalidate the caches.
> +	 * If switching it off, need to clean the caches.
> +	 * Clean + invalidate does the trick always.

same question as above (hint: should we really not try to share this
code?)

> +	 */
> +	if (now_enabled != was_enabled)
>  		stage2_flush_vm(vcpu->kvm);
> -	}
> +
> +	/* Caches are now on, stop trapping VM ops (until a S/W op) */
> +	if (now_enabled)
> +		vcpu->arch.hcr_el2 &= ~HCR_TVM;
>  
>  	return true;
>  }
> -- 
> 2.1.4
> 

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
  2015-01-09 11:19   ` Christoffer Dall
@ 2015-01-09 11:38     ` Marc Zyngier
  2015-01-09 12:12       ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-09 11:38 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: kvm, kvmarm, Steve Capper

On 09/01/15 11:19, Christoffer Dall wrote:
> On Thu, Jan 08, 2015 at 11:59:07AM +0000, Marc Zyngier wrote:
>> Trying to emulate the behaviour of set/way cache ops is fairly
>> pointless, as there are too many ways we can end up missing stuff.
>> Also, there are some system caches out there that simply ignore
>> set/way operations.
>>
>> So instead of trying to implement them, let's convert it to VA ops,
>> and use them as a way to re-enable the trapping of VM ops. That way,
>> we can detect the point when the MMU/caches are turned off, and do
>> a full VM flush (which is what the guest was trying to do anyway).
>>
>> This allows a 32bit zImage to boot on the APM thingy, and will
>> probably help bootloaders in general.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>> ---
>>  arch/arm/include/asm/kvm_host.h   |   3 --
>>  arch/arm/kvm/arm.c                |  10 ----
>>  arch/arm/kvm/coproc.c             |  90 +++++++++++++++++++++------------
>>  arch/arm64/include/asm/kvm_host.h |   3 --
>>  arch/arm64/kvm/sys_regs.c         | 102 ++++++++++++++++++++++----------------
>>  5 files changed, 116 insertions(+), 92 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> index 254e065..04b4ea0 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
>>        * Anything that is not used directly from assembly code goes
>>        * here.
>>        */
>> -     /* dcache set/way operation pending */
>> -     int last_pcpu;
>> -     cpumask_t require_dcache_flush;
>>
>>       /* Don't run the guest on this vcpu */
>>       bool pause;
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index 2d6d910..0b0d58a 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>       vcpu->cpu = cpu;
>>       vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
>>
>> -     /*
>> -      * Check whether this vcpu requires the cache to be flushed on
>> -      * this physical CPU. This is a consequence of doing dcache
>> -      * operations by set/way on this vcpu. We do it here to be in
>> -      * a non-preemptible section.
>> -      */
>> -     if (cpumask_test_and_clear_cpu(cpu, &vcpu->arch.require_dcache_flush))
>> -             flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
>> -
>>       kvm_arm_set_running_vcpu(vcpu);
>>  }
>>
>> @@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>               ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
>>
>>               vcpu->mode = OUTSIDE_GUEST_MODE;
>> -             vcpu->arch.last_pcpu = smp_processor_id();
>>               kvm_guest_exit();
>>               trace_kvm_exit(*vcpu_pc(vcpu));
>>               /*
>> diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
>> index 7928dbd..3d352a5 100644
>> --- a/arch/arm/kvm/coproc.c
>> +++ b/arch/arm/kvm/coproc.c
>> @@ -189,43 +189,57 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
>>       return true;
>>  }
>>
>> -/* See note at ARM ARM B1.14.4 */
>> +/*
>> + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
>> + *
>> + * Main problems:
>> + * - S/W ops are local to a CPU (not broadcast)
>> + * - We have line migration behind our back (speculation)
>> + * - System caches don't support S/W at all (damn!)
>> + *
>> + * In the face of the above, the best we can do is to try and convert
>> + * S/W ops to VA ops. Because the guest is not allowed to infer the
>> + * S/W to PA mapping, it can only use S/W to nuke the whole cache,
>> + * which is a rather good thing for us.
>> + *
>> + * Also, it is only used when turning caches on/off ("The expected
>> + * usage of the cache maintenance instructions that operate by set/way
>> + * is associated with the cache maintenance instructions associated
>> + * with the powerdown and powerup of caches, if this is required by
>> + * the implementation.").
>> + *
>> + * We use the following policy:
>> + *
>> + * - If we trap a S/W operation, we enable VM trapping to detect
>> + *   caches being turned on/off.
>> + *
>> + * - If the caches have already been turned off when doing the S/W op,
>> + *   we nuke the whole VM cache.
>> + *
>> + * - We flush the cache on both caches being turned on and off.
>> + *
>> + * - Once the caches are enabled, we stop trapping VM ops.
>> + */
>>  static bool access_dcsw(struct kvm_vcpu *vcpu,
>>                       const struct coproc_params *p,
>>                       const struct coproc_reg *r)
>>  {
>> -     unsigned long val;
>> -     int cpu;
>> -
>>       if (!p->is_write)
>>               return read_from_write_only(vcpu, p);
>>
>> -     cpu = get_cpu();
>> -
>> -     cpumask_setall(&vcpu->arch.require_dcache_flush);
>> -     cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
>> -
>> -     /* If we were already preempted, take the long way around */
>> -     if (cpu != vcpu->arch.last_pcpu) {
>> -             flush_cache_all();
>> -             goto done;
>> -     }
>> -
>> -     val = *vcpu_reg(vcpu, p->Rt1);
>> -
>> -     switch (p->CRm) {
>> -     case 6:                 /* Upgrade DCISW to DCCISW, as per HCR.SWIO */
>> -     case 14:                /* DCCISW */
>> -             asm volatile("mcr p15, 0, %0, c7, c14, 2" : : "r" (val));
>> -             break;
>> -
>> -     case 10:                /* DCCSW */
>> -             asm volatile("mcr p15, 0, %0, c7, c10, 2" : : "r" (val));
>> -             break;
>> -     }
>> +     /*
>> +      * If this is the first time we do a S/W operation
>> +      * (i.e. HCR_TVM not set) and the caches are already disabled,
>> +      * flush the whole memory.
>> +      *
>> +      * Otherwise, don't do anything. Just wait for the MMU +
> 
> this is misleading because you always enable trapping of the cache
> registers, so probably say "don't flush anything".  We can fix this up
> when applying though.

Fair enough.

>> +      * Caches to be turned off, and use that to actually clean the
>> +      * caches.
>> +      */
>> +     if (!(vcpu->arch.hcr & HCR_TVM) && !vcpu_has_cache_enabled(vcpu))
>> +             stage2_flush_vm(vcpu->kvm);
>>
>> -done:
>> -     put_cpu();
>> +     vcpu->arch.hcr |= HCR_TVM;
>>
>>       return true;
>>  }
>> @@ -258,12 +272,24 @@ bool access_sctlr(struct kvm_vcpu *vcpu,
>>                 const struct coproc_params *p,
>>                 const struct coproc_reg *r)
>>  {
>> +     bool was_enabled = vcpu_has_cache_enabled(vcpu);
>> +     bool now_enabled;
>> +
>>       access_vm_reg(vcpu, p, r);
>>
>> -     if (vcpu_has_cache_enabled(vcpu)) {     /* MMU+Caches enabled? */
>> -             vcpu->arch.hcr &= ~HCR_TVM;
>> +     now_enabled = vcpu_has_cache_enabled(vcpu);
>> +
>> +     /*
>> +      * If switching the MMU+caches on, need to invalidate the caches.
>> +      * If switching it off, need to clean the caches.
>> +      * Clean + invalidate does the trick always.
> 
> couldn't a clean here then be writing bogus to RAM on an initial boot
> of the guest?

It shouldn't. We're supposed to start a VM with a fully invalid cache
(that's why we call coherent_cache_guest_page). And this is no different
from what we had before, as an invalidate by set/way was upgraded to
clean+invalidate.

>> +      */
>> +     if (now_enabled != was_enabled)
>>               stage2_flush_vm(vcpu->kvm);
>> -     }
>> +
>> +     /* Caches are now on, stop trapping VM ops (until a S/W op) */
>> +     if (now_enabled)
>> +             vcpu->arch.hcr &= ~HCR_TVM;
>>
>>       return true;
>>  }
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 0b7dfdb..acd101a 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -116,9 +116,6 @@ struct kvm_vcpu_arch {
>>        * Anything that is not used directly from assembly code goes
>>        * here.
>>        */
>> -     /* dcache set/way operation pending */
>> -     int last_pcpu;
>> -     cpumask_t require_dcache_flush;
>>
>>       /* Don't run the guest */
>>       bool pause;
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index 3d7c2df..5c6540b 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -69,55 +69,57 @@ static u32 get_ccsidr(u32 csselr)
>>       return ccsidr;
>>  }
>>
>> -static void do_dc_cisw(u32 val)
>> -{
>> -     asm volatile("dc cisw, %x0" : : "r" (val));
>> -     dsb(ish);
>> -}
>> -
>> -static void do_dc_csw(u32 val)
>> -{
>> -     asm volatile("dc csw, %x0" : : "r" (val));
>> -     dsb(ish);
>> -}
>> -
>> -/* See note at ARM ARM B1.14.4 */
>> +/*
>> + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
>> + *
>> + * Main problems:
>> + * - S/W ops are local to a CPU (not broadcast)
>> + * - We have line migration behind our back (speculation)
>> + * - System caches don't support S/W at all (damn!)
>> + *
>> + * In the face of the above, the best we can do is to try and convert
>> + * S/W ops to VA ops. Because the guest is not allowed to infer the
>> + * S/W to PA mapping, it can only use S/W to nuke the whole cache,
>> + * which is a rather good thing for us.
>> + *
>> + * Also, it is only used when turning caches on/off ("The expected
>> + * usage of the cache maintenance instructions that operate by set/way
>> + * is associated with the cache maintenance instructions associated
>> + * with the powerdown and powerup of caches, if this is required by
>> + * the implementation.").
>> + *
>> + * We use the following policy:
>> + *
>> + * - If we trap a S/W operation, we enable VM trapping to detect
>> + *   caches being turned on/off.
>> + *
>> + * - If the caches have already been turned off when doing the S/W op,
>> + *   we nuke the whole VM cache.
>> + *
>> + * - We flush the cache on both caches being turned on and off.
>> + *
>> + * - Once the caches are enabled, we stop trapping VM ops.
>> + */
> 
> can't we factor this out into a handler in arch/arm/kvm/mmu.c and only
> have this comment in one place?

That's a good point. I'll move it across.
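
Roughly, I'm thinking of something like the sketch below in mmu.c. The
helper names and the vcpu_get_hcr()/vcpu_set_hcr() accessors are only
made up for illustration (32bit keeps HCR in vcpu->arch.hcr, 64bit in
vcpu->arch.hcr_el2, so we'd need some small per-arch wrapper):

/* Called when a trapped S/W cache maintenance op is detected. */
void kvm_set_way_flush(struct kvm_vcpu *vcpu)
{
	unsigned long hcr = vcpu_get_hcr(vcpu);

	/*
	 * First S/W op with the caches already off: flush the whole
	 * guest. Otherwise, just arm the VM-ops trap and wait for the
	 * caches to be turned off before doing the actual clean.
	 */
	if (!(hcr & HCR_TVM) && !vcpu_has_cache_enabled(vcpu))
		stage2_flush_vm(vcpu->kvm);

	vcpu_set_hcr(vcpu, hcr | HCR_TVM);
}

/* Called after the guest has written SCTLR. */
void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
{
	bool now_enabled = vcpu_has_cache_enabled(vcpu);

	/* Clean+invalidate covers both the on->off and off->on cases. */
	if (now_enabled != was_enabled)
		stage2_flush_vm(vcpu->kvm);

	/* Caches are now on: stop trapping VM ops until the next S/W op. */
	if (now_enabled)
		vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) & ~HCR_TVM);
}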

>>  static bool access_dcsw(struct kvm_vcpu *vcpu,
>>                       const struct sys_reg_params *p,
>>                       const struct sys_reg_desc *r)
>>  {
>> -     unsigned long val;
>> -     int cpu;
>> -
>>       if (!p->is_write)
>>               return read_from_write_only(vcpu, p);
>>
>> -     cpu = get_cpu();
>> -
>> -     cpumask_setall(&vcpu->arch.require_dcache_flush);
>> -     cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
>> -
>> -     /* If we were already preempted, take the long way around */
>> -     if (cpu != vcpu->arch.last_pcpu) {
>> -             flush_cache_all();
>> -             goto done;
>> -     }
>> -
>> -     val = *vcpu_reg(vcpu, p->Rt);
>> -
>> -     switch (p->CRm) {
>> -     case 6:                 /* Upgrade DCISW to DCCISW, as per HCR.SWIO */
>> -     case 14:                /* DCCISW */
>> -             do_dc_cisw(val);
>> -             break;
>> -
>> -     case 10:                /* DCCSW */
>> -             do_dc_csw(val);
>> -             break;
>> -     }
>> +     /*
>> +      * If this is the first time we do a S/W operation
>> +      * (i.e. HCR_TVM not set) and the caches are already disabled,
>> +      * flush the whole memory.
>> +      *
>> +      * Otherwise, don't do anything. Just wait for the MMU +
>> +      * Caches to be turned off, and use that to actually clean the
>> +      * caches.
>> +      */
>> +     if (!(vcpu->arch.hcr_el2 & HCR_TVM) && !vcpu_has_cache_enabled(vcpu))
>> +             stage2_flush_vm(vcpu->kvm);
>>
>> -done:
>> -     put_cpu();
>> +     vcpu->arch.hcr_el2 |= HCR_TVM;
>>
>>       return true;
>>  }
>> @@ -155,12 +157,24 @@ static bool access_sctlr(struct kvm_vcpu *vcpu,
>>                        const struct sys_reg_params *p,
>>                        const struct sys_reg_desc *r)
>>  {
>> +     bool was_enabled = vcpu_has_cache_enabled(vcpu);
>> +     bool now_enabled;
>> +
>>       access_vm_reg(vcpu, p, r);
>>
>> -     if (vcpu_has_cache_enabled(vcpu)) {     /* MMU+Caches enabled? */
>> -             vcpu->arch.hcr_el2 &= ~HCR_TVM;
>> +     now_enabled = vcpu_has_cache_enabled(vcpu);
>> +
>> +     /*
>> +      * If switching the MMU+caches on, need to invalidate the caches.
>> +      * If switching it off, need to clean the caches.
>> +      * Clean + invalidate does the trick always.
> 
> same question as above (hint: should we really not try to share this
> code?)

I'll have a look at moving things to mmu.c.

>> +      */
>> +     if (now_enabled != was_enabled)
>>               stage2_flush_vm(vcpu->kvm);
>> -     }
>> +
>> +     /* Caches are now on, stop trapping VM ops (until a S/W op) */
>> +     if (now_enabled)
>> +             vcpu->arch.hcr_el2 &= ~HCR_TVM;
>>
>>       return true;
>>  }

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches
  2015-01-09 11:38     ` Marc Zyngier
@ 2015-01-09 12:12       ` Christoffer Dall
  0 siblings, 0 replies; 45+ messages in thread
From: Christoffer Dall @ 2015-01-09 12:12 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvm, kvmarm, Steve Capper

On Fri, Jan 09, 2015 at 11:38:19AM +0000, Marc Zyngier wrote:

[...]

> >> @@ -258,12 +272,24 @@ bool access_sctlr(struct kvm_vcpu *vcpu,
> >>                 const struct coproc_params *p,
> >>                 const struct coproc_reg *r)
> >>  {
> >> +     bool was_enabled = vcpu_has_cache_enabled(vcpu);
> >> +     bool now_enabled;
> >> +
> >>       access_vm_reg(vcpu, p, r);
> >>
> >> -     if (vcpu_has_cache_enabled(vcpu)) {     /* MMU+Caches enabled? */
> >> -             vcpu->arch.hcr &= ~HCR_TVM;
> >> +     now_enabled = vcpu_has_cache_enabled(vcpu);
> >> +
> >> +     /*
> >> +      * If switching the MMU+caches on, need to invalidate the caches.
> >> +      * If switching it off, need to clean the caches.
> >> +      * Clean + invalidate does the trick always.
> > 
> > couldn't a clean here then be writing bogus to RAM on an initial boot
> > of the guest?
> 
> It shouldn't. We're supposed to start a VM with a fully invalid cache
> (that's why we call coherent_cache_guest_page). And this is no different
> from what we had before, as an invalidate by set/way was upgraded to
> clean+invalidate.
> 

right, that makes sense.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
  2015-01-08 11:59 ` [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap Marc Zyngier
@ 2015-01-09 12:30   ` Christoffer Dall
  2015-01-09 14:35     ` Marc Zyngier
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-09 12:30 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvm, kvmarm, Steve Capper

On Thu, Jan 08, 2015 at 11:59:08AM +0000, Marc Zyngier wrote:
> Let's assume a guest has created an uncached mapping, and written
> to that page. Let's also assume that the host uses a cache-coherent
> IO subsystem. Let's finally assume that the host is under memory
> pressure and starts to swap things out.
> 
> Before this "uncached" page is evicted, we need to make sure it
> gets invalidated, or the IO subsystem is going to swap out the
> cached view, losing the data that has been written there.
> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> ---
>  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++
>  arch/arm/kvm/mmu.c               | 46 +++++++++++++++++++++++++++-------------
>  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++
>  3 files changed, 80 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 63e0ecc..7ceb836 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -44,6 +44,7 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +#include <linux/highmem.h>
>  #include <asm/cacheflush.h>
>  #include <asm/pgalloc.h>
>  
> @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
>  
>  #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))
>  
> +static inline void __kvm_flush_dcache_pte(pte_t pte)
> +{
> +	void *va = kmap_atomic(pte_page(pte));
> +
> +	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> +
> +	kunmap_atomic(va);
> +}
> +
> +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
> +{
> +	unsigned long size = PMD_SIZE;
> +	pfn_t pfn = pmd_pfn(pmd);
> +
> +	while (size) {
> +		void *va = kmap_atomic_pfn(pfn);
> +
> +		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> +
> +		pfn++;
> +		size -= PAGE_SIZE;
> +
> +		kunmap_atomic(va);
> +	}
> +}
> +
> +static inline void __kvm_flush_dcache_pud(pud_t pud)
> +{
> +}
> +
>  void stage2_flush_vm(struct kvm *kvm);
>  
>  #endif	/* !__ASSEMBLY__ */
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 1dc9778..1f5b793 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>  		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
>  }
>  
> +static void kvm_flush_dcache_pte(pte_t pte)
> +{
> +	__kvm_flush_dcache_pte(pte);
> +}
> +
> +static void kvm_flush_dcache_pmd(pmd_t pmd)
> +{
> +	__kvm_flush_dcache_pmd(pmd);
> +}
> +
> +static void kvm_flush_dcache_pud(pud_t pud)
> +{
> +	__kvm_flush_dcache_pud(pud);
> +}
> +
>  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
>  				  int min, int max)
>  {
> @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
>  	start_pte = pte = pte_offset_kernel(pmd, addr);
>  	do {
>  		if (!pte_none(*pte)) {
> +			pte_t old_pte = *pte;
>  			kvm_set_pte(pte, __pte(0));
> -			put_page(virt_to_page(pte));

was this a bug beforehand that we released the page before we flushed
the TLB?

>  			kvm_tlb_flush_vmid_ipa(kvm, addr);
> +			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
> +				kvm_flush_dcache_pte(old_pte);

this is confusing me: We are only flushing the cache for cached stage-2
mappings?  Weren't we trying to flush the cache for uncached mappings?
> (we obviously also need to flush a cached stage-2 mapping but where the
guest is mapping it as uncached, but we don't know that easily).

Am I missing something completely here?

In any case we probably need a comment here explaining this.

> +			put_page(virt_to_page(pte));
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  
> @@ -149,8 +167,10 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud,
>  		next = kvm_pmd_addr_end(addr, end);
>  		if (!pmd_none(*pmd)) {
>  			if (kvm_pmd_huge(*pmd)) {
> +				pmd_t old_pmd = *pmd;
>  				pmd_clear(pmd);
>  				kvm_tlb_flush_vmid_ipa(kvm, addr);
> +				kvm_flush_dcache_pmd(old_pmd);
>  				put_page(virt_to_page(pmd));
>  			} else {
>  				unmap_ptes(kvm, pmd, addr, next);
> @@ -173,8 +193,10 @@ static void unmap_puds(struct kvm *kvm, pgd_t *pgd,
>  		next = kvm_pud_addr_end(addr, end);
>  		if (!pud_none(*pud)) {
>  			if (pud_huge(*pud)) {
> +				pud_t old_pud = *pud;
>  				pud_clear(pud);
>  				kvm_tlb_flush_vmid_ipa(kvm, addr);
> +				kvm_flush_dcache_pud(old_pud);
>  				put_page(virt_to_page(pud));
>  			} else {
>  				unmap_pmds(kvm, pud, addr, next);
> @@ -209,10 +231,8 @@ static void stage2_flush_ptes(struct kvm *kvm, pmd_t *pmd,
>  
>  	pte = pte_offset_kernel(pmd, addr);
>  	do {
> -		if (!pte_none(*pte)) {
> -			hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
> -			kvm_flush_dcache_to_poc((void*)hva, PAGE_SIZE);
> -		}
> +		if (!pte_none(*pte))
> +			kvm_flush_dcache_pte(*pte);
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  }
>  
> @@ -226,12 +246,10 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
>  	do {
>  		next = kvm_pmd_addr_end(addr, end);
>  		if (!pmd_none(*pmd)) {
> -			if (kvm_pmd_huge(*pmd)) {
> -				hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
> -				kvm_flush_dcache_to_poc((void*)hva, PMD_SIZE);
> -			} else {
> +			if (kvm_pmd_huge(*pmd))
> +				kvm_flush_dcache_pmd(*pmd);
> +			else
>  				stage2_flush_ptes(kvm, pmd, addr, next);
> -			}
>  		}
>  	} while (pmd++, addr = next, addr != end);
>  }
> @@ -246,12 +264,10 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
>  	do {
>  		next = kvm_pud_addr_end(addr, end);
>  		if (!pud_none(*pud)) {
> -			if (pud_huge(*pud)) {
> -				hva_t hva = gfn_to_hva(kvm, addr >> PAGE_SHIFT);
> -				kvm_flush_dcache_to_poc((void*)hva, PUD_SIZE);
> -			} else {
> +			if (pud_huge(*pud))
> +				kvm_flush_dcache_pud(*pud);
> +			else
>  				stage2_flush_pmds(kvm, pud, addr, next);
> -			}
>  		}
>  	} while (pud++, addr = next, addr != end);
>  }
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 14a74f1..b7419f5 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -260,6 +260,24 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
>  
>  #define kvm_virt_to_phys(x)		__virt_to_phys((unsigned long)(x))
>  
> +static inline void __kvm_flush_dcache_pte(pte_t pte)
> +{
> +	struct page *page = pte_page(pte);
> +	kvm_flush_dcache_to_poc(page_address(page), PAGE_SIZE);
> +}
> +
> +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
> +{
> +	struct page *page = pmd_page(pmd);
> +	kvm_flush_dcache_to_poc(page_address(page), PMD_SIZE);
> +}
> +
> +static inline void __kvm_flush_dcache_pud(pud_t pud)
> +{
> +	struct page *page = pud_page(pud);
> +	kvm_flush_dcache_to_poc(page_address(page), PUD_SIZE);
> +}
> +
>  void stage2_flush_vm(struct kvm *kvm);
>  
>  #endif /* __ASSEMBLY__ */
> -- 
> 2.1.4
> 
Thanks,

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-08 15:21           ` Peter Maydell
@ 2015-01-09 12:50             ` Christoffer Dall
  2015-01-09 13:03               ` Peter Maydell
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-09 12:50 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Marc Zyngier, kvm-devel, kvmarm

On Thu, Jan 08, 2015 at 03:21:50PM +0000, Peter Maydell wrote:
> On 8 January 2015 at 15:06, Marc Zyngier <marc.zyngier@arm.com> wrote:
> > On 08/01/15 13:16, Peter Maydell wrote:
> >>> ASID tagged VIVT icaches are also VMID tagged. It is thus impossible for
> >>> stale cache lines to come with a new page. And if by synchronizing the
> >>> caches you obtain a different instruction stream, it means you've
> >>> restored the wrong page.
> >>
> >> ...is that true even if the dirty data in the dcache comes from
> >> the userspace process doing DMA or writing the initial boot
> >> image or whatever?
> >
> > We perform this on a page that is being brought in stage-2. Two cases:
> >
> > - This is a page is mapped for the first time: the icache should be
> > invalid for this page (the guest should have invalidated it the first
> > place),
> 
> If this is the first instruction in the guest (ie we've just
> (warm) reset the VM and are running the kernel as loaded into the guest
> by QEMU/kvmtool) then the guest can't have invalidated the icache,
> and QEMU can't do the invalidate because it doesn't have the vaddr
> and VMID of the guest.
> 
The guest must clean its icache before turning on the MMU, no?

Whenever we reuse a VMID (rollover), we flush the entire icache for that
vmid.

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-08 11:59 ` [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault Marc Zyngier
  2015-01-08 12:30   ` Peter Maydell
@ 2015-01-09 12:51   ` Christoffer Dall
  1 sibling, 0 replies; 45+ messages in thread
From: Christoffer Dall @ 2015-01-09 12:51 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvm, kvmarm, Steve Capper

On Thu, Jan 08, 2015 at 11:59:09AM +0000, Marc Zyngier wrote:
> When handling a fault in stage-2, we need to resync I$ and D$, just
> to be sure we don't leave any old cache line behind.
> 
> That's very good, except that we do so using the *user* address.
> Under heavy load (swapping like crazy), we may end up in a situation
> where the page gets mapped in stage-2 while being unmapped from
> userspace by another CPU.
> 
> At that point, the DC/IC instructions can generate a fault, which
> we handle with kvm->mmu_lock held. The box quickly deadlocks, user
> is unhappy.
> 
> Instead, perform this invalidation through the kernel mapping,
> which is guaranteed to be present. The box is much happier, and so
> am I.
> 
> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>

This looks good to me!

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-09 12:50             ` Christoffer Dall
@ 2015-01-09 13:03               ` Peter Maydell
  2015-01-09 14:16                 ` Marc Zyngier
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2015-01-09 13:03 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, kvm-devel, kvmarm

On 9 January 2015 at 12:50, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Thu, Jan 08, 2015 at 03:21:50PM +0000, Peter Maydell wrote:
>> If this is the first instruction in the guest (ie we've just
>> (warm) reset the VM and are running the kernel as loaded into the guest
>> by QEMU/kvmtool) then the guest can't have invalidated the icache,
>> and QEMU can't do the invalidate because it doesn't have the vaddr
>> and VMID of the guest.
>>
> The guest must clean its icache before turning on the MMU, no?

Yes, but to execute the "clean icache" insn inside the guest,
the guest is executing instructions, so we'd better not have
stale info for that code...

> Whenever we reuse a VMID (rollover), we flush the entire icache for that
> vmid.

When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
mean we get a new VMID for it, though, does it? I thought that
what causes the icache flush to happen for the reset guest is
that we unmap all of stage 2 and then fault it back in, via
this code. That works for PIPT (we flush the range) and for
VIPT (we do a full icache flush), but at the moment for VIVT
ASID tagged we assume we can do nothing, and I don't think that's
right for this case (though it is right for "faulted because
page was swapped out" and OK for "page was written by DMA").

-- PMM

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-09 13:03               ` Peter Maydell
@ 2015-01-09 14:16                 ` Marc Zyngier
  2015-01-09 15:28                   ` Peter Maydell
  0 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-09 14:16 UTC (permalink / raw)
  To: Peter Maydell, Christoffer Dall; +Cc: kvm-devel, kvmarm

On 09/01/15 13:03, Peter Maydell wrote:
> On 9 January 2015 at 12:50, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
>> On Thu, Jan 08, 2015 at 03:21:50PM +0000, Peter Maydell wrote:
>>> If this is the first instruction in the guest (ie we've just
>>> (warm) reset the VM and are running the kernel as loaded into the guest
>>> by QEMU/kvmtool) then the guest can't have invalidated the icache,
>>> and QEMU can't do the invalidate because it doesn't have the vaddr
>>> and VMID of the guest.
>>>
>> The guest must clean its icache before turning on the MMU, no?
> 
> Yes, but to execute the "clean icache" insn inside the guest,
> the guest is executing instructions, so we'd better not have
> stale info for that code...

But we never start a guest with caches on. It has to enable them on its own.

>> Whenever we reuse a VMID (rollover), we flush the entire icache for that
>> vmid.
> 
> When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
> mean we get a new VMID for it, though, does it? I thought that
> what causes the icache flush to happen for the reset guest is
> that we unmap all of stage 2 and then fault it back in, via
> this code. That works for PIPT (we flush the range) and for
> VIPT (we do a full icache flush), but at the moment for VIVT
> ASID tagged we assume we can do nothing, and I don't think that's
> right for this case (though it is right for "faulted because
> page was swapped out" and OK for "page was written by DMA").

When we reset the guest, we also turn both its Icache off. Before
turning it back on, the guest has to invalidate it (the ARM ARM doesn't
seem to define the state of the cache out of reset).

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
  2015-01-09 12:30   ` Christoffer Dall
@ 2015-01-09 14:35     ` Marc Zyngier
  2015-01-11 12:30       ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-09 14:35 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: kvm, kvmarm, Steve Capper

On 09/01/15 12:30, Christoffer Dall wrote:
> On Thu, Jan 08, 2015 at 11:59:08AM +0000, Marc Zyngier wrote:
>> Let's assume a guest has created an uncached mapping, and written
>> to that page. Let's also assume that the host uses a cache-coherent
>> IO subsystem. Let's finally assume that the host is under memory
>> pressure and starts to swap things out.
>>
>> Before this "uncached" page is evicted, we need to make sure it
>> gets invalidated, or the IO subsystem is going to swap out the
>> cached view, losing the data that has been written there.
>>
>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>> ---
>>  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++
>>  arch/arm/kvm/mmu.c               | 46 +++++++++++++++++++++++++++-------------
>>  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++
>>  3 files changed, 80 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 63e0ecc..7ceb836 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -44,6 +44,7 @@
>>  
>>  #ifndef __ASSEMBLY__
>>  
>> +#include <linux/highmem.h>
>>  #include <asm/cacheflush.h>
>>  #include <asm/pgalloc.h>
>>  
>> @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
>>  
>>  #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))
>>  
>> +static inline void __kvm_flush_dcache_pte(pte_t pte)
>> +{
>> +	void *va = kmap_atomic(pte_page(pte));
>> +
>> +	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
>> +
>> +	kunmap_atomic(va);
>> +}
>> +
>> +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
>> +{
>> +	unsigned long size = PMD_SIZE;
>> +	pfn_t pfn = pmd_pfn(pmd);
>> +
>> +	while (size) {
>> +		void *va = kmap_atomic_pfn(pfn);
>> +
>> +		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
>> +
>> +		pfn++;
>> +		size -= PAGE_SIZE;
>> +
>> +		kunmap_atomic(va);
>> +	}
>> +}
>> +
>> +static inline void __kvm_flush_dcache_pud(pud_t pud)
>> +{
>> +}
>> +
>>  void stage2_flush_vm(struct kvm *kvm);
>>  
>>  #endif	/* !__ASSEMBLY__ */
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 1dc9778..1f5b793 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>>  		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
>>  }
>>  
>> +static void kvm_flush_dcache_pte(pte_t pte)
>> +{
>> +	__kvm_flush_dcache_pte(pte);
>> +}
>> +
>> +static void kvm_flush_dcache_pmd(pmd_t pmd)
>> +{
>> +	__kvm_flush_dcache_pmd(pmd);
>> +}
>> +
>> +static void kvm_flush_dcache_pud(pud_t pud)
>> +{
>> +	__kvm_flush_dcache_pud(pud);
>> +}
>> +
>>  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
>>  				  int min, int max)
>>  {
>> @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
>>  	start_pte = pte = pte_offset_kernel(pmd, addr);
>>  	do {
>>  		if (!pte_none(*pte)) {
>> +			pte_t old_pte = *pte;
>>  			kvm_set_pte(pte, __pte(0));
>> -			put_page(virt_to_page(pte));
> 
> was this a bug beforehand that we released the page before we flushed
> the TLB?

I don't think so. TLB maintenance doesn't require the mapping to exist
in the page tables (while the cache maintenance does).
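
To make the ordering explicit, the hunk above boils down to this
(PAGE_S2_DEVICE check dropped for brevity):

	pte_t old_pte = *pte;

	kvm_set_pte(pte, __pte(0));          /* clear the stage-2 entry */
	kvm_tlb_flush_vmid_ipa(kvm, addr);   /* TLB: no mapping needed */
	kvm_flush_dcache_pte(old_pte);       /* d-cache: done via the kernel
					      * mapping of the old page */
	put_page(virt_to_page(pte));         /* ref on the page table page */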

>>  			kvm_tlb_flush_vmid_ipa(kvm, addr);
>> +			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
>> +				kvm_flush_dcache_pte(old_pte);
> 
> this is confusing me: We are only flushing the cache for cached stage-2
> mappings?  Weren't we trying to flush the cache for uncached mappings?
> (we obviously also need flush a cached stage-2 mapping but where the
> guest is mapping it as uncached, but we don't know that easily).
> 
> Am I missing something completely here?

I think you must be misreading something:
- we want to invalidate mappings because the guest may have created an
uncached mapping
- as we cannot know about the guest's uncached mappings, we flush things
unconditionally (basically anything that is RAM).
- we avoid flushing devices because it is pointless (and the kernel
doesn't have a linear mapping for those).

Is that clearer? Or is it me that is misunderstanding your concerns
(entirely possible, I'm not at my best right now...).

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-09 14:16                 ` Marc Zyngier
@ 2015-01-09 15:28                   ` Peter Maydell
  2015-01-09 17:18                     ` Marc Zyngier
  2015-01-11 12:33                     ` Christoffer Dall
  0 siblings, 2 replies; 45+ messages in thread
From: Peter Maydell @ 2015-01-09 15:28 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Christoffer Dall, kvm-devel, kvmarm

On 9 January 2015 at 14:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 09/01/15 13:03, Peter Maydell wrote:
>> When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
>> mean we get a new VMID for it, though, does it? I thought that
>> what causes the icache flush to happen for the reset guest is
>> that we unmap all of stage 2 and then fault it back in, via
>> this code. That works for PIPT (we flush the range) and for
>> VIPT (we do a full icache flush), but at the moment for VIVT
>> ASID tagged we assume we can do nothing, and I don't think that's
>> right for this case (though it is right for "faulted because
>> page was swapped out" and OK for "page was written by DMA").
>
> When we reset the guest, we also turn both its Icache off. Before
> turning it back on, the guest has to invalidate it (the ARM ARM doesn't
> seem to define the state of the cache out of reset).

But implementations are allowed to hit in the cache even
when the cache is disabled. In particular, setting the guest
SCTLR_EL1.I to 0 means "iside accesses are Normal Noncacheable
for stage 1 attributes" and (v8 ARM ARM D3.4.6) these can
be held in the instruction cache. So the guest is required
to do an icache-invalidate for any instructions that it writes
*itself* (or DMAs in) even with the icache off. But it cannot
possibly do so for its own initial startup code, and it must
be the job of KVM to do that for it. (You can think of this as
"the VCPU provided by KVM always invalidates the icache on reset
and does not require an impdef magic cache-init routine as
described by D3.4.4" if you like.)

-- PMM

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-09 15:28                   ` Peter Maydell
@ 2015-01-09 17:18                     ` Marc Zyngier
  2015-01-11 12:33                     ` Christoffer Dall
  1 sibling, 0 replies; 45+ messages in thread
From: Marc Zyngier @ 2015-01-09 17:18 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Christoffer Dall, kvm-devel, kvmarm

On 09/01/15 15:28, Peter Maydell wrote:
> On 9 January 2015 at 14:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 09/01/15 13:03, Peter Maydell wrote:
>>> When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
>>> mean we get a new VMID for it, though, does it? I thought that
>>> what causes the icache flush to happen for the reset guest is
>>> that we unmap all of stage 2 and then fault it back in, via
>>> this code. That works for PIPT (we flush the range) and for
>>> VIPT (we do a full icache flush), but at the moment for VIVT
>>> ASID tagged we assume we can do nothing, and I don't think that's
>>> right for this case (though it is right for "faulted because
>>> page was swapped out" and OK for "page was written by DMA").
>>
>> When we reset the guest, we also turn both its Icache off. Before
>> turning it back on, the guest has to invalidate it (the ARM ARM doesn't
>> seem to define the state of the cache out of reset).
> 
> But implementations are allowed to hit in the cache even
> when the cache is disabled. In particular, setting the guest
> SCTLR_EL1.I to 0 means "iside accesses are Normal Noncacheable
> for stage 1 attributes" and (v8 ARM ARM D3.4.6) these can
> be held in the instruction cache. So the guest is required
> to do an icache-invalidate for any instructions that it writes
> *itself* (or DMAs in) even with the icache off. But it cannot
> possibly do so for its own initial startup code, and it must
> be the job of KVM to do that for it. (You can think of this as
> "the VCPU provided by KVM always invalidates the icache on reset
> and does not require an impdef magic cache-init routine as
> described by D3.4.4" if you like.)

Right. I think I finally see your point. We've been safe so far in that no
ASID-tagged VIVT icache platform is virtualisation capable, AFAIK.

Probably best to just nuke the icache always for non-PIPT caches, then.

I'll fold that in when I respin the series.
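
For arm64 that should be a small change inside coherent_cache_guest_page(),
roughly along these lines (untested, and the cache-type helper names are
from memory, so double-check them):

	if (!icache_is_aliasing()) {
		/* PIPT: invalidating the faulted-in range is enough */
		flush_icache_range(hva, hva + size);
	} else {
		/* VIPT or ASID-tagged VIVT: nuke the whole icache */
		__flush_icache_all();
	}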

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
  2015-01-09 14:35     ` Marc Zyngier
@ 2015-01-11 12:30       ` Christoffer Dall
  2015-01-12 11:15         ` Marc Zyngier
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-11 12:30 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvm, kvmarm, Steve Capper

On Fri, Jan 09, 2015 at 02:35:49PM +0000, Marc Zyngier wrote:
> On 09/01/15 12:30, Christoffer Dall wrote:
> > On Thu, Jan 08, 2015 at 11:59:08AM +0000, Marc Zyngier wrote:
> >> Let's assume a guest has created an uncached mapping, and written
> >> to that page. Let's also assume that the host uses a cache-coherent
> >> IO subsystem. Let's finally assume that the host is under memory
> >> pressure and starts to swap things out.
> >>
> >> Before this "uncached" page is evicted, we need to make sure it
> >> gets invalidated, or the IO subsystem is going to swap out the
> >> cached view, losing the data that has been written there.
> >>
> >> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> >> ---
> >>  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++
> >>  arch/arm/kvm/mmu.c               | 46 +++++++++++++++++++++++++++-------------
> >>  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++
> >>  3 files changed, 80 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> >> index 63e0ecc..7ceb836 100644
> >> --- a/arch/arm/include/asm/kvm_mmu.h
> >> +++ b/arch/arm/include/asm/kvm_mmu.h
> >> @@ -44,6 +44,7 @@
> >>  
> >>  #ifndef __ASSEMBLY__
> >>  
> >> +#include <linux/highmem.h>
> >>  #include <asm/cacheflush.h>
> >>  #include <asm/pgalloc.h>
> >>  
> >> @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
> >>  
> >>  #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))
> >>  
> >> +static inline void __kvm_flush_dcache_pte(pte_t pte)
> >> +{
> >> +	void *va = kmap_atomic(pte_page(pte));
> >> +
> >> +	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> >> +
> >> +	kunmap_atomic(va);
> >> +}
> >> +
> >> +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
> >> +{
> >> +	unsigned long size = PMD_SIZE;
> >> +	pfn_t pfn = pmd_pfn(pmd);
> >> +
> >> +	while (size) {
> >> +		void *va = kmap_atomic_pfn(pfn);
> >> +
> >> +		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> >> +
> >> +		pfn++;
> >> +		size -= PAGE_SIZE;
> >> +
> >> +		kunmap_atomic(va);
> >> +	}
> >> +}
> >> +
> >> +static inline void __kvm_flush_dcache_pud(pud_t pud)
> >> +{
> >> +}
> >> +
> >>  void stage2_flush_vm(struct kvm *kvm);
> >>  
> >>  #endif	/* !__ASSEMBLY__ */
> >> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >> index 1dc9778..1f5b793 100644
> >> --- a/arch/arm/kvm/mmu.c
> >> +++ b/arch/arm/kvm/mmu.c
> >> @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
> >>  		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
> >>  }
> >>  
> >> +static void kvm_flush_dcache_pte(pte_t pte)
> >> +{
> >> +	__kvm_flush_dcache_pte(pte);
> >> +}
> >> +
> >> +static void kvm_flush_dcache_pmd(pmd_t pmd)
> >> +{
> >> +	__kvm_flush_dcache_pmd(pmd);
> >> +}
> >> +
> >> +static void kvm_flush_dcache_pud(pud_t pud)
> >> +{
> >> +	__kvm_flush_dcache_pud(pud);
> >> +}
> >> +
> >>  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
> >>  				  int min, int max)
> >>  {
> >> @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
> >>  	start_pte = pte = pte_offset_kernel(pmd, addr);
> >>  	do {
> >>  		if (!pte_none(*pte)) {
> >> +			pte_t old_pte = *pte;
> >>  			kvm_set_pte(pte, __pte(0));
> >> -			put_page(virt_to_page(pte));
> > 
> > was this a bug beforehand that we released the page before we flushed
> > the TLB?
> 
> I don't think so. TLB maintenance doesn't require the mapping to exist
> in the page tables (while the cache maintenance does).
> 

duh, the put_page is the ref counting on the page table itself, not the
underlying memory page.  Forget what I asked.

> >>  			kvm_tlb_flush_vmid_ipa(kvm, addr);
> >> +			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
> >> +				kvm_flush_dcache_pte(old_pte);
> > 
> > this is confusing me: We are only flushing the cache for cached stage-2
> > mappings?  Weren't we trying to flush the cache for uncached mappings?
> > (we obviously also need to flush a cached stage-2 mapping but where the
> > guest is mapping it as uncached, but we don't know that easily).
> > 
> > Am I missing something completely here?
> 
> I think you must be misreading something:
> - we want to invalidate mappings because the guest may have created an
> uncached mapping

I don't quite understand: we are invalidating mappings because the page
is being swapped out (and the guest must fault if it tries to access it
again).  Not because the guest may have created an uncached mapping,
that's just an aspect of the situation.  Or am I thinking about this the
wrong way?

> - as we cannot know about the guest's uncached mappings, we flush things
> unconditionally (basically anything that is RAM).

so isn't the problem that the host may have an invalid cached view of
the data, so we need to invalidate that view, not flush the invalid data
to RAM?  Does the kernel take care of that somewhere else for a
cache-coherent IO subsystem?

> - we avoid flushing devices because it is pointless (and the kernel
> doesn't have a linear mapping for those).

Ah, that explains the if statement.  I think a one-line comment about
this would be helpful (I didn't get it, obviously).

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-09 15:28                   ` Peter Maydell
  2015-01-09 17:18                     ` Marc Zyngier
@ 2015-01-11 12:33                     ` Christoffer Dall
  2015-01-11 17:37                       ` Peter Maydell
  1 sibling, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-11 12:33 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Marc Zyngier, kvm-devel, kvmarm

On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
> On 9 January 2015 at 14:16, Marc Zyngier <marc.zyngier@arm.com> wrote:
> > On 09/01/15 13:03, Peter Maydell wrote:
> >> When we reset a cpu by re-calling KVM_ARM_VCPU_INIT, that doesn't
> >> mean we get a new VMID for it, though, does it? I thought that
> >> what causes the icache flush to happen for the reset guest is
> >> that we unmap all of stage 2 and then fault it back in, via
> >> this code. That works for PIPT (we flush the range) and for
> >> VIPT (we do a full icache flush), but at the moment for VIVT
> >> ASID tagged we assume we can do nothing, and I don't think that's
> >> right for this case (though it is right for "faulted because
> >> page was swapped out" and OK for "page was written by DMA").
> >
> > When we reset the guest, we also turn both its Icache off. Before
> > turning it back on, the guest has to invalidate it (the ARM ARM doesn't
> > seem to define the state of the cache out of reset).
> 
> But implementations are allowed to hit in the cache even
> when the cache is disabled. In particular, setting the guest

But how can it hit anything when the icache for the used VMID is
guaranteed to be clear (maybe that requires another full icache
invalidate for that VMID for PSCI reset)?

-Christoffer

> SCTLR_EL1.I to 0 means "iside accesses are Normal Noncacheable
> for stage 1 attributes" and (v8 ARM ARM D3.4.6) these can
> be held in the instruction cache. So the guest is required
> to do an icache-invalidate for any instructions that it writes
> *itself* (or DMAs in) even with the icache off. But it cannot
> possibly do so for its own initial startup code, and it must
> be the job of KVM to do that for it. (You can think of this as
> "the VCPU provided by KVM always invalidates the icache on reset
> and does not require an impdef magic cache-init routine as
> described by D3.4.4" if you like.)
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-11 12:33                     ` Christoffer Dall
@ 2015-01-11 17:37                       ` Peter Maydell
  2015-01-11 17:58                         ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2015-01-11 17:37 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, kvm-devel, kvmarm

On 11 January 2015 at 12:33, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
>> But implementations are allowed to hit in the cache even
>> when the cache is disabled. In particular, setting the guest
>
> But how can it hit anything when the icache for the used VMID is
> guaranteed to be clear (maybe that requires another full icache
> invalidate for that VMID for PSCI reset)?

The point is that at the moment we don't do anything to
guarantee that we've cleared the icache. (Plus could there be
stale data in the icache for this physical CPU for this VMID
because we've run some other vCPU on it? Or does the process
of rescheduling vCPUs across pCPUs and guest ASID management
deal with that?)

You probably want to clear the icache on vcpu (re-)init rather
than reset, though (no guarantee that userspace is going to
handle system resets via PSCI).

-- PMM

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-11 17:37                       ` Peter Maydell
@ 2015-01-11 17:58                         ` Christoffer Dall
  2015-01-11 18:27                           ` Peter Maydell
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-11 17:58 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Marc Zyngier, kvm-devel, kvmarm

On Sun, Jan 11, 2015 at 05:37:52PM +0000, Peter Maydell wrote:
> On 11 January 2015 at 12:33, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
> >> But implementations are allowed to hit in the cache even
> >> when the cache is disabled. In particular, setting the guest
> >
> > But how can it hit anything when the icache for the used VMID is
> > guaranteed to be clear (maybe that requires another full icache
> > invalidate for that VMID for PSCI reset)?
> 
> The point is that at the moment we don't do anything to
> guarantee that we've cleared the icache. 

that's not entirely accurate, I assume all of the icache is
invalidated/cleaned at system bring-up time, and every time we re-use a
VMID (when we start a VMID rollover) we invalidate the entire icache.

> (Plus could there be
> stale data in the icache for this physical CPU for this VMID
> because we've run some other vCPU on it? Or does the process
> of rescheduling vCPUs across pCPUs and guest ASID management
> deal with that?)

we don't clear the icache for vCPUs migrating onto other pCPUs but
invalidating the icache on a page fault won't guarantee that either.  Do
we really need to do that?

> 
> You probably want to clear the icache on vcpu (re-)init rather
> than reset, though (no guarantee that userspace is going to
> handle system resets via PSCI).
> 
Yes, good point.

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-11 17:58                         ` Christoffer Dall
@ 2015-01-11 18:27                           ` Peter Maydell
  2015-01-11 18:38                             ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2015-01-11 18:27 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, kvm-devel, kvmarm

On 11 January 2015 at 17:58, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Sun, Jan 11, 2015 at 05:37:52PM +0000, Peter Maydell wrote:
>> On 11 January 2015 at 12:33, Christoffer Dall
>> <christoffer.dall@linaro.org> wrote:
>> > On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
>> >> But implementations are allowed to hit in the cache even
>> >> when the cache is disabled. In particular, setting the guest
>> >
>> > But how can it hit anything when the icache for the used VMID is
>> > guaranteed to be clear (maybe that requires another full icache
>> > invalidate for that VMID for PSCI reset)?
>>
>> The point is that at the moment we don't do anything to
>> guarantee that we've cleared the icache.
>
> that's not entirely accurate, I assume all of the icache is
> invalidated/cleaned at system bring-up time, and every time we re-use a
> VMID (when we start a VMID rollover) we invalidate the entire icache.

Right, but that doesn't catch the VM reset case, which is the
one we're talking about.

>> (Plus could there be
>> stale data in the icache for this physical CPU for this VMID
>> because we've run some other vCPU on it? Or does the process
>> of rescheduling vCPUs across pCPUs and guest ASID management
>> deal with that?)
>
> we don't clear the icache for vCPUs migrating onto other pCPUs but
> invalidating the icache on a page fault won't guarantee that either.  Do
> we really need to do that?

I don't think we do, but I haven't thought through exactly
why we don't yet :-)

-- PMM

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-11 18:27                           ` Peter Maydell
@ 2015-01-11 18:38                             ` Christoffer Dall
  2015-01-12  9:58                               ` Marc Zyngier
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-11 18:38 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Marc Zyngier, kvm-devel, kvmarm

On Sun, Jan 11, 2015 at 06:27:35PM +0000, Peter Maydell wrote:
> On 11 January 2015 at 17:58, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > On Sun, Jan 11, 2015 at 05:37:52PM +0000, Peter Maydell wrote:
> >> On 11 January 2015 at 12:33, Christoffer Dall
> >> <christoffer.dall@linaro.org> wrote:
> >> > On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
> >> >> But implementations are allowed to hit in the cache even
> >> >> when the cache is disabled. In particular, setting the guest
> >> >
> >> > But how can it hit anything when the icache for the used VMID is
> >> > guaranteed to be clear (maybe that requires another full icache
> >> > invalidate for that VMID for PSCI reset)?
> >>
> >> The point is that at the moment we don't do anything to
> >> guarantee that we've cleared the icache.
> >
> > that's not entirely accurate, I assume all of the icache is
> > invalidated/cleaned at system bring-up time, and every time we re-use a
> > VMID (when we start a VMID rollover) we invalidate the entire icache.
> 
> Right, but that doesn't catch the VM reset case, which is the
> one we're talking about.
> 

ok, so that's what you meant by warm reset, I see.

Then I would think we should add that single invalidate on vcpu init
rather than flushing the icache on every page fault?

> >> (Plus could there be
> >> stale data in the icache for this physical CPU for this VMID
> >> because we've run some other vCPU on it? Or does the process
> >> of rescheduling vCPUs across pCPUs and guest ASID management
> >> deal with that?)
> >
> > we don't clear the icache for vCPUs migrating onto other pCPUs but
> > invalidating the icache on a page fault won't guarantee that either.  Do
> > we really need to do that?
> 
> I don't think we do, but I haven't thought through exactly
> why we don't yet :-)
> 

So once you start a secondary vCPU, that one can then hit in the icache
from what the primary vCPU put there, which I guess is different behavior
from a physical secondary core coming out of reset with the MMU off and
never hitting the icache, right?

And is this not also a different behavior from a native system once the
vCPUs have turned their MMUs on, but we just don't happen to observe it
as being a problem?

In any case, I don't have a great solution for how to solve this except
for always invalidating the icache when we migrate a vCPU to a pCPU, but
that's really nasty...

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-11 18:38                             ` Christoffer Dall
@ 2015-01-12  9:58                               ` Marc Zyngier
  2015-01-12 20:10                                 ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-12  9:58 UTC (permalink / raw)
  To: Christoffer Dall, Peter Maydell; +Cc: kvm-devel, kvmarm

On 11/01/15 18:38, Christoffer Dall wrote:
> On Sun, Jan 11, 2015 at 06:27:35PM +0000, Peter Maydell wrote:
>> On 11 January 2015 at 17:58, Christoffer Dall
>> <christoffer.dall@linaro.org> wrote:
>>> On Sun, Jan 11, 2015 at 05:37:52PM +0000, Peter Maydell wrote:
>>>> On 11 January 2015 at 12:33, Christoffer Dall
>>>> <christoffer.dall@linaro.org> wrote:
>>>>> On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
>>>>>> But implementations are allowed to hit in the cache even
>>>>>> when the cache is disabled. In particular, setting the guest
>>>>>
>>>>> But how can it hit anything when the icache for the used VMID is
>>>>> guaranteed to be clear (maybe that requires another full icache
>>>>> invalidate for that VMID for PSCI reset)?
>>>>
>>>> The point is that at the moment we don't do anything to
>>>> guarantee that we've cleared the icache.
>>>
>>> that's not entirely accurate, I assume all of the icache is
>>> invalidated/cleaned at system bring-up time, and every time we re-use a
>>> VMID (when we start a VMID rollover) we invalidate the entire icache.
>>
>> Right, but that doesn't catch the VM reset case, which is the
>> one we're talking about.
>>
> 
> ok, so that's what you meant by warm reset, I see.
> 
> Then I would think we should add that single invalidate on vcpu init
> rather than flushing the icache on every page fault?
> 
>>>> (Plus could there be
>>>> stale data in the icache for this physical CPU for this VMID
>>>> because we've run some other vCPU on it? Or does the process
>>>> of rescheduling vCPUs across pCPUs and guest ASID management
>>>> deal with that?)
>>>
>>> we don't clear the icache for vCPUs migrating onto other pCPUs but
>>> invalidating the icache on a page fault won't guarantee that either.  Do
>>> we really need to do that?
>>
>> I don't think we do, but I haven't thought through exactly
>> why we don't yet :-)
>>
> 
> So once you start a secondary vCPU that one can then hit in the icache
> from what the primary vCPU put there which I guess is different behavior
> from a physical secondary core coming out of reset with the MMU off and
> never hitting the icache, right?
> 
> And is this not also a different behavior from a native system once the
> vCPUs have turned their MMUs on, but we just don't happen to observe it
> as being a problem?
> 
> In any case, I don't have a great solution for how to solve this except
> for always invalidating the icache when we migrate a vCPU to a pCPU, but
> that's really nasty...

No, it only needs to happen once per vcpu, on any CPU. IC IALLUIS is
broadcast across CPUs, so once it has taken place on the first CPU this
vcpu runs on, we're good.
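
Something along these lines is what I have in mind (completely
untested sketch; hooking it into kvm_reset_vcpu() is just one possible
place for the once-per-vcpu invalidate, not necessarily the final
one):

#include <linux/kvm_host.h>
#include <asm/cacheflush.h>	/* __flush_icache_all() */

int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
{
	/* ... existing per-vcpu register/sysreg reset ... */

	/*
	 * Invalidate the whole icache once for this vcpu. IC IALLUIS
	 * is broadcast to the inner shareable domain, so whichever
	 * physical CPU this vcpu first runs on will also be rid of
	 * any stale lines left over from a previous life of the VM
	 * (the warm reset case discussed above).
	 */
	__flush_icache_all();

	return 0;
}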

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify
  2015-01-08 19:00   ` Andrea Arcangeli
@ 2015-01-12 10:15     ` Steve Capper
  0 siblings, 0 replies; 45+ messages in thread
From: Steve Capper @ 2015-01-12 10:15 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Marc Zyngier, Christoffer Dall, kvm, kvmarm

On Thu, Jan 08, 2015 at 08:00:45PM +0100, Andrea Arcangeli wrote:
> On Thu, Jan 08, 2015 at 11:59:06AM +0000, Marc Zyngier wrote:
> > From: Steve Capper <steve.capper@linaro.org>
> > 
> > ptep_clear_flush_young_notify and pmdp_clear_flush_young_notify both
> > call the notifiers *after* the pte/pmd has been made young.
> > 
> 
> On x86 on EPT without hardware access bit (!shadow_accessed_mask),
> we'll trigger a KVM page fault (gup_fast) which would mark the page
> referenced to give it higher priority in the LRU (or set the accessed
> bit if it's a THP).
> 
> If we drop the KVM shadow pagetable before clearing the accessed bit
> in the linux pte, there's a window where we won't set the young bit
> for THP. For non-THP it's less of an issue because gup_fast calls
> mark_page_accessed which rolls the lrus and sets the referenced bit in
> the struct page, so the effect of mark_page_accessed doesn't get
> lost when the linux pte accessed bit is cleared.
> 
> We could also consider using mark_page_accessed in
> follow_trans_huge_pmd to workaround the problem.  I think setting the
> young bit in gup_fast is correct and would be more similar to a real
> CPU access (which is what gup_fast simulates anyway) so below patch
> literally is introducing a race condition even if it's going to be
> lost in the noise and it's not a problem.
> 
> > This can cause problems with KVM that relies on being able to block
> > MMU notifiers when carrying out maintenance of second stage
> > descriptors.
> > 
> > This patch ensures that the MMU notifiers are called before ptes and
> > pmds are made old.
> 
> Unfortunately I don't understand why this is needed.
> 
> The only difference this can make to KVM is that without the patch,
> kvm_age_rmapp is called while the linux pte is less likely to have the
> accessed bit set (current behavior). It can still be set by hardware
> through another CPU touching the memory before the mmu notifier is
> invoked.
> 
> With the patch the linux pte is more likely to have the accessed bit
> set as it's not cleared before calling the mmu notifier.
> 
> In both cases (at least in x86 where the accessed bit is always set in
> hardware) the accessed bit may or may not be set. The pte can not
> otherwise change as it's called with the PT lock.
> 
> So again it looks a noop and it introduces a mostly theoretical race
> condition for THP young bit in the linux pte with EPT and
> !shadow_accessed_mask.
> 
> Clearly there must be some arm obscure detail I'm not aware of that
> makes this helpful but the description in commit header isn't enough
> to get what's up with blocking mmu notifiers or such. Could you
> elaborate?

Hi Andrea,
A big thank you for going through this patch in detail.

Having had another think about this, apologies: I think I made a
mistake in formulating this patch to try and fix an issue we were
experiencing on arm64 (which has since been fixed).

I think this patch can be safely ignored, especially as the
kvm_mmu_notifier_clear_flush_young notifier does very little on arm64
(because kvm_age_hva returns 0, due to a lack of shadow pagetables).

Sorry for the noise.

Cheers,
-- 
Steve

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
  2015-01-11 12:30       ` Christoffer Dall
@ 2015-01-12 11:15         ` Marc Zyngier
  2015-01-12 20:13           ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-12 11:15 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: kvm, kvmarm, Steve Capper

On 11/01/15 12:30, Christoffer Dall wrote:
> On Fri, Jan 09, 2015 at 02:35:49PM +0000, Marc Zyngier wrote:
>> On 09/01/15 12:30, Christoffer Dall wrote:
>>> On Thu, Jan 08, 2015 at 11:59:08AM +0000, Marc Zyngier wrote:
>>>> Let's assume a guest has created an uncached mapping, and written
>>>> to that page. Let's also assume that the host uses a cache-coherent
>>>> IO subsystem. Let's finally assume that the host is under memory
>>>> pressure and starts to swap things out.
>>>>
>>>> Before this "uncached" page is evicted, we need to make sure it
>>>> gets invalidated, or the IO subsystem is going to swap out the
>>>> cached view, loosing the data that has been written there.
>>>>
>>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>>>> ---
>>>>  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++
>>>>  arch/arm/kvm/mmu.c               | 46 +++++++++++++++++++++++++++-------------
>>>>  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++
>>>>  3 files changed, 80 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>>>> index 63e0ecc..7ceb836 100644
>>>> --- a/arch/arm/include/asm/kvm_mmu.h
>>>> +++ b/arch/arm/include/asm/kvm_mmu.h
>>>> @@ -44,6 +44,7 @@
>>>>  
>>>>  #ifndef __ASSEMBLY__
>>>>  
>>>> +#include <linux/highmem.h>
>>>>  #include <asm/cacheflush.h>
>>>>  #include <asm/pgalloc.h>
>>>>  
>>>> @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
>>>>  
>>>>  #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))
>>>>  
>>>> +static inline void __kvm_flush_dcache_pte(pte_t pte)
>>>> +{
>>>> +	void *va = kmap_atomic(pte_page(pte));
>>>> +
>>>> +	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
>>>> +
>>>> +	kunmap_atomic(va);
>>>> +}
>>>> +
>>>> +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
>>>> +{
>>>> +	unsigned long size = PMD_SIZE;
>>>> +	pfn_t pfn = pmd_pfn(pmd);
>>>> +
>>>> +	while (size) {
>>>> +		void *va = kmap_atomic_pfn(pfn);
>>>> +
>>>> +		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
>>>> +
>>>> +		pfn++;
>>>> +		size -= PAGE_SIZE;
>>>> +
>>>> +		kunmap_atomic(va);
>>>> +	}
>>>> +}
>>>> +
>>>> +static inline void __kvm_flush_dcache_pud(pud_t pud)
>>>> +{
>>>> +}
>>>> +
>>>>  void stage2_flush_vm(struct kvm *kvm);
>>>>  
>>>>  #endif	/* !__ASSEMBLY__ */
>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>> index 1dc9778..1f5b793 100644
>>>> --- a/arch/arm/kvm/mmu.c
>>>> +++ b/arch/arm/kvm/mmu.c
>>>> @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>>>>  		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
>>>>  }
>>>>  
>>>> +static void kvm_flush_dcache_pte(pte_t pte)
>>>> +{
>>>> +	__kvm_flush_dcache_pte(pte);
>>>> +}
>>>> +
>>>> +static void kvm_flush_dcache_pmd(pmd_t pmd)
>>>> +{
>>>> +	__kvm_flush_dcache_pmd(pmd);
>>>> +}
>>>> +
>>>> +static void kvm_flush_dcache_pud(pud_t pud)
>>>> +{
>>>> +	__kvm_flush_dcache_pud(pud);
>>>> +}
>>>> +
>>>>  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
>>>>  				  int min, int max)
>>>>  {
>>>> @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
>>>>  	start_pte = pte = pte_offset_kernel(pmd, addr);
>>>>  	do {
>>>>  		if (!pte_none(*pte)) {
>>>> +			pte_t old_pte = *pte;
>>>>  			kvm_set_pte(pte, __pte(0));
>>>> -			put_page(virt_to_page(pte));
>>>
>>> was this a bug beforehand that we released the page before we flushed
>>> the TLB?
>>
>> I don't think so. TLB maintenance doesn't require the mapping to exist
>> in the page tables (while the cache maintenance do).
>>
> 
> duh, the put_page is the ref counting on the page table itself, not the
> underlying memory page.  Forget what I asked.
> 
>>>>  			kvm_tlb_flush_vmid_ipa(kvm, addr);
>>>> +			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
>>>> +				kvm_flush_dcache_pte(old_pte);
>>>
>>> this is confusing me: We are only flushing the cache for cached stage-2
>>> mappings?  Weren't we trying to flush the cache for uncached mappings?
>>> (we obviously also need flush a cached stage-2 mapping but where the
>>> guest is mapping it as uncached, but we don't know that easily).
>>>
>>> Am I missing something completely here?
>>
>> I think you must be misreading something:
>> - we want to invalidate mappings because the guest may have created an
>> uncached mapping
> 
> I don't quite understand: we are invalidating mappings because the page
> is being swapped out (and the guest must fault if it tries to access it
> again).  Not because the guest may have created an uncached mapping,
> that's just an aspect of the situation.  Or am I thinking about this the
> wrong way?

My wording was quite ambiguous. Indeed, we're removing the mapping
because the page is being evicted. In a perfect and ideal world (where
the guest doesn't do anything silly and everything is cache coherent),
we shouldn't have to do anything cache-wise. But see below.

>> - as we cannot know about the guest's uncached mappings, we flush things
>> unconditionally (basically anything that is RAM).
> 
> so isn't the problem that the host may have an invalid cached view of
> the data, so we need to invalidate that view, not flush the invalid data
> to RAM?  Does the kernel take care of that somewhere else for a
> cache-coherent IO subsystem?

There are two potential problems:

- The guest has created an uncached mapping, and you have a cache
coherent IO subsystem: we need to invalidate the cached view. But since
we don't know where that mapping is (as we don't track guest mappings),
we must do a clean+invalidate in order not to corrupt cached mappings.

- The guest has cached mappings (that's the usual case), and the IO
subsystem is not cache-coherent. In this case, the kernel knows about
this and will do the clean operation for us.

The main problem is that we cannot identify either of these two cases:
we don't know whether the IO path is coherent or not, and we don't
know about the guest's uncached mappings. The only safe thing to do is
to perform the clean+invalidate, always.
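
To make this concrete, here is the shape of the unmap path after patch
3/4 (the code is the posted unmap_ptes() hunk, lightly trimmed; the
comments are only me spelling out the reasoning, they are not part of
the patch):

		if (!pte_none(*pte)) {
			pte_t old_pte = *pte;

			kvm_set_pte(pte, __pte(0));
			kvm_tlb_flush_vmid_ipa(kvm, addr);

			/*
			 * Anything that is RAM gets a clean+invalidate
			 * to PoC: we can't tell whether the guest mapped
			 * it cacheable or not, nor whether the host IO
			 * path is coherent, so this is the only always
			 * safe option. Device mappings (PAGE_S2_DEVICE)
			 * are skipped, as there is nothing cacheable to
			 * maintain for them.
			 */
			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
				kvm_flush_dcache_pte(old_pte);

			/* refcounting of the page table page elided here */
		}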

I wish we had a way to identify uncached mappings; that would save us
a lot of over-maintenance and some very tricky games we play at the
MMU level.

Does the above make things clearer?

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-12  9:58                               ` Marc Zyngier
@ 2015-01-12 20:10                                 ` Christoffer Dall
  2015-01-13 11:38                                   ` Marc Zyngier
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-12 20:10 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Peter Maydell, kvm-devel, kvmarm

On Mon, Jan 12, 2015 at 09:58:30AM +0000, Marc Zyngier wrote:
> On 11/01/15 18:38, Christoffer Dall wrote:
> > On Sun, Jan 11, 2015 at 06:27:35PM +0000, Peter Maydell wrote:
> >> On 11 January 2015 at 17:58, Christoffer Dall
> >> <christoffer.dall@linaro.org> wrote:
> >>> On Sun, Jan 11, 2015 at 05:37:52PM +0000, Peter Maydell wrote:
> >>>> On 11 January 2015 at 12:33, Christoffer Dall
> >>>> <christoffer.dall@linaro.org> wrote:
> >>>>> On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
> >>>>>> But implementations are allowed to hit in the cache even
> >>>>>> when the cache is disabled. In particular, setting the guest
> >>>>>
> >>>>> But how can it hit anything when the icache for the used VMID is
> >>>>> guaranteed to be clear (maybe that requires another full icache
> >>>>> invalidate for that VMID for PSCI reset)?
> >>>>
> >>>> The point is that at the moment we don't do anything to
> >>>> guarantee that we've cleared the icache.
> >>>
> >>> that's not entirely accurate, I assume all of the icache is
> >>> invalidated/cleaned at system bring-up time, and every time we re-use a
> >>> VMID (when we start a VMID rollover) we invalidate the entire icache.
> >>
> >> Right, but that doesn't catch the VM reset case, which is the
> >> one we're talking about.
> >>
> > 
> > ok, so that's what you meant by warm reset, I see.
> > 
> > Then I would think we should add that single invalidate on vcpu init
> > rather than flushing the icache on every page fault?
> > 
> >>>> (Plus could there be
> >>>> stale data in the icache for this physical CPU for this VMID
> >>>> because we've run some other vCPU on it? Or does the process
> >>>> of rescheduling vCPUs across pCPUs and guest ASID management
> >>>> deal with that?)
> >>>
> >>> we don't clear the icache for vCPUs migrating onto other pCPUs but
> >>> invalidating the icache on a page fault won't guarantee that either.  Do
> >>> we really need to do that?
> >>
> >> I don't think we do, but I haven't thought through exactly
> >> why we don't yet :-)
> >>
> > 
> > So once you start a secondary vCPU that one can then hit in the icache
> > from what the primary vCPU put there which I guess is different behavior
> > from a physical secondary core coming out of reset with the MMU off and
> > never hitting the icache, right?
> > 
> > And is this not also a different behavior from a native system once the
> > vCPUs have turned their MMUs on, but we just don't happen to observe it
> > as being a problem?
> > 
> > In any case, I don't have a great solution for how to solve this except
> > for always invalidating the icache when we migrate a vCPU to a pCPU, but
> > that's really nasty...
> 
> No, it only needs to happen once per vcpu, on any CPU. IC IALLUIS is
> broadcast across CPUs, so once it has taken place on the first CPU this
> vcpu runs on, we're good.
> 
But if you compare strictly to a native system, wouldn't a vCPU be able
to hit in the icache suddenly if migrated onto a pCPU that has run code
for the same VM (with the same VMID) without having turned the MMU on?

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
  2015-01-12 11:15         ` Marc Zyngier
@ 2015-01-12 20:13           ` Christoffer Dall
  2015-01-13 13:47             ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-12 20:13 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvm, kvmarm, Steve Capper

On Mon, Jan 12, 2015 at 11:15:34AM +0000, Marc Zyngier wrote:
> On 11/01/15 12:30, Christoffer Dall wrote:
> > On Fri, Jan 09, 2015 at 02:35:49PM +0000, Marc Zyngier wrote:
> >> On 09/01/15 12:30, Christoffer Dall wrote:
> >>> On Thu, Jan 08, 2015 at 11:59:08AM +0000, Marc Zyngier wrote:
> >>>> Let's assume a guest has created an uncached mapping, and written
> >>>> to that page. Let's also assume that the host uses a cache-coherent
> >>>> IO subsystem. Let's finally assume that the host is under memory
> >>>> pressure and starts to swap things out.
> >>>>
> >>>> Before this "uncached" page is evicted, we need to make sure it
> >>>> gets invalidated, or the IO subsystem is going to swap out the
> >>>> cached view, loosing the data that has been written there.
> >>>>
> >>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> >>>> ---
> >>>>  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++
> >>>>  arch/arm/kvm/mmu.c               | 46 +++++++++++++++++++++++++++-------------
> >>>>  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++
> >>>>  3 files changed, 80 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> >>>> index 63e0ecc..7ceb836 100644
> >>>> --- a/arch/arm/include/asm/kvm_mmu.h
> >>>> +++ b/arch/arm/include/asm/kvm_mmu.h
> >>>> @@ -44,6 +44,7 @@
> >>>>  
> >>>>  #ifndef __ASSEMBLY__
> >>>>  
> >>>> +#include <linux/highmem.h>
> >>>>  #include <asm/cacheflush.h>
> >>>>  #include <asm/pgalloc.h>
> >>>>  
> >>>> @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
> >>>>  
> >>>>  #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))
> >>>>  
> >>>> +static inline void __kvm_flush_dcache_pte(pte_t pte)
> >>>> +{
> >>>> +	void *va = kmap_atomic(pte_page(pte));
> >>>> +
> >>>> +	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> >>>> +
> >>>> +	kunmap_atomic(va);
> >>>> +}
> >>>> +
> >>>> +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
> >>>> +{
> >>>> +	unsigned long size = PMD_SIZE;
> >>>> +	pfn_t pfn = pmd_pfn(pmd);
> >>>> +
> >>>> +	while (size) {
> >>>> +		void *va = kmap_atomic_pfn(pfn);
> >>>> +
> >>>> +		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> >>>> +
> >>>> +		pfn++;
> >>>> +		size -= PAGE_SIZE;
> >>>> +
> >>>> +		kunmap_atomic(va);
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static inline void __kvm_flush_dcache_pud(pud_t pud)
> >>>> +{
> >>>> +}
> >>>> +
> >>>>  void stage2_flush_vm(struct kvm *kvm);
> >>>>  
> >>>>  #endif	/* !__ASSEMBLY__ */
> >>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> >>>> index 1dc9778..1f5b793 100644
> >>>> --- a/arch/arm/kvm/mmu.c
> >>>> +++ b/arch/arm/kvm/mmu.c
> >>>> @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
> >>>>  		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
> >>>>  }
> >>>>  
> >>>> +static void kvm_flush_dcache_pte(pte_t pte)
> >>>> +{
> >>>> +	__kvm_flush_dcache_pte(pte);
> >>>> +}
> >>>> +
> >>>> +static void kvm_flush_dcache_pmd(pmd_t pmd)
> >>>> +{
> >>>> +	__kvm_flush_dcache_pmd(pmd);
> >>>> +}
> >>>> +
> >>>> +static void kvm_flush_dcache_pud(pud_t pud)
> >>>> +{
> >>>> +	__kvm_flush_dcache_pud(pud);
> >>>> +}
> >>>> +
> >>>>  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
> >>>>  				  int min, int max)
> >>>>  {
> >>>> @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
> >>>>  	start_pte = pte = pte_offset_kernel(pmd, addr);
> >>>>  	do {
> >>>>  		if (!pte_none(*pte)) {
> >>>> +			pte_t old_pte = *pte;
> >>>>  			kvm_set_pte(pte, __pte(0));
> >>>> -			put_page(virt_to_page(pte));
> >>>
> >>> was this a bug beforehand that we released the page before we flushed
> >>> the TLB?
> >>
> >> I don't think so. TLB maintenance doesn't require the mapping to exist
> >> in the page tables (while the cache maintenance do).
> >>
> > 
> > duh, the put_page is the ref counting on the page table itself, not the
> > underlying memory page.  Forget what I asked.
> > 
> >>>>  			kvm_tlb_flush_vmid_ipa(kvm, addr);
> >>>> +			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
> >>>> +				kvm_flush_dcache_pte(old_pte);
> >>>
> >>> this is confusing me: We are only flushing the cache for cached stage-2
> >>> mappings?  Weren't we trying to flush the cache for uncached mappings?
> >>> (we obviously also need flush a cached stage-2 mapping but where the
> >>> guest is mapping it as uncached, but we don't know that easily).
> >>>
> >>> Am I missing something completely here?
> >>
> >> I think you must be misreading something:
> >> - we want to invalidate mappings because the guest may have created an
> >> uncached mapping
> > 
> > I don't quite understand: we are invalidating mappings because the page
> > is being swapped out (and the guest must fault if it tries to access it
> > again).  Not because the guest may have created an uncached mapping,
> > that's just an aspect of the situation.  Or am I thinking about this the
> > wrong way?
> 
> My wording was quite ambiguous. Indeed, we're removing the mapping
> because the page is being evicted. In a perfect and ideal world (where
> the guest doesn't do anything silly and everything is cache coherent),
> we shouldn't have to do anything cache wise. But see below.
> 
> >> - as we cannot know about the guest's uncached mappings, we flush things
> >> unconditionally (basically anything that is RAM).
> > 
> > so isn't the problem that the host may have an invalid cached view of
> > the data, so we need to invalidate that view, not flush the invalid data
> > to RAM?  Does the kernel take care of that somewhere else for a
> > cache-coherent IO subsystem?
> 
> There is two potential problems:
> 
> - The guest has created an uncached mapping, and you have a cache
> coherent IO subsystem: we need to invalidate the cached view. But since
> we don't know where that mapping is (as we don't track guest mappings),
> we must do a clean+invalidate in order not to corrupt cached mappings.

so here you have:

CACHE=dirty random old data
RAM=data that the guest wrote

now your IO subsystem will read from the cache, which is bad.  But when
you do a clean+invalidate, you get:

CACHE=invalidated
RAM=dirty random old data

I still don't get it...

> 
> - The guest has cached mappings (that's the usual case), and the IO
> subsystem is not cache-coherent. In this case, the kernel knows about
> this and will do the clean operation for us.
> 
> The main problem is that we cannot identify any of these two cases: we
> don't know if the IO path is coherent or not, and we don't know about
> uncached mappings. The only safe thing to do is to perform the
> clean+invalidate, always.
> 
> I wish we had a way to identify uncached mappings, That would save us a
> lot of over-maintenance and some very tricky games we play at the MMU level.
> 
> Does the above make things clearer?
> 
The clarification about the host doing the cleaning for us on a
non-coherent cache system is perfectly clear, but as I say above, I
still don't quite get the other case.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-12 20:10                                 ` Christoffer Dall
@ 2015-01-13 11:38                                   ` Marc Zyngier
  2015-01-13 12:04                                     ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Marc Zyngier @ 2015-01-13 11:38 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Peter Maydell, kvm-devel, kvmarm

On 12/01/15 20:10, Christoffer Dall wrote:
> On Mon, Jan 12, 2015 at 09:58:30AM +0000, Marc Zyngier wrote:
>> On 11/01/15 18:38, Christoffer Dall wrote:
>>> On Sun, Jan 11, 2015 at 06:27:35PM +0000, Peter Maydell wrote:
>>>> On 11 January 2015 at 17:58, Christoffer Dall
>>>> <christoffer.dall@linaro.org> wrote:
>>>>> On Sun, Jan 11, 2015 at 05:37:52PM +0000, Peter Maydell wrote:
>>>>>> On 11 January 2015 at 12:33, Christoffer Dall
>>>>>> <christoffer.dall@linaro.org> wrote:
>>>>>>> On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
>>>>>>>> But implementations are allowed to hit in the cache even
>>>>>>>> when the cache is disabled. In particular, setting the guest
>>>>>>>
>>>>>>> But how can it hit anything when the icache for the used VMID is
>>>>>>> guaranteed to be clear (maybe that requires another full icache
>>>>>>> invalidate for that VMID for PSCI reset)?
>>>>>>
>>>>>> The point is that at the moment we don't do anything to
>>>>>> guarantee that we've cleared the icache.
>>>>>
>>>>> that's not entirely accurate, I assume all of the icache is
>>>>> invalidated/cleaned at system bring-up time, and every time we re-use a
>>>>> VMID (when we start a VMID rollover) we invalidate the entire icache.
>>>>
>>>> Right, but that doesn't catch the VM reset case, which is the
>>>> one we're talking about.
>>>>
>>>
>>> ok, so that's what you meant by warm reset, I see.
>>>
>>> Then I would think we should add that single invalidate on vcpu init
>>> rather than flushing the icache on every page fault?
>>>
>>>>>> (Plus could there be
>>>>>> stale data in the icache for this physical CPU for this VMID
>>>>>> because we've run some other vCPU on it? Or does the process
>>>>>> of rescheduling vCPUs across pCPUs and guest ASID management
>>>>>> deal with that?)
>>>>>
>>>>> we don't clear the icache for vCPUs migrating onto other pCPUs but
>>>>> invalidating the icache on a page fault won't guarantee that either.  Do
>>>>> we really need to do that?
>>>>
>>>> I don't think we do, but I haven't thought through exactly
>>>> why we don't yet :-)
>>>>
>>>
>>> So once you start a secondary vCPU that one can then hit in the icache
>>> from what the primary vCPU put there which I guess is different behavior
>>> from a physical secondary core coming out of reset with the MMU off and
>>> never hitting the icache, right?
>>>
>>> And is this not also a different behavior from a native system once the
>>> vCPUs have turned their MMUs on, but we just don't happen to observe it
>>> as being a problem?
>>>
>>> In any case, I don't have a great solution for how to solve this except
>>> for always invalidating the icache when we migrate a vCPU to a pCPU, but
>>> that's really nasty...
>>
>> No, it only needs to happen once per vcpu, on any CPU. IC IALLUIS is
>> broadcast across CPUs, so once it has taken place on the first CPU this
>> vcpu runs on, we're good.
>>
> But if you compare strictly to a native system, wouldn't a vCPU be able
> to hit in the icache suddenly if migrated onto a pCPU that has run code
> for the same VM (with the VMID) wihtout having tuned the MMU on?

Hmmm. Yes. ASID-tagged VIVT icaches are really turning my brain into
jelly, and that's not a pretty thing to see.

So we're back to your initial approach: Each time a vcpu is migrated to
another CPU while its MMU/icache is off, we nuke the icache.
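
For the record, the sketch I have in mind is tiny (untested, and
kvm_vcpu_stage1_mmu_off() is a made-up placeholder for the real check
of the guest's SCTLR state):

#include <linux/kvm_host.h>
#include <asm/cacheflush.h>

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	/*
	 * If the guest still runs with its MMU/icache off and this
	 * vcpu last ran on a different physical CPU, an ASID-tagged
	 * VIVT icache could now return stale lines, so nuke it.
	 * kvm_vcpu_stage1_mmu_off() is a placeholder here.
	 */
	if (vcpu->cpu != cpu && kvm_vcpu_stage1_mmu_off(vcpu))
		__flush_icache_all();

	vcpu->cpu = cpu;
	/* ... rest of the existing load path ... */
}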

Do we want that in now, or do we keep that for later, when we actually
see such an implementation?

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-13 11:38                                   ` Marc Zyngier
@ 2015-01-13 12:04                                     ` Christoffer Dall
  2015-01-13 12:12                                       ` Peter Maydell
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-13 12:04 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: Peter Maydell, kvm-devel, kvmarm

On Tue, Jan 13, 2015 at 11:38:54AM +0000, Marc Zyngier wrote:
> On 12/01/15 20:10, Christoffer Dall wrote:
> > On Mon, Jan 12, 2015 at 09:58:30AM +0000, Marc Zyngier wrote:
> >> On 11/01/15 18:38, Christoffer Dall wrote:
> >>> On Sun, Jan 11, 2015 at 06:27:35PM +0000, Peter Maydell wrote:
> >>>> On 11 January 2015 at 17:58, Christoffer Dall
> >>>> <christoffer.dall@linaro.org> wrote:
> >>>>> On Sun, Jan 11, 2015 at 05:37:52PM +0000, Peter Maydell wrote:
> >>>>>> On 11 January 2015 at 12:33, Christoffer Dall
> >>>>>> <christoffer.dall@linaro.org> wrote:
> >>>>>>> On Fri, Jan 09, 2015 at 03:28:58PM +0000, Peter Maydell wrote:
> >>>>>>>> But implementations are allowed to hit in the cache even
> >>>>>>>> when the cache is disabled. In particular, setting the guest
> >>>>>>>
> >>>>>>> But how can it hit anything when the icache for the used VMID is
> >>>>>>> guaranteed to be clear (maybe that requires another full icache
> >>>>>>> invalidate for that VMID for PSCI reset)?
> >>>>>>
> >>>>>> The point is that at the moment we don't do anything to
> >>>>>> guarantee that we've cleared the icache.
> >>>>>
> >>>>> that's not entirely accurate, I assume all of the icache is
> >>>>> invalidated/cleaned at system bring-up time, and every time we re-use a
> >>>>> VMID (when we start a VMID rollover) we invalidate the entire icache.
> >>>>
> >>>> Right, but that doesn't catch the VM reset case, which is the
> >>>> one we're talking about.
> >>>>
> >>>
> >>> ok, so that's what you meant by warm reset, I see.
> >>>
> >>> Then I would think we should add that single invalidate on vcpu init
> >>> rather than flushing the icache on every page fault?
> >>>
> >>>>>> (Plus could there be
> >>>>>> stale data in the icache for this physical CPU for this VMID
> >>>>>> because we've run some other vCPU on it? Or does the process
> >>>>>> of rescheduling vCPUs across pCPUs and guest ASID management
> >>>>>> deal with that?)
> >>>>>
> >>>>> we don't clear the icache for vCPUs migrating onto other pCPUs but
> >>>>> invalidating the icache on a page fault won't guarantee that either.  Do
> >>>>> we really need to do that?
> >>>>
> >>>> I don't think we do, but I haven't thought through exactly
> >>>> why we don't yet :-)
> >>>>
> >>>
> >>> So once you start a secondary vCPU that one can then hit in the icache
> >>> from what the primary vCPU put there which I guess is different behavior
> >>> from a physical secondary core coming out of reset with the MMU off and
> >>> never hitting the icache, right?
> >>>
> >>> And is this not also a different behavior from a native system once the
> >>> vCPUs have turned their MMUs on, but we just don't happen to observe it
> >>> as being a problem?
> >>>
> >>> In any case, I don't have a great solution for how to solve this except
> >>> for always invalidating the icache when we migrate a vCPU to a pCPU, but
> >>> that's really nasty...
> >>
> >> No, it only needs to happen once per vcpu, on any CPU. IC IALLUIS is
> >> broadcast across CPUs, so once it has taken place on the first CPU this
> >> vcpu runs on, we're good.
> >>
> > But if you compare strictly to a native system, wouldn't a vCPU be able
> > to hit in the icache suddenly if migrated onto a pCPU that has run code
> > for the same VM (with the VMID) wihtout having tuned the MMU on?
> 
> Hmmm. Yes. ASID-tagged VIVT icache are really turning my brain into
> jelly, and that's not a pretty thing to see.
> 
> So we're back to your initial approach: Each time a vcpu is migrated to
> another CPU while its MMU/icache is off, we nuke the icache.
> 
> Do we want that in now, or do we keep that for later, when we actually
> see such an implementation?
> 
I don't think we want that in there now; if we can't test it anyway,
we're likely not to get it 100% correct.

Additionally, I haven't been able to think of a reasonable guest
scenario where this breaks.  Once the guest turns on its MMU it should
deal with the necessary icache invalidation itself (I think), so we're
really talking about situations where the stage-1 MMU is off, and I
gather that mostly you'll be seeing a single core doing any heavy
lifting and then secondary cores basically coming up, only seeing valid
entries in the icache, and doing the necessary invalidate and
turn-on-the-MMU stuff.

But I haven't spent days thinking about this yet.

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-13 12:04                                     ` Christoffer Dall
@ 2015-01-13 12:12                                       ` Peter Maydell
  2015-01-13 13:35                                         ` Christoffer Dall
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2015-01-13 12:12 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, kvm-devel, kvmarm

On 13 January 2015 at 12:04, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> Additionally, I haven't been able to think of a reasonable guest
> scenario where this breaks.  Once the guest turns on its MMU it should
> deal with the necessary icache invalidation itself (I think), so we're
> really talking about situations where the stage-1 MMU is off, and I
> gather that mostly you'll be seeing a single core doing any heavy
> lifting and then secondary cores basically coming up, only seeing valid
> entries in the icache, and doing the necessary invalidat+turn on mmu
> stuff.

The trouble with that is that as the secondary comes up, before it
turns on its icache its VA->PA mapping is the identity map; whereas
the primary vCPU's VA->PA mapping is "whatever the guest kernel's
usual mapping is". If the kernel has some mapping other than identity
for the VA which is wherever the secondary-CPU-startup-to-MMU-enable
code lives (which seems quite likely), then you have potential problems.

-- PMM

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-13 12:12                                       ` Peter Maydell
@ 2015-01-13 13:35                                         ` Christoffer Dall
  2015-01-13 13:41                                           ` Peter Maydell
  2015-01-15 12:00                                           ` Mark Rutland
  0 siblings, 2 replies; 45+ messages in thread
From: Christoffer Dall @ 2015-01-13 13:35 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Marc Zyngier, kvm-devel, kvmarm

On Tue, Jan 13, 2015 at 12:12:41PM +0000, Peter Maydell wrote:
> On 13 January 2015 at 12:04, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > Additionally, I haven't been able to think of a reasonable guest
> > scenario where this breaks.  Once the guest turns on its MMU it should
> > deal with the necessary icache invalidation itself (I think), so we're
> > really talking about situations where the stage-1 MMU is off, and I
> > gather that mostly you'll be seeing a single core doing any heavy
> > lifting and then secondary cores basically coming up, only seeing valid
> > entries in the icache, and doing the necessary invalidat+turn on mmu
> > stuff.
> 
> The trouble with that is that as the secondary comes up, before it
> turns on its icache its VA->PA mapping is the identity map; whereas
> the primary vCPU's VA->PA mapping is "whatever the guest kernel's
> usual mapping is". If the kernel has some mapping other than identity
> for the VA which is wherever the secondary-CPU-startup-to-MMU-enable
> code lives (which seems quite likely), then you have potential problems.
> 
Wouldn't a guest (and I believe Linux does this) reserve ASID 0 for
additional cores and use ASID 1+++ for itself?

Or do the potential hits in the icache for a stage-1 turned-off MMU
hit on all ASIDs?

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-13 13:35                                         ` Christoffer Dall
@ 2015-01-13 13:41                                           ` Peter Maydell
  2015-01-13 13:49                                             ` Christoffer Dall
  2015-01-15 12:00                                           ` Mark Rutland
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2015-01-13 13:41 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Marc Zyngier, kvm-devel, kvmarm

On 13 January 2015 at 13:35, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> Wouldn't a guest (and I believe Linux does this) reserve ASID 0 for
> additional cores and use ASID 1+++ for itself?

If the guest reserves an ASID for "MMU disabled" then yes, that would
work. The question of course is whether all guests do that...

-- PMM

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
  2015-01-12 20:13           ` Christoffer Dall
@ 2015-01-13 13:47             ` Christoffer Dall
  2015-01-13 13:57               ` Marc Zyngier
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-13 13:47 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvm, kvmarm, Steve Capper

On Mon, Jan 12, 2015 at 09:13:52PM +0100, Christoffer Dall wrote:
> On Mon, Jan 12, 2015 at 11:15:34AM +0000, Marc Zyngier wrote:
> > On 11/01/15 12:30, Christoffer Dall wrote:
> > > On Fri, Jan 09, 2015 at 02:35:49PM +0000, Marc Zyngier wrote:
> > >> On 09/01/15 12:30, Christoffer Dall wrote:
> > >>> On Thu, Jan 08, 2015 at 11:59:08AM +0000, Marc Zyngier wrote:
> > >>>> Let's assume a guest has created an uncached mapping, and written
> > >>>> to that page. Let's also assume that the host uses a cache-coherent
> > >>>> IO subsystem. Let's finally assume that the host is under memory
> > >>>> pressure and starts to swap things out.
> > >>>>
> > >>>> Before this "uncached" page is evicted, we need to make sure it
> > >>>> gets invalidated, or the IO subsystem is going to swap out the
> > >>>> cached view, loosing the data that has been written there.
> > >>>>
> > >>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
> > >>>> ---
> > >>>>  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++
> > >>>>  arch/arm/kvm/mmu.c               | 46 +++++++++++++++++++++++++++-------------
> > >>>>  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++
> > >>>>  3 files changed, 80 insertions(+), 15 deletions(-)
> > >>>>
> > >>>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> > >>>> index 63e0ecc..7ceb836 100644
> > >>>> --- a/arch/arm/include/asm/kvm_mmu.h
> > >>>> +++ b/arch/arm/include/asm/kvm_mmu.h
> > >>>> @@ -44,6 +44,7 @@
> > >>>>  
> > >>>>  #ifndef __ASSEMBLY__
> > >>>>  
> > >>>> +#include <linux/highmem.h>
> > >>>>  #include <asm/cacheflush.h>
> > >>>>  #include <asm/pgalloc.h>
> > >>>>  
> > >>>> @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
> > >>>>  
> > >>>>  #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))
> > >>>>  
> > >>>> +static inline void __kvm_flush_dcache_pte(pte_t pte)
> > >>>> +{
> > >>>> +	void *va = kmap_atomic(pte_page(pte));
> > >>>> +
> > >>>> +	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> > >>>> +
> > >>>> +	kunmap_atomic(va);
> > >>>> +}
> > >>>> +
> > >>>> +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
> > >>>> +{
> > >>>> +	unsigned long size = PMD_SIZE;
> > >>>> +	pfn_t pfn = pmd_pfn(pmd);
> > >>>> +
> > >>>> +	while (size) {
> > >>>> +		void *va = kmap_atomic_pfn(pfn);
> > >>>> +
> > >>>> +		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
> > >>>> +
> > >>>> +		pfn++;
> > >>>> +		size -= PAGE_SIZE;
> > >>>> +
> > >>>> +		kunmap_atomic(va);
> > >>>> +	}
> > >>>> +}
> > >>>> +
> > >>>> +static inline void __kvm_flush_dcache_pud(pud_t pud)
> > >>>> +{
> > >>>> +}
> > >>>> +
> > >>>>  void stage2_flush_vm(struct kvm *kvm);
> > >>>>  
> > >>>>  #endif	/* !__ASSEMBLY__ */
> > >>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> > >>>> index 1dc9778..1f5b793 100644
> > >>>> --- a/arch/arm/kvm/mmu.c
> > >>>> +++ b/arch/arm/kvm/mmu.c
> > >>>> @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
> > >>>>  		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
> > >>>>  }
> > >>>>  
> > >>>> +static void kvm_flush_dcache_pte(pte_t pte)
> > >>>> +{
> > >>>> +	__kvm_flush_dcache_pte(pte);
> > >>>> +}
> > >>>> +
> > >>>> +static void kvm_flush_dcache_pmd(pmd_t pmd)
> > >>>> +{
> > >>>> +	__kvm_flush_dcache_pmd(pmd);
> > >>>> +}
> > >>>> +
> > >>>> +static void kvm_flush_dcache_pud(pud_t pud)
> > >>>> +{
> > >>>> +	__kvm_flush_dcache_pud(pud);
> > >>>> +}
> > >>>> +
> > >>>>  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
> > >>>>  				  int min, int max)
> > >>>>  {
> > >>>> @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
> > >>>>  	start_pte = pte = pte_offset_kernel(pmd, addr);
> > >>>>  	do {
> > >>>>  		if (!pte_none(*pte)) {
> > >>>> +			pte_t old_pte = *pte;
> > >>>>  			kvm_set_pte(pte, __pte(0));
> > >>>> -			put_page(virt_to_page(pte));
> > >>>
> > >>> was this a bug beforehand that we released the page before we flushed
> > >>> the TLB?
> > >>
> > >> I don't think so. TLB maintenance doesn't require the mapping to exist
> > >> in the page tables (while the cache maintenance do).
> > >>
> > > 
> > > duh, the put_page is the ref counting on the page table itself, not the
> > > underlying memory page.  Forget what I asked.
> > > 
> > >>>>  			kvm_tlb_flush_vmid_ipa(kvm, addr);
> > >>>> +			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
> > >>>> +				kvm_flush_dcache_pte(old_pte);
> > >>>
> > >>> this is confusing me: We are only flushing the cache for cached stage-2
> > >>> mappings?  Weren't we trying to flush the cache for uncached mappings?
> > >>> (we obviously also need flush a cached stage-2 mapping but where the
> > >>> guest is mapping it as uncached, but we don't know that easily).
> > >>>
> > >>> Am I missing something completely here?
> > >>
> > >> I think you must be misreading something:
> > >> - we want to invalidate mappings because the guest may have created an
> > >> uncached mapping
> > > 
> > > I don't quite understand: we are invalidating mappings because the page
> > > is being swapped out (and the guest must fault if it tries to access it
> > > again).  Not because the guest may have created an uncached mapping,
> > > that's just an aspect of the situation.  Or am I thinking about this the
> > > wrong way?
> > 
> > My wording was quite ambiguous. Indeed, we're removing the mapping
> > because the page is being evicted. In a perfect and ideal world (where
> > the guest doesn't do anything silly and everything is cache coherent),
> > we shouldn't have to do anything cache wise. But see below.
> > 
> > >> - as we cannot know about the guest's uncached mappings, we flush things
> > >> unconditionally (basically anything that is RAM).
> > > 
> > > so isn't the problem that the host may have an invalid cached view of
> > > the data, so we need to invalidate that view, not flush the invalid data
> > > to RAM?  Does the kernel take care of that somewhere else for a
> > > cache-coherent IO subsystem?
> > 
> > There is two potential problems:
> > 
> > - The guest has created an uncached mapping, and you have a cache
> > coherent IO subsystem: we need to invalidate the cached view. But since
> > we don't know where that mapping is (as we don't track guest mappings),
> > we must do a clean+invalidate in order not to corrupt cached mappings.
> 
> so here you have:
> 
> CACHE=dirty random old data
> RAM=data that the guest wrote
> 
> now your IO subsystem will read from the cache, which is bad.  But when
> you do a clean+invalidate, you get:
> 
> CACHE=invalidated
> RAM=dirty random old data
> 
> I still don't get it...
> 
ok, so I talked to Marc earlier and he explained the details to me.  I
think this is really quite a subtle, but nevertheless important,
situation to catch:

If a guest maps certain memory pages as uncached, all writes will bypass
the data cache and go directly to RAM.  However, the CPUs can still
speculate reads (not writes) and fill cache lines with data.

Those cache lines will be *clean* cache lines though, so a
clean+invalidate operation is equivalent to an invalidate operation,
because no cache lines are marked dirty.

Those clean cache lines could be filled prior to an uncached write by
the guest, and the cache coherent IO subsystem would therefore end up
writing old data to disk.

We can therefore safely do exactly what Marc was suggesting, and we
have to do it.

I hope I finally got this right?

I think we need this explanation spelled out properly in the commit
message or as comments in the code, and I think we also need a comment
about why it's pointless to flush device memory mappings (or rather,
that that is why we do the check against PAGE_S2_DEVICE).
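
Something like the following wording, maybe (just a suggestion for
that comment, to be reworded as you see fit):

/*
 * Writes from a guest that maps a page as uncached bypass the dcache
 * and go straight to RAM, but the CPU may still speculatively fill
 * *clean* lines for that page. A cache-coherent IO subsystem would
 * then read those stale clean lines instead of RAM when the page is
 * swapped out. Since we don't track how the guest maps its memory,
 * we clean+invalidate to PoC every RAM page we unmap: for cacheable
 * guest mappings the clean simply writes back whatever dirty data is
 * there, and device mappings (PAGE_S2_DEVICE) are skipped because
 * there is nothing cacheable to maintain for them.
 */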

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-13 13:41                                           ` Peter Maydell
@ 2015-01-13 13:49                                             ` Christoffer Dall
  0 siblings, 0 replies; 45+ messages in thread
From: Christoffer Dall @ 2015-01-13 13:49 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Marc Zyngier, kvm-devel, kvmarm

On Tue, Jan 13, 2015 at 01:41:03PM +0000, Peter Maydell wrote:
> On 13 January 2015 at 13:35, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > Wouldn't a guest (and I believe Linux does this) reserve ASID 0 for
> > additional cores and use ASID 1+++ for itself?
> 
> If the guest reserves an ASID for "MMU disabled" then yes, that would
> work. The question of course is whether all guests do that...
> 
which ASID would match for MMU disabled?  Did you come across that in
the ARM ARM somewhere?

The fact that Linux does it would indicate that this may be a
requirement on real hardware too and therefore all guests have to do
it...

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap
  2015-01-13 13:47             ` Christoffer Dall
@ 2015-01-13 13:57               ` Marc Zyngier
  0 siblings, 0 replies; 45+ messages in thread
From: Marc Zyngier @ 2015-01-13 13:57 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: kvm, kvmarm, Steve Capper

On 13/01/15 13:47, Christoffer Dall wrote:
> On Mon, Jan 12, 2015 at 09:13:52PM +0100, Christoffer Dall wrote:
>> On Mon, Jan 12, 2015 at 11:15:34AM +0000, Marc Zyngier wrote:
>>> On 11/01/15 12:30, Christoffer Dall wrote:
>>>> On Fri, Jan 09, 2015 at 02:35:49PM +0000, Marc Zyngier wrote:
>>>>> On 09/01/15 12:30, Christoffer Dall wrote:
>>>>>> On Thu, Jan 08, 2015 at 11:59:08AM +0000, Marc Zyngier wrote:
>>>>>>> Let's assume a guest has created an uncached mapping, and written
>>>>>>> to that page. Let's also assume that the host uses a cache-coherent
>>>>>>> IO subsystem. Let's finally assume that the host is under memory
>>>>>>> pressure and starts to swap things out.
>>>>>>>
>>>>>>> Before this "uncached" page is evicted, we need to make sure it
>>>>>>> gets invalidated, or the IO subsystem is going to swap out the
>>>>>>> cached view, loosing the data that has been written there.
>>>>>>>
>>>>>>> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
>>>>>>> ---
>>>>>>>  arch/arm/include/asm/kvm_mmu.h   | 31 +++++++++++++++++++++++++++
>>>>>>>  arch/arm/kvm/mmu.c               | 46 +++++++++++++++++++++++++++-------------
>>>>>>>  arch/arm64/include/asm/kvm_mmu.h | 18 ++++++++++++++++
>>>>>>>  3 files changed, 80 insertions(+), 15 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>>>>>>> index 63e0ecc..7ceb836 100644
>>>>>>> --- a/arch/arm/include/asm/kvm_mmu.h
>>>>>>> +++ b/arch/arm/include/asm/kvm_mmu.h
>>>>>>> @@ -44,6 +44,7 @@
>>>>>>>  
>>>>>>>  #ifndef __ASSEMBLY__
>>>>>>>  
>>>>>>> +#include <linux/highmem.h>
>>>>>>>  #include <asm/cacheflush.h>
>>>>>>>  #include <asm/pgalloc.h>
>>>>>>>  
>>>>>>> @@ -190,6 +191,36 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
>>>>>>>  
>>>>>>>  #define kvm_virt_to_phys(x)		virt_to_idmap((unsigned long)(x))
>>>>>>>  
>>>>>>> +static inline void __kvm_flush_dcache_pte(pte_t pte)
>>>>>>> +{
>>>>>>> +	void *va = kmap_atomic(pte_page(pte));
>>>>>>> +
>>>>>>> +	kvm_flush_dcache_to_poc(va, PAGE_SIZE);
>>>>>>> +
>>>>>>> +	kunmap_atomic(va);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline void __kvm_flush_dcache_pmd(pmd_t pmd)
>>>>>>> +{
>>>>>>> +	unsigned long size = PMD_SIZE;
>>>>>>> +	pfn_t pfn = pmd_pfn(pmd);
>>>>>>> +
>>>>>>> +	while (size) {
>>>>>>> +		void *va = kmap_atomic_pfn(pfn);
>>>>>>> +
>>>>>>> +		kvm_flush_dcache_to_poc(va, PAGE_SIZE);
>>>>>>> +
>>>>>>> +		pfn++;
>>>>>>> +		size -= PAGE_SIZE;
>>>>>>> +
>>>>>>> +		kunmap_atomic(va);
>>>>>>> +	}
>>>>>>> +}
>>>>>>> +
>>>>>>> +static inline void __kvm_flush_dcache_pud(pud_t pud)
>>>>>>> +{
>>>>>>> +}
>>>>>>> +
>>>>>>>  void stage2_flush_vm(struct kvm *kvm);
>>>>>>>  
>>>>>>>  #endif	/* !__ASSEMBLY__ */
>>>>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>>>>> index 1dc9778..1f5b793 100644
>>>>>>> --- a/arch/arm/kvm/mmu.c
>>>>>>> +++ b/arch/arm/kvm/mmu.c
>>>>>>> @@ -58,6 +58,21 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
>>>>>>>  		kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa);
>>>>>>>  }
>>>>>>>  
>>>>>>> +static void kvm_flush_dcache_pte(pte_t pte)
>>>>>>> +{
>>>>>>> +	__kvm_flush_dcache_pte(pte);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void kvm_flush_dcache_pmd(pmd_t pmd)
>>>>>>> +{
>>>>>>> +	__kvm_flush_dcache_pmd(pmd);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void kvm_flush_dcache_pud(pud_t pud)
>>>>>>> +{
>>>>>>> +	__kvm_flush_dcache_pud(pud);
>>>>>>> +}
>>>>>>> +
>>>>>>>  static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
>>>>>>>  				  int min, int max)
>>>>>>>  {
>>>>>>> @@ -128,9 +143,12 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
>>>>>>>  	start_pte = pte = pte_offset_kernel(pmd, addr);
>>>>>>>  	do {
>>>>>>>  		if (!pte_none(*pte)) {
>>>>>>> +			pte_t old_pte = *pte;
>>>>>>>  			kvm_set_pte(pte, __pte(0));
>>>>>>> -			put_page(virt_to_page(pte));
>>>>>>
>>>>>> was this a bug beforehand that we released the page before we flushed
>>>>>> the TLB?
>>>>>
>>>>> I don't think so. TLB maintenance doesn't require the mapping to exist
>>>>> in the page tables (while the cache maintenance do).
>>>>>
>>>>
>>>> duh, the put_page is the ref counting on the page table itself, not the
>>>> underlying memory page.  Forget what I asked.
>>>>
>>>>>>>  			kvm_tlb_flush_vmid_ipa(kvm, addr);
>>>>>>> +			if ((pte_val(old_pte) & PAGE_S2_DEVICE) != PAGE_S2_DEVICE)
>>>>>>> +				kvm_flush_dcache_pte(old_pte);
>>>>>>
>>>>>> this is confusing me: We are only flushing the cache for cached stage-2
>>>>>> mappings?  Weren't we trying to flush the cache for uncached mappings?
>>>>>> (we obviously also need flush a cached stage-2 mapping but where the
>>>>>> guest is mapping it as uncached, but we don't know that easily).
>>>>>>
>>>>>> Am I missing something completely here?
>>>>>
>>>>> I think you must be misreading something:
>>>>> - we want to invalidate mappings because the guest may have created an
>>>>> uncached mapping
>>>>
>>>> I don't quite understand: we are invalidating mappings because the page
>>>> is being swapped out (and the guest must fault if it tries to access it
>>>> again).  Not because the guest may have created an uncached mapping,
>>>> that's just an aspect of the situation.  Or am I thinking about this the
>>>> wrong way?
>>>
>>> My wording was quite ambiguous. Indeed, we're removing the mapping
>>> because the page is being evicted. In a perfect and ideal world (where
>>> the guest doesn't do anything silly and everything is cache coherent),
>>> we shouldn't have to do anything cache wise. But see below.
>>>
>>>>> - as we cannot know about the guest's uncached mappings, we flush things
>>>>> unconditionally (basically anything that is RAM).
>>>>
>>>> so isn't the problem that the host may have an invalid cached view of
>>>> the data, so we need to invalidate that view, not flush the invalid data
>>>> to RAM?  Does the kernel take care of that somewhere else for a
>>>> cache-coherent IO subsystem?
>>>
>>> There is two potential problems:
>>>
>>> - The guest has created an uncached mapping, and you have a cache
>>> coherent IO subsystem: we need to invalidate the cached view. But since
>>> we don't know where that mapping is (as we don't track guest mappings),
>>> we must do a clean+invalidate in order not to corrupt cached mappings.
>>
>> so here you have:
>>
>> CACHE=dirty random old data
>> RAM=data that the guest wrote
>>
>> now your IO subsystem will read from the cache, which is bad.  But when
>> you do a clean+invalidate, you get:
>>
>> CACHE=invalidated
>> RAM=dirty random old data
>>
>> I still don't get it...
>>
> ok, so talked to Marc earlier and he explained me the details.  I think
> this is really quite a subtle, but nevertheless important situation to
> catch:
> 
> If a guest maps certain memory pages as uncached, all writes will bypass
> the data cache and go directly to RAM.  However, the CPUs can still
> speculate reads (not writes) and fill cache lines with data.
> 
> Those cache lines will be *clean* cache lines though, so a
> clean+invalidate operation is equivalent to an invalidate operation,
> because no cache lines are marked dirty.
> 
> Those clean cache lines could be filled prior to an uncached write by
> the guest, and the cache coherent IO subsystem would therefore end up
> writing old data to disk.
> 
> We can therefore do exactly what Marc was suggesting to do safely, and
> we have to do it.
> 
> I hope I finally got this right?

Yes, that's exactly what I meant all this time. Thanks for translating
my messy brain dump into proper English! :-)

> I think we need this explanation properly as part of the commit message
> or as comments in the code, and I think we need to add a comment about
> why it's pointless to flush device memory mappings (or that that is why
> we do the check against PAGE_S2_DEVICE).

I'll add both comments to the patch and repost it.

Thanks again,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-13 13:35                                         ` Christoffer Dall
  2015-01-13 13:41                                           ` Peter Maydell
@ 2015-01-15 12:00                                           ` Mark Rutland
  2015-01-15 13:00                                             ` Christoffer Dall
  1 sibling, 1 reply; 45+ messages in thread
From: Mark Rutland @ 2015-01-15 12:00 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Peter Maydell, kvmarm, kvm-devel

On Tue, Jan 13, 2015 at 01:35:16PM +0000, Christoffer Dall wrote:
> On Tue, Jan 13, 2015 at 12:12:41PM +0000, Peter Maydell wrote:
> > On 13 January 2015 at 12:04, Christoffer Dall
> > <christoffer.dall@linaro.org> wrote:
> > > Additionally, I haven't been able to think of a reasonable guest
> > > scenario where this breaks.  Once the guest turns on its MMU it should
> > > deal with the necessary icache invalidation itself (I think), so we're
> > > really talking about situations where the stage-1 MMU is off, and I
> > > gather that mostly you'll be seeing a single core doing any heavy
> > > lifting and then secondary cores basically coming up, only seeing valid
> > > entries in the icache, and doing the necessary invalidat+turn on mmu
> > > stuff.
> > 
> > The trouble with that is that as the secondary comes up, before it
> > turns on its icache its VA->PA mapping is the identity map; whereas
> > the primary vCPU's VA->PA mapping is "whatever the guest kernel's
> > usual mapping is". If the kernel has some mapping other than identity
> > for the VA which is wherever the secondary-CPU-startup-to-MMU-enable
> > code lives (which seems quite likely), then you have potential problems.
> > 
> Wouldn't a guest (and I believe Linux does this) reserve ASID 0 for
> additional cores and use ASID 1+++ for itself?

Not on arm since 52af9c6cd863fe37 (ARM: 6943/1: mm: use TTBR1 instead of
reserved context ID) and 45b95235b0ac86ce (ARM: 6944/1: mm: allow ASID 0
to be allocated to tasks). The swapper_pg_dir uses global mappings since
d427958a46af24f7 (ARM: 6942/1: mm: make TTBR1 always point to
swapper_pg_dir on ARMv6/7).

Similarly on arm64 the swapper_pg_dir and idmap_pg_dir use global
mappings and we don't reserve any ASIDs for use by the kernel.
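
(If it helps, here's a minimal sketch of what "global" means at the
descriptor level, assuming the standard ARMv8 stage-1 block/page descriptor
layout and a made-up helper name: an entry with the nG bit clear hits in the
TLB regardless of the current ASID, which is why swapper_pg_dir and
idmap_pg_dir don't need a reserved ASID.)

	#include <stdbool.h>
	#include <stdint.h>

	/* nG (not-Global) is bit 11 of an ARMv8 stage-1 block/page descriptor */
	#define PTE_NG		(1ULL << 11)

	/* An entry with nG clear is global: it matches whatever ASID is current */
	static inline bool pte_is_global(uint64_t pte)
	{
		return !(pte & PTE_NG);
	}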

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-15 12:00                                           ` Mark Rutland
@ 2015-01-15 13:00                                             ` Christoffer Dall
  2015-01-15 15:47                                               ` Mark Rutland
  0 siblings, 1 reply; 45+ messages in thread
From: Christoffer Dall @ 2015-01-15 13:00 UTC (permalink / raw)
  To: Mark Rutland; +Cc: Peter Maydell, kvmarm, kvm-devel

Hi Mark,

On Thu, Jan 15, 2015 at 12:00:20PM +0000, Mark Rutland wrote:
> On Tue, Jan 13, 2015 at 01:35:16PM +0000, Christoffer Dall wrote:
> > On Tue, Jan 13, 2015 at 12:12:41PM +0000, Peter Maydell wrote:
> > > On 13 January 2015 at 12:04, Christoffer Dall
> > > <christoffer.dall@linaro.org> wrote:
> > > > Additionally, I haven't been able to think of a reasonable guest
> > > > scenario where this breaks.  Once the guest turns on its MMU it should
> > > > deal with the necessary icache invalidation itself (I think), so we're
> > > > really talking about situations where the stage-1 MMU is off, and I
> > > > gather that mostly you'll be seeing a single core doing any heavy
> > > > lifting and then secondary cores basically coming up, only seeing valid
> > > > entries in the icache, and doing the necessary invalidate + turn on MMU
> > > > stuff.
> > > 
> > > The trouble with that is that as the secondary comes up, before it
> > > turns on its icache its VA->PA mapping is the identity map; whereas
> > > the primary vCPU's VA->PA mapping is "whatever the guest kernel's
> > > usual mapping is". If the kernel has some mapping other than identity
> > > for the VA which is wherever the secondary-CPU-startup-to-MMU-enable
> > > code lives (which seems quite likely), then you have potential problems.
> > > 
> > Wouldn't a guest (and I believe Linux does this) reserve ASID 0 for
> > additional cores and use ASID 1+++ for itself?
> 
> Not on arm since 52af9c6cd863fe37 (ARM: 6943/1: mm: use TTBR1 instead of
> reserved context ID) and 45b95235b0ac86ce (ARM: 6944/1: mm: allow ASID 0
> to be allocated to tasks). The swapper_pg_dir uses global mappings since
> d427958a46af24f7 (ARM: 6942/1: mm: make TTBR1 always point to
> swapper_pg_dir on ARMv6/7).

thanks for the pointer, been a while since I looked at that code.

> 
> Similarly on arm64 the swapper_pg_dir and idmap_pg_dir use global
> mappings and we don't reserve any ASIDs for use by the kernel.
> 

Do you happen to know which ASIDs are matched in the icache when the MMU
is off?

-Christoffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault
  2015-01-15 13:00                                             ` Christoffer Dall
@ 2015-01-15 15:47                                               ` Mark Rutland
  0 siblings, 0 replies; 45+ messages in thread
From: Mark Rutland @ 2015-01-15 15:47 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: Peter Maydell, kvmarm, kvm-devel

On Thu, Jan 15, 2015 at 01:00:58PM +0000, Christoffer Dall wrote:
> Hi Mark,
> 
> On Thu, Jan 15, 2015 at 12:00:20PM +0000, Mark Rutland wrote:
> > On Tue, Jan 13, 2015 at 01:35:16PM +0000, Christoffer Dall wrote:
> > > On Tue, Jan 13, 2015 at 12:12:41PM +0000, Peter Maydell wrote:
> > > > On 13 January 2015 at 12:04, Christoffer Dall
> > > > <christoffer.dall@linaro.org> wrote:
> > > > > Additionally, I haven't been able to think of a reasonable guest
> > > > > scenario where this breaks.  Once the guest turns on its MMU it should
> > > > > deal with the necessary icache invalidation itself (I think), so we're
> > > > > really talking about situations where the stage-1 MMU is off, and I
> > > > > gather that mostly you'll be seeing a single core doing any heavy
> > > > > lifting and then secondary cores basically coming up, only seeing valid
> > > > > entries in the icache, and doing the necessary invalidate + turn on MMU
> > > > > stuff.
> > > > 
> > > > The trouble with that is that as the secondary comes up, before it
> > > > turns on its icache its VA->PA mapping is the identity map; whereas
> > > > the primary vCPU's VA->PA mapping is "whatever the guest kernel's
> > > > usual mapping is". If the kernel has some mapping other than identity
> > > > for the VA which is wherever the secondary-CPU-startup-to-MMU-enable
> > > > code lives (which seems quite likely), then you have potential problems.
> > > > 
> > > Wouldn't a guest (and I believe Linux does this) reserve ASID 0 for
> > > additional cores and use ASID 1+++ for itself?
> > 
> > Not on arm since 52af9c6cd863fe37 (ARM: 6943/1: mm: use TTBR1 instead of
> > reserved context ID) and 45b95235b0ac86ce (ARM: 6944/1: mm: allow ASID 0
> > to be allocated to tasks). The swapper_pg_dir uses global mappings since
> > d427958a46af24f7 (ARM: 6942/1: mm: make TTBR1 always point to
> > swapper_pg_dir on ARMv6/7).
> 
> thanks for the pointer, been a while since I looked at that code.
> 
> > 
> > Similarly on arm64 the swapper_pg_dir and idmap_pg_dir use global
> > mappings and we don't reserve any ASIDs for use by the kernel.
> > 
> 
> Do you happen to know which ASIDs are matched in the icache when the MMU
> is off?

Unfortunately not. I've only been able to work out as much as anyone else
here has, by reading the ARM ARM.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2015-01-15 15:48 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-08 11:59 [PATCH 0/4] arm/arm64: KVM: Random selection of MM related fixes Marc Zyngier
2015-01-08 11:59 ` [PATCH 1/4] mm: Correct ordering of *_clear_flush_young_notify Marc Zyngier
2015-01-08 13:12   ` Paolo Bonzini
2015-01-08 19:00   ` Andrea Arcangeli
2015-01-12 10:15     ` Steve Capper
2015-01-08 11:59 ` [PATCH 2/4] arm/arm64: KVM: Use set/way op trapping to track the state of the caches Marc Zyngier
2015-01-09 11:19   ` Christoffer Dall
2015-01-09 11:38     ` Marc Zyngier
2015-01-09 12:12       ` Christoffer Dall
2015-01-08 11:59 ` [PATCH 3/4] arm/arm64: KVM: Flush caches to memory on unmap Marc Zyngier
2015-01-09 12:30   ` Christoffer Dall
2015-01-09 14:35     ` Marc Zyngier
2015-01-11 12:30       ` Christoffer Dall
2015-01-12 11:15         ` Marc Zyngier
2015-01-12 20:13           ` Christoffer Dall
2015-01-13 13:47             ` Christoffer Dall
2015-01-13 13:57               ` Marc Zyngier
2015-01-08 11:59 ` [PATCH 4/4] arm/arm64: KVM: use kernel mapping to perform invalidation on page fault Marc Zyngier
2015-01-08 12:30   ` Peter Maydell
2015-01-08 13:07     ` Marc Zyngier
2015-01-08 13:16       ` Peter Maydell
2015-01-08 15:06         ` Marc Zyngier
2015-01-08 15:21           ` Peter Maydell
2015-01-09 12:50             ` Christoffer Dall
2015-01-09 13:03               ` Peter Maydell
2015-01-09 14:16                 ` Marc Zyngier
2015-01-09 15:28                   ` Peter Maydell
2015-01-09 17:18                     ` Marc Zyngier
2015-01-11 12:33                     ` Christoffer Dall
2015-01-11 17:37                       ` Peter Maydell
2015-01-11 17:58                         ` Christoffer Dall
2015-01-11 18:27                           ` Peter Maydell
2015-01-11 18:38                             ` Christoffer Dall
2015-01-12  9:58                               ` Marc Zyngier
2015-01-12 20:10                                 ` Christoffer Dall
2015-01-13 11:38                                   ` Marc Zyngier
2015-01-13 12:04                                     ` Christoffer Dall
2015-01-13 12:12                                       ` Peter Maydell
2015-01-13 13:35                                         ` Christoffer Dall
2015-01-13 13:41                                           ` Peter Maydell
2015-01-13 13:49                                             ` Christoffer Dall
2015-01-15 12:00                                           ` Mark Rutland
2015-01-15 13:00                                             ` Christoffer Dall
2015-01-15 15:47                                               ` Mark Rutland
2015-01-09 12:51   ` Christoffer Dall
