linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH v2 0/5] KVM TLB flushing improvements (for radix)
@ 2018-04-16  4:32 Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 1/5] KVM: PPC: Book3S HV: radix use correct tlbie sequence in kvmppc_radix_tlbie_page Nicholas Piggin
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-16  4:32 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Nicholas Piggin, linuxppc-dev

This series moves some of the radix mode TLB flushing into
powerpc/mm, which allows the powerpc:tlbie tracepoints to cover
KVM invalidations.

This also fixes a partition scoped page fault performance issue
that was found by looking at partition scoped tlbie traces.

Since v1:
- Fixed a bug where I mixed up PRS values, leading to guest page
  fault hangs.
- Fixed up the hash cases that still need to be done in real-mode.
- Dropped the hash changes, including the interesting case of a
  hash tlbie issued by a radix host for mixed mode support.

This has survived some stress testing over the weekend now, so
it should be ready for wider review.

Thanks,
Nick

Nicholas Piggin (5):
  KVM: PPC: Book3S HV: radix use correct tlbie sequence in
    kvmppc_radix_tlbie_page
  powerpc/mm/radix: implement LPID based TLB flushes to be used by KVM
  KVM: PPC: Book3S HV: radix use the Linux TLB flush function in
    kvmppc_radix_tlbie_page
  KVM: PPC: Book3S HV: radix handle process scoped LPID flush in C, with
    relocation on
  KVM: PPC: Book3S HV: radix do not clear partition scoped page table
    when page fault races with other vCPUs.

 .../include/asm/book3s/64/tlbflush-radix.h    |   6 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |  78 ++++-----
 arch/powerpc/kvm/book3s_hv.c                  |  26 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S       |  13 +-
 arch/powerpc/mm/tlb-radix.c                   | 160 ++++++++++++++++++
 5 files changed, 238 insertions(+), 45 deletions(-)

-- 
2.17.0


* [PATCH v2 1/5] KVM: PPC: Book3S HV: radix use correct tlbie sequence in kvmppc_radix_tlbie_page
  2018-04-16  4:32 [PATCH v2 0/5] KVM TLB flushing improvements (for radix) Nicholas Piggin
@ 2018-04-16  4:32 ` Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 2/5] powerpc/mm/radix: implement LPID based TLB flushes to be used by KVM Nicholas Piggin
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-16  4:32 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Nicholas Piggin, linuxppc-dev

The standard eieio; tlbsync; ptesync sequence must follow tlbie to
ensure it is ordered with respect to subsequent operations.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 5d9bafe9a371..81d5ad26f9a1 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -160,7 +160,7 @@ static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
 	if (cpu_has_feature(CPU_FTR_P9_TLBIE_BUG))
 		asm volatile(PPC_TLBIE_5(%0, %1, 0, 0, 1)
 			     : : "r" (addr), "r" (kvm->arch.lpid) : "memory");
-	asm volatile("ptesync": : :"memory");
+	asm volatile("eieio ; tlbsync ; ptesync": : :"memory");
 }
 
 unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
-- 
2.17.0


* [PATCH v2 2/5] powerpc/mm/radix: implement LPID based TLB flushes to be used by KVM
  2018-04-16  4:32 [PATCH v2 0/5] KVM TLB flushing improvements (for radix) Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 1/5] KVM: PPC: Book3S HV: radix use correct tlbie sequence in kvmppc_radix_tlbie_page Nicholas Piggin
@ 2018-04-16  4:32 ` Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 3/5] KVM: PPC: Book3S HV: radix use the Linux TLB flush function in kvmppc_radix_tlbie_page Nicholas Piggin
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-16  4:32 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Nicholas Piggin, linuxppc-dev

Implement a local TLB flush for invalidating an LPID, with variants
for process or partition scope, and a global TLB flush for
invalidating a partition scoped page of an LPID.

These will be used by KVM in subsequent patches.
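As a rough, non-authoritative sketch of the tlbiel operand encoding the new helpers use (PPC_BIT/PPC_BITLSHIFT modeled here with the usual IBM big-endian bit numbering; the IS and set-index placements are taken from the diff below):

```c
#include <assert.h>
#include <stdint.h>

/* Big-endian bit numbering used by the powerpc headers: bit 0 is the MSB */
#define PPC_BIT(bit)		(1UL << (63 - (bit)))
#define PPC_BITLSHIFT(bit)	(63 - (bit))

/*
 * Model of the RB operand built by __tlbiel_lpid in the patch:
 * PPC_BIT(52) sets IS = 2 (invalidate all entries matching the LPID),
 * and the congruence-class set index lands at PPC_BITLSHIFT(51).
 */
static uint64_t tlbiel_lpid_rb(int set)
{
	uint64_t rb = PPC_BIT(52);			/* IS = 2 */

	rb |= (uint64_t)set << PPC_BITLSHIFT(51);	/* set index */
	return rb;
}
```

With RS = 0 the LPID is taken from LPIDR, which is why the local variants below require the caller to have LPIDR set to the target LPID.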

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 .../include/asm/book3s/64/tlbflush-radix.h    |   6 +
 arch/powerpc/mm/tlb-radix.c                   | 160 ++++++++++++++++++
 2 files changed, 166 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 19b45ba6caf9..12c02c0e5a4b 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -51,4 +51,10 @@ extern void radix__flush_tlb_all(void);
 extern void radix__flush_tlb_pte_p9_dd1(unsigned long old_pte, struct mm_struct *mm,
 					unsigned long address);
 
+extern void radix__flush_tlb_lpid_page(unsigned int lpid,
+					unsigned long addr,
+					unsigned long page_size);
+extern void radix__local_flush_tlb_lpid(unsigned int lpid);
+extern void radix__local_flush_tlb_lpid_guest(unsigned int lpid);
+
 #endif
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 2fba6170ab3f..c2b001dc4dd9 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -119,6 +119,39 @@ static inline void __tlbie_pid(unsigned long pid, unsigned long ric)
 	trace_tlbie(0, 0, rb, rs, ric, prs, r);
 }
 
+static inline void __tlbiel_lpid(unsigned long lpid, int set,
+				unsigned long ric)
+{
+	unsigned long rb,rs,prs,r;
+
+	rb = PPC_BIT(52); /* IS = 2 */
+	rb |= set << PPC_BITLSHIFT(51);
+	rs = 0;  /* LPID comes from LPIDR */
+	prs = 0; /* partition scoped */
+	r = 1;   /* radix format */
+
+	asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
+		     : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
+	trace_tlbie(lpid, 1, rb, rs, ric, prs, r);
+}
+
+static inline void __tlbiel_lpid_guest(unsigned long lpid, int set,
+				unsigned long ric)
+{
+	unsigned long rb,rs,prs,r;
+
+	rb = PPC_BIT(52); /* IS = 2 */
+	rb |= set << PPC_BITLSHIFT(51);
+	rs = 0;  /* LPID comes from LPIDR */
+	prs = 1; /* process scoped */
+	r = 1;   /* radix format */
+
+	asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
+		     : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
+	trace_tlbie(lpid, 1, rb, rs, ric, prs, r);
+}
+
+
 static inline void __tlbiel_va(unsigned long va, unsigned long pid,
 			       unsigned long ap, unsigned long ric)
 {
@@ -151,6 +184,22 @@ static inline void __tlbie_va(unsigned long va, unsigned long pid,
 	trace_tlbie(0, 0, rb, rs, ric, prs, r);
 }
 
+static inline void __tlbie_lpid_va(unsigned long va, unsigned long lpid,
+			      unsigned long ap, unsigned long ric)
+{
+	unsigned long rb,rs,prs,r;
+
+	rb = va & ~(PPC_BITMASK(52, 63));
+	rb |= ap << PPC_BITLSHIFT(58);
+	rs = lpid;
+	prs = 0; /* partition scoped */
+	r = 1;   /* radix format */
+
+	asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
+		     : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
+	trace_tlbie(lpid, 0, rb, rs, ric, prs, r);
+}
+
 static inline void fixup_tlbie(void)
 {
 	unsigned long pid = 0;
@@ -162,6 +211,16 @@ static inline void fixup_tlbie(void)
 	}
 }
 
+static inline void fixup_tlbie_lpid(unsigned long lpid)
+{
+	unsigned long va = ((1UL << 52) - 1);
+
+	if (cpu_has_feature(CPU_FTR_P9_TLBIE_BUG)) {
+		asm volatile("ptesync": : :"memory");
+		__tlbie_lpid_va(va, lpid, mmu_get_ap(MMU_PAGE_64K), RIC_FLUSH_TLB);
+	}
+}
+
 /*
  * We use 128 set in radix mode and 256 set in hpt mode.
  */
@@ -215,6 +274,62 @@ static inline void _tlbie_pid(unsigned long pid, unsigned long ric)
 	asm volatile("eieio; tlbsync; ptesync": : :"memory");
 }
 
+static inline void _tlbiel_lpid(unsigned long lpid, unsigned long ric)
+{
+	int set;
+
+	VM_BUG_ON(mfspr(SPRN_LPID) != lpid);
+
+	asm volatile("ptesync": : :"memory");
+
+	/*
+	 * Flush the first set of the TLB, and if we're doing a RIC_FLUSH_ALL,
+	 * also flush the entire Page Walk Cache.
+	 */
+	__tlbiel_lpid(lpid, 0, ric);
+
+	/* For PWC, only one flush is needed */
+	if (ric == RIC_FLUSH_PWC) {
+		asm volatile("ptesync": : :"memory");
+		return;
+	}
+
+	/* For the remaining sets, just flush the TLB */
+	for (set = 1; set < POWER9_TLB_SETS_RADIX ; set++)
+		__tlbiel_lpid(lpid, set, RIC_FLUSH_TLB);
+
+	asm volatile("ptesync": : :"memory");
+	asm volatile(PPC_INVALIDATE_ERAT "; isync" : : :"memory");
+}
+
+static inline void _tlbiel_lpid_guest(unsigned long lpid, unsigned long ric)
+{
+	int set;
+
+	VM_BUG_ON(mfspr(SPRN_LPID) != lpid);
+
+	asm volatile("ptesync": : :"memory");
+
+	/*
+	 * Flush the first set of the TLB, and if we're doing a RIC_FLUSH_ALL,
+	 * also flush the entire Page Walk Cache.
+	 */
+	__tlbiel_lpid_guest(lpid, 0, ric);
+
+	/* For PWC, only one flush is needed */
+	if (ric == RIC_FLUSH_PWC) {
+		asm volatile("ptesync": : :"memory");
+		return;
+	}
+
+	/* For the remaining sets, just flush the TLB */
+	for (set = 1; set < POWER9_TLB_SETS_RADIX ; set++)
+		__tlbiel_lpid_guest(lpid, set, RIC_FLUSH_TLB);
+
+	asm volatile("ptesync": : :"memory");
+}
+
+
 static inline void __tlbiel_va_range(unsigned long start, unsigned long end,
 				    unsigned long pid, unsigned long page_size,
 				    unsigned long psize)
@@ -269,6 +384,17 @@ static inline void _tlbie_va(unsigned long va, unsigned long pid,
 	asm volatile("eieio; tlbsync; ptesync": : :"memory");
 }
 
+static inline void _tlbie_lpid_va(unsigned long va, unsigned long lpid,
+			      unsigned long psize, unsigned long ric)
+{
+	unsigned long ap = mmu_get_ap(psize);
+
+	asm volatile("ptesync": : :"memory");
+	__tlbie_lpid_va(va, lpid, ap, ric);
+	fixup_tlbie_lpid(lpid);
+	asm volatile("eieio; tlbsync; ptesync": : :"memory");
+}
+
 static inline void _tlbie_va_range(unsigned long start, unsigned long end,
 				    unsigned long pid, unsigned long page_size,
 				    unsigned long psize, bool also_pwc)
@@ -535,6 +661,40 @@ static int radix_get_mmu_psize(int page_size)
 	return psize;
 }
 
+/*
+ * Flush partition scoped LPID address translation for all CPUs.
+ */
+void radix__flush_tlb_lpid_page(unsigned int lpid,
+					unsigned long addr,
+					unsigned long page_size)
+{
+	int psize = radix_get_mmu_psize(page_size);
+
+	_tlbie_lpid_va(addr, lpid, psize, RIC_FLUSH_TLB);
+}
+EXPORT_SYMBOL_GPL(radix__flush_tlb_lpid_page);
+
+/*
+ * Flush partition scoped translations from LPID (=LPIDR)
+ */
+void radix__local_flush_tlb_lpid(unsigned int lpid)
+{
+	_tlbiel_lpid(lpid, RIC_FLUSH_ALL);
+}
+EXPORT_SYMBOL_GPL(radix__local_flush_tlb_lpid);
+
+/*
+ * Flush process scoped translations from LPID (=LPIDR).
+ * The important difference is that the guest normally manages its own
+ * translations, but some cases, e.g. vCPU migration, require KVM to flush.
+ */
+void radix__local_flush_tlb_lpid_guest(unsigned int lpid)
+{
+	_tlbiel_lpid_guest(lpid, RIC_FLUSH_ALL);
+}
+EXPORT_SYMBOL_GPL(radix__local_flush_tlb_lpid_guest);
+
+
 static void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
 				  unsigned long end, int psize);
 
-- 
2.17.0


* [PATCH v2 3/5] KVM: PPC: Book3S HV: radix use the Linux TLB flush function in kvmppc_radix_tlbie_page
  2018-04-16  4:32 [PATCH v2 0/5] KVM TLB flushing improvements (for radix) Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 1/5] KVM: PPC: Book3S HV: radix use correct tlbie sequence in kvmppc_radix_tlbie_page Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 2/5] powerpc/mm/radix: implement LPID based TLB flushes to be used by KVM Nicholas Piggin
@ 2018-04-16  4:32 ` Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 4/5] KVM: PPC: Book3S HV: radix handle process scoped LPID flush in C, with relocation on Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 5/5] KVM: PPC: Book3S HV: radix do not clear partition scoped page table when page fault races with other vCPUs Nicholas Piggin
  4 siblings, 0 replies; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-16  4:32 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Nicholas Piggin, linuxppc-dev

This has the advantage of consolidating TLB flush code in fewer
places, and it also implements powerpc:tlbie trace events.

1GB pages should be handled without further modification.
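The replacement derives the flush size from the page-table shift instead of a fixed base page size. A minimal sketch of that computation, assuming a 64K base page config for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 0x10000UL	/* assuming a 64K base page config */

/* Mirrors the new address/size computation in kvmppc_radix_tlbie_page */
static uint64_t tlbie_flush_addr(uint64_t addr, unsigned int pshift,
				 uint64_t *psize)
{
	/* pshift of 0 means a normal base page */
	*psize = pshift ? (1UL << pshift) : PAGE_SIZE;

	/* align the address down to the mapping being flushed */
	return addr & ~(*psize - 1);
}
```

This is why 1GB pages fall out naturally: any pshift just yields the matching size and alignment for radix__flush_tlb_lpid_page().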

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 26 +++++++-------------------
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 81d5ad26f9a1..dab6b622011c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -139,28 +139,16 @@ int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
 	return 0;
 }
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define MMU_BASE_PSIZE	MMU_PAGE_64K
-#else
-#define MMU_BASE_PSIZE	MMU_PAGE_4K
-#endif
-
 static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
 				    unsigned int pshift)
 {
-	int psize = MMU_BASE_PSIZE;
-
-	if (pshift >= PMD_SHIFT)
-		psize = MMU_PAGE_2M;
-	addr &= ~0xfffUL;
-	addr |= mmu_psize_defs[psize].ap << 5;
-	asm volatile("ptesync": : :"memory");
-	asm volatile(PPC_TLBIE_5(%0, %1, 0, 0, 1)
-		     : : "r" (addr), "r" (kvm->arch.lpid) : "memory");
-	if (cpu_has_feature(CPU_FTR_P9_TLBIE_BUG))
-		asm volatile(PPC_TLBIE_5(%0, %1, 0, 0, 1)
-			     : : "r" (addr), "r" (kvm->arch.lpid) : "memory");
-	asm volatile("eieio ; tlbsync ; ptesync": : :"memory");
+	unsigned long psize = PAGE_SIZE;
+
+	if (pshift)
+		psize = 1UL << pshift;
+
+	addr &= ~(psize - 1);
+	radix__flush_tlb_lpid_page(kvm->arch.lpid, addr, psize);
 }
 
 unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
-- 
2.17.0


* [PATCH v2 4/5] KVM: PPC: Book3S HV: radix handle process scoped LPID flush in C, with relocation on
  2018-04-16  4:32 [PATCH v2 0/5] KVM TLB flushing improvements (for radix) Nicholas Piggin
                   ` (2 preceding siblings ...)
  2018-04-16  4:32 ` [PATCH v2 3/5] KVM: PPC: Book3S HV: radix use the Linux TLB flush function in kvmppc_radix_tlbie_page Nicholas Piggin
@ 2018-04-16  4:32 ` Nicholas Piggin
  2018-04-16  4:32 ` [PATCH v2 5/5] KVM: PPC: Book3S HV: radix do not clear partition scoped page table when page fault races with other vCPUs Nicholas Piggin
  4 siblings, 0 replies; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-16  4:32 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Nicholas Piggin, linuxppc-dev

The radix guest code has fewer restrictions about what context it
can run in, so move this flushing out of assembly and have it use the
Linux TLB flush implementations introduced previously.

This allows powerpc:tlbie trace events to be used.

This changes the tlbiel sequence to execute the RIC=2 flush only once,
on the first set flushed, which matches the rest of the Linux flushing
code. This does not change the semantics of the flush.
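The set-by-set flush flow described above can be sketched in plain C (POWER9_TLB_SETS_RADIX and the RIC values mirror those in patch 2; flush_one_set() is a hypothetical stand-in for the tlbiel instruction itself):

```c
#include <assert.h>

#define POWER9_TLB_SETS_RADIX 128

enum { RIC_FLUSH_TLB = 0, RIC_FLUSH_PWC = 1, RIC_FLUSH_ALL = 2 };

static int flushes;	/* count of tlbiel operations issued */
static int first_ric;	/* RIC used on set 0 */

static void flush_one_set(int set, int ric)	/* stand-in for tlbiel */
{
	if (set == 0)
		first_ric = ric;
	flushes++;
}

/*
 * Mirror of the _tlbiel_lpid flow: set 0 gets the caller's RIC
 * (RIC=2 flushes TLB + PWC in one operation), the remaining sets only
 * need a plain TLB flush, and a PWC-only flush is a single operation.
 */
static void tlbiel_lpid(int ric)
{
	int set;

	flushes = 0;
	flush_one_set(0, ric);
	if (ric == RIC_FLUSH_PWC)
		return;
	for (set = 1; set < POWER9_TLB_SETS_RADIX; set++)
		flush_one_set(set, RIC_FLUSH_TLB);
}
```

The PWC is not per-set, which is why the RIC=2 (or RIC=1) part only needs to be issued once.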

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kvm/book3s_hv.c            | 26 +++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 13 ++++++-------
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 81e2ea882d97..c1660df41190 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2901,6 +2901,32 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 	for (sub = 0; sub < core_info.n_subcores; ++sub)
 		spin_unlock(&core_info.vc[sub]->lock);
 
+	if (kvm_is_radix(vc->kvm)) {
+		int tmp = pcpu;
+
+		/*
+		 * Do we need to flush the process scoped TLB for the LPAR?
+		 *
+		 * On POWER9, individual threads can come in here, but the
+		 * TLB is shared between the 4 threads in a core, hence
+		 * invalidating on one thread invalidates for all.
+		 * Thus we make all 4 threads use the same bit here.
+		 *
+		 * Hash must be flushed in realmode in order to use tlbiel.
+		 */
+		mtspr(SPRN_LPID, vc->kvm->arch.lpid);
+		isync();
+
+		if (cpu_has_feature(CPU_FTR_ARCH_300))
+			tmp &= ~0x3UL;
+
+		if (cpumask_test_cpu(tmp, &vc->kvm->arch.need_tlb_flush)) {
+			radix__local_flush_tlb_lpid_guest(vc->kvm->arch.lpid);
+			/* Clear the bit after the TLB flush */
+			cpumask_clear_cpu(tmp, &vc->kvm->arch.need_tlb_flush);
+		}
+	}
+
 	/*
 	 * Interrupts will be enabled once we get into the guest,
 	 * so tell lockdep that we're about to enable interrupts.
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index bd63fa8a08b5..d4c7bb3e777e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -636,6 +636,10 @@ kvmppc_hv_entry:
 	/* Primary thread switches to guest partition. */
 	cmpwi	r6,0
 	bne	10f
+
+	/* Radix has already switched LPID and flushed core TLB */
+	bne	cr7,22f
+
 	lwz	r7,KVM_LPID(r9)
 BEGIN_FTR_SECTION
 	ld	r6,KVM_SDR1(r9)
@@ -647,7 +651,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
 	mtspr	SPRN_LPID,r7
 	isync
 
-	/* See if we need to flush the TLB */
+	/* See if we need to flush the TLB. Hash has to be done in RM */
 	lhz	r6,PACAPACAINDEX(r13)	/* test_bit(cpu, need_tlb_flush) */
 BEGIN_FTR_SECTION
 	/*
@@ -674,15 +678,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 	li	r7,0x800		/* IS field = 0b10 */
 	ptesync
 	li	r0,0			/* RS for P9 version of tlbiel */
-	bne	cr7, 29f
 28:	tlbiel	r7			/* On P9, rs=0, RIC=0, PRS=0, R=0 */
 	addi	r7,r7,0x1000
 	bdnz	28b
-	b	30f
-29:	PPC_TLBIEL(7,0,2,1,1)		/* for radix, RIC=2, PRS=1, R=1 */
-	addi	r7,r7,0x1000
-	bdnz	29b
-30:	ptesync
+	ptesync
 23:	ldarx	r7,0,r6			/* clear the bit after TLB flushed */
 	andc	r7,r7,r8
 	stdcx.	r7,0,r6
-- 
2.17.0


* [PATCH v2 5/5] KVM: PPC: Book3S HV: radix do not clear partition scoped page table when page fault races with other vCPUs.
  2018-04-16  4:32 [PATCH v2 0/5] KVM TLB flushing improvements (for radix) Nicholas Piggin
                   ` (3 preceding siblings ...)
  2018-04-16  4:32 ` [PATCH v2 4/5] KVM: PPC: Book3S HV: radix handle process scoped LPID flush in C, with relocation on Nicholas Piggin
@ 2018-04-16  4:32 ` Nicholas Piggin
  2018-04-17  0:17   ` Nicholas Piggin
  4 siblings, 1 reply; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-16  4:32 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Nicholas Piggin, linuxppc-dev

When running an SMP radix guest, KVM can get into page fault / tlbie
storms -- hundreds of thousands to the same address from different
threads -- because the partition scoped page fault handler invalidates
the page table entry if it finds one already set up by a racing CPU.

Guest threads can hit page faults for the same addresses, for example
when KSM or THP takes out a commonly used page; gRA zero (the
interrupt vectors and important kernel text) was a common one.
Multiple CPUs page fault and contend on the same lock. When one CPU
sets up the page table and releases the lock, the next finds the new
entry and invalidates it before installing its own, which causes
further page faults that invalidate that entry, and so on.

The solution is to avoid invalidating the entry or flushing TLBs
when such a race is detected. The pte may still need bits updated,
but those updates only add R/C bits or relax access restrictions,
so no flush is required.
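The no-flush condition can be modeled roughly as follows; the _PAGE_* values here are illustrative stand-ins, not the real pgtable definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only; the real ones live in the pgtable headers */
#define _PAGE_PRESENT	0x1UL
#define _PAGE_READ	0x2UL
#define _PAGE_WRITE	0x4UL
#define _PAGE_DIRTY	0x80UL

/*
 * A racing fault only ever adds R/C bits or relaxes access, so an
 * invalidate (and tlbie) would be needed only if the new PTE removed
 * PRESENT/READ/WRITE from the old one. This mirrors the WARN_ON_ONCE
 * condition the patch adds to sanity-check that assumption.
 */
static int needs_invalidate(uint64_t old_pte, uint64_t new_pte)
{
	return ((old_pte & ~new_pte) &
		(_PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE)) != 0;
}
```

Since the racing-fault case never takes bits away, the handler can update the pte in place and skip the tlbie entirely.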

This solves the page fault / tlbie storms.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 52 ++++++++++++++++----------
 1 file changed, 33 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index dab6b622011c..2d3af22f90dd 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -199,7 +199,6 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
 	pud_t *pud, *new_pud = NULL;
 	pmd_t *pmd, *new_pmd = NULL;
 	pte_t *ptep, *new_ptep = NULL;
-	unsigned long old;
 	int ret;
 
 	/* Traverse the guest's 2nd-level tree, allocate new levels needed */
@@ -243,6 +242,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
 	pmd = pmd_offset(pud, gpa);
 	if (pmd_is_leaf(*pmd)) {
 		unsigned long lgpa = gpa & PMD_MASK;
+		pte_t old_pte = *pmdp_ptep(pmd);
 
 		/*
 		 * If we raced with another CPU which has just put
@@ -252,18 +252,22 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
 			ret = -EAGAIN;
 			goto out_unlock;
 		}
-		/* Valid 2MB page here already, remove it */
-		old = kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd),
-					      ~0UL, 0, lgpa, PMD_SHIFT);
-		kvmppc_radix_tlbie_page(kvm, lgpa, PMD_SHIFT);
-		if (old & _PAGE_DIRTY) {
-			unsigned long gfn = lgpa >> PAGE_SHIFT;
-			struct kvm_memory_slot *memslot;
-			memslot = gfn_to_memslot(kvm, gfn);
-			if (memslot && memslot->dirty_bitmap)
-				kvmppc_update_dirty_map(memslot,
-							gfn, PMD_SIZE);
+
+		/* PTE was previously valid, so update it */
+		if (pte_val(old_pte) == pte_val(pte)) {
+			ret = -EAGAIN;
+			goto out_unlock;
 		}
+
+		/* Make sure we weren't trying to take bits away */
+		WARN_ON_ONCE(pte_pfn(old_pte) != pte_pfn(pte));
+		WARN_ON_ONCE((pte_val(old_pte) & ~pte_val(pte)) &
+			(_PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE));
+
+		kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd),
+					0, pte_val(pte), lgpa, PMD_SHIFT);
+		ret = 0;
+		goto out_unlock;
 	} else if (level == 1 && !pmd_none(*pmd)) {
 		/*
 		 * There's a page table page here, but we wanted
@@ -274,6 +278,8 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
 		goto out_unlock;
 	}
 	if (level == 0) {
+		pte_t old_pte;
+
 		if (pmd_none(*pmd)) {
 			if (!new_ptep)
 				goto out_unlock;
@@ -281,13 +287,21 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
 			new_ptep = NULL;
 		}
 		ptep = pte_offset_kernel(pmd, gpa);
-		if (pte_present(*ptep)) {
-			/* PTE was previously valid, so invalidate it */
-			old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_PRESENT,
-						      0, gpa, 0);
-			kvmppc_radix_tlbie_page(kvm, gpa, 0);
-			if (old & _PAGE_DIRTY)
-				mark_page_dirty(kvm, gpa >> PAGE_SHIFT);
+		old_pte = *ptep;
+		if (pte_present(old_pte)) {
+			/* PTE was previously valid, so update it */
+			if (pte_val(old_pte) == pte_val(pte)) {
+				ret = -EAGAIN;
+				goto out_unlock;
+			}
+
+			/* Make sure we weren't trying to take bits away */
+			WARN_ON_ONCE(pte_pfn(old_pte) != pte_pfn(pte));
+			WARN_ON_ONCE((pte_val(old_pte) & ~pte_val(pte)) &
+				(_PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE));
+
+			kvmppc_radix_update_pte(kvm, ptep, 0,
+						pte_val(pte), gpa, 0);
 		}
 		kvmppc_radix_set_pte_at(kvm, gpa, ptep, pte);
 	} else {
-- 
2.17.0


* Re: [PATCH v2 5/5] KVM: PPC: Book3S HV: radix do not clear partition scoped page table when page fault races with other vCPUs.
  2018-04-16  4:32 ` [PATCH v2 5/5] KVM: PPC: Book3S HV: radix do not clear partition scoped page table when page fault races with other vCPUs Nicholas Piggin
@ 2018-04-17  0:17   ` Nicholas Piggin
  0 siblings, 0 replies; 7+ messages in thread
From: Nicholas Piggin @ 2018-04-17  0:17 UTC (permalink / raw)
  To: kvm-ppc; +Cc: linuxppc-dev

On Mon, 16 Apr 2018 14:32:40 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> When running a SMP radix guest, KVM can get into page fault / tlbie
> storms -- hundreds of thousands to the same address from different
> threads -- due to partition scoped page faults invalidating the
> page table entry if it was found to be already set up by a racing
> CPU.
> 
> What can happen is that guest threads can hit page faults for the
> same addresses, this can happen when KSM or THP takes out a commonly
> used page. gRA zero (the interrupt vectors and important kernel text)
> was a common one. Multiple CPUs will page fault and contend on the
> same lock, when one CPU sets up the page table and releases the lock,
> the next will find the new entry and invalidate it before installing
> its own, which causes other page faults which invalidate that entry,
> etc.
> 
> The solution to this is to avoid invalidating the entry or flushing
> TLBs in case of a race. The pte may still need bits updated, but
> those are to add R/C or relax access restrictions so no flush is
> required.
> 
> This solves the page fault / tlbie storms.

Oh, I didn't notice "KVM: PPC: Book3S HV: Radix page fault handler
optimizations" does much the same thing as this one and it's been
merged upstream now.

That also adds a partition scoped PWC flush that I'll add to
powerpc/mm, so I'll rebase this series.

Thanks,
Nick

