* [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
@ 2015-02-19 10:54 ` Ard Biesheuvel
  0 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 10:54 UTC (permalink / raw)
  To: lersek, christoffer.dall, marc.zyngier, linux-arm-kernel, peter.maydell
  Cc: pbonzini, Ard Biesheuvel, kvmarm, kvm

This is a zeroth-order approximation of how we could potentially force the guest
to avoid uncached mappings, at least from the moment the MMU is on. (Before
that, all memory is implicitly classified as Device-nGnRnE.)

The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
with cached ones. This way, there is no need to mangle any guest page tables.

The downside is that, to do this correctly, we need to always trap writes to
the VM sysreg group, which includes registers that the guest may write to very
often. To reduce the associated performance hit, patch #1 introduces a fast path
for EL2 to perform trivial sysreg writes on behalf of the guest, without the
need for a full world switch to the host and back.

The main purpose of these patches is to quantify the performance hit and to
verify that the MAIR_EL1 handling works correctly.

Ard Biesheuvel (3):
  arm64: KVM: handle some sysreg writes in EL2
  arm64: KVM: mangle MAIR register to prevent uncached guest mappings
  arm64: KVM: keep trapping of VM sysreg writes enabled

 arch/arm/kvm/mmu.c               |   2 +-
 arch/arm64/include/asm/kvm_arm.h |   2 +-
 arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
 arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
 4 files changed, 156 insertions(+), 12 deletions(-)

-- 
1.8.3.2

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC/RFT PATCH 1/3] arm64: KVM: handle some sysreg writes in EL2
  2015-02-19 10:54 ` Ard Biesheuvel
@ 2015-02-19 10:54   ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 10:54 UTC (permalink / raw)
  To: lersek, christoffer.dall, marc.zyngier, linux-arm-kernel, peter.maydell
  Cc: kvm, kvmarm, agraf, pbonzini, Ard Biesheuvel

This adds handling to el1_trap() to perform some sysreg writes directly
in EL2, without performing the full world switch to the host and back
again. This is mainly for writes that don't need special handling, but
where the register is part of the group that we need to trap for other
reasons.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/kvm/hyp.S      | 101 ++++++++++++++++++++++++++++++++++++++++++++++
 arch/arm64/kvm/sys_regs.c |  28 ++++++++-----
 2 files changed, 120 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index c3ca89c27c6b..e3af6840cb3f 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -26,6 +26,7 @@
 #include <asm/kvm_asm.h>
 #include <asm/kvm_arm.h>
 #include <asm/kvm_mmu.h>
+#include <asm/sysreg.h>
 
 #define CPU_GP_REG_OFFSET(x)	(CPU_GP_REGS + x)
 #define CPU_XREG_OFFSET(x)	CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
@@ -887,6 +888,34 @@
 1:
 .endm
 
+/*
+ * Macro to conditionally perform a parametrised system register write. Note
+ * that we currently only support writing x3 to a system register in class
+ * Op0 == 3 and Op1 == 0, which is all we need at the moment.
+ */
+.macro	cond_sysreg_write,op0,op1,crn,crm,op2,sreg,opreg,outlbl
+	.ifnc	\op0,3    ; .err ; .endif
+	.ifnc	\op1,0    ; .err ; .endif
+	.ifnc	\opreg,x3 ; .err ; .endif
+	cmp	\sreg, #((\crm) | ((\crn) << 4) | ((\op2) << 8))
+	bne	9999f
+	// doesn't work: msr_s sys_reg(\op0,\op1,\crn,\crm,\op2), \opreg
+	.inst	0xd5180003|((\crn) << 12)|((\crm) << 8)|((\op2 << 5))
+	b	\outlbl
+9999:
+.endm
+
+/*
+ * Pack CRn, CRm and Op2 into 11 adjacent low bits so we can use a single
+ * cmp instruction to compare it with a 12-bit immediate.
+ */
+.macro	pack_sysreg_idx, outreg, inreg
+	ubfm	\outreg, \inreg, #(17 - 8), #(17 + 2)	// Op2 -> bits 8 - 10
+	bfm	\outreg, \inreg, #(10 - 4), #(10 + 3)	// CRn -> bits 4 - 7
+	bfm	\outreg, \inreg, #(1 - 0), #(1 + 3)	// CRm -> bits 0 - 3
+.endm
+
+
 __save_sysregs:
 	save_sysregs
 	ret
@@ -1178,6 +1207,15 @@ el1_trap:
 	 * x1: ESR
 	 * x2: ESR_EC
 	 */
+
+	/*
+	 * Find out if the exception we are about to pass to the host is a
+	 * write to a system register, which we may prefer to handle in EL2.
+	 */
+	tst	x1, #1				// direction == write (0) ?
+	ccmp	x2, #ESR_EL2_EC_SYS64, #0, eq	// is a sysreg access?
+	b.eq	4f
+
 	cmp	x2, #ESR_EL2_EC_DABT
 	mov	x0, #ESR_EL2_EC_IABT
 	ccmp	x2, x0, #4, ne
@@ -1239,6 +1277,69 @@ el1_trap:
 
 	eret
 
+4:	and	x2, x1, #(3 << 20)		// check for Op0 == 0b11
+	cmp	x2, #(3 << 20)
+	b.ne	1b
+	ands	x2, x1, #(7 << 14)		// check for Op1 == 0b000
+	b.ne	1b
+
+	/*
+	 * If we end up here, we are about to perform a system register write
+	 * with Op0 == 0b11 and Op1 == 0b000. Move the operand to x3 first, we
+	 * will check later if we are actually going to handle this write in EL2
+	 */
+	adr	x0, 5f
+	ubfx	x2, x1, #5, #5		// operand reg# in bits 9 .. 5
+	add	x0, x0, x2, lsl #3
+	br	x0
+5:	ldr	x3, [sp, #16]		// x0 from the stack
+	b	6f
+	ldr	x3, [sp, #24]		// x1 from the stack
+	b	6f
+	ldr	x3, [sp]		// x2 from the stack
+	b	6f
+	ldr	x3, [sp, #8]		// x3 from the stack
+	b	6f
+	.irp	reg,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
+	mov	x3, x\reg
+	b	6f
+	.endr
+	mov	x3, xzr			// x31
+
+	/*
+	 * Ok, so now we have the desired value in x3, let's write it into the
+	 * sysreg if it's a register write we want to handle in EL2. Since these
+	 * are tried in order, it makes sense to put the ones used most often at
+	 * the top.
+	 */
+6:	pack_sysreg_idx		x2, x1
+	cond_sysreg_write	3,0, 2,0,0,x2,x3,7f	// TTBR0_EL1
+	cond_sysreg_write	3,0, 2,0,1,x2,x3,7f	// TTBR1_EL1
+	cond_sysreg_write	3,0, 2,0,2,x2,x3,7f	// TCR_EL1
+	cond_sysreg_write	3,0, 5,2,0,x2,x3,7f	// ESR_EL1
+	cond_sysreg_write	3,0, 6,0,0,x2,x3,7f	// FAR_EL1
+	cond_sysreg_write	3,0, 5,1,0,x2,x3,7f	// AFSR0_EL1
+	cond_sysreg_write	3,0, 5,1,1,x2,x3,7f	// AFSR1_EL1
+	cond_sysreg_write	3,0,10,3,0,x2,x3,7f	// AMAIR_EL1
+	cond_sysreg_write	3,0,13,0,1,x2,x3,7f	// CONTEXTIDR_EL1
+
+	/*
+	 * If we end up here, the write is to a register that we don't handle
+	 * in EL2. Let the host handle it instead ...
+	 */
+	b	1b
+
+	/*
+	 * We have handled the write. Increment the pc and return to the
+	 * guest.
+	 */
+7:	mrs	x0, elr_el2
+	add	x0, x0, #4
+	msr	elr_el2, x0
+	pop	x2, x3
+	pop	x0, x1
+	eret
+
 el1_irq:
 	push	x0, x1
 	push	x2, x3
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index f31e8bb2bc5b..1e170eab6603 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -187,6 +187,16 @@ static bool trap_debug_regs(struct kvm_vcpu *vcpu,
 	return true;
 }
 
+static bool access_handled_at_el2(struct kvm_vcpu *vcpu,
+				  const struct sys_reg_params *params,
+				  const struct sys_reg_desc *r)
+{
+	kvm_debug("sys_reg write at %lx should have been handled in EL2\n",
+		  *vcpu_pc(vcpu));
+	print_sys_reg_instr(params);
+	return false;
+}
+
 static void reset_amair_el1(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
 {
 	u64 amair;
@@ -328,26 +338,26 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	  NULL, reset_val, CPACR_EL1, 0 },
 	/* TTBR0_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b000),
-	  access_vm_reg, reset_unknown, TTBR0_EL1 },
+	  access_handled_at_el2, reset_unknown, TTBR0_EL1 },
 	/* TTBR1_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b001),
-	  access_vm_reg, reset_unknown, TTBR1_EL1 },
+	  access_handled_at_el2, reset_unknown, TTBR1_EL1 },
 	/* TCR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b010),
-	  access_vm_reg, reset_val, TCR_EL1, 0 },
+	  access_handled_at_el2, reset_val, TCR_EL1, 0 },
 
 	/* AFSR0_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b000),
-	  access_vm_reg, reset_unknown, AFSR0_EL1 },
+	  access_handled_at_el2, reset_unknown, AFSR0_EL1 },
 	/* AFSR1_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b001),
-	  access_vm_reg, reset_unknown, AFSR1_EL1 },
+	  access_handled_at_el2, reset_unknown, AFSR1_EL1 },
 	/* ESR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0010), Op2(0b000),
-	  access_vm_reg, reset_unknown, ESR_EL1 },
+	  access_handled_at_el2, reset_unknown, ESR_EL1 },
 	/* FAR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0110), CRm(0b0000), Op2(0b000),
-	  access_vm_reg, reset_unknown, FAR_EL1 },
+	  access_handled_at_el2, reset_unknown, FAR_EL1 },
 	/* PAR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0111), CRm(0b0100), Op2(0b000),
 	  NULL, reset_unknown, PAR_EL1 },
@@ -364,7 +374,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	  access_vm_reg, reset_unknown, MAIR_EL1 },
 	/* AMAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0011), Op2(0b000),
-	  access_vm_reg, reset_amair_el1, AMAIR_EL1 },
+	  access_handled_at_el2, reset_amair_el1, AMAIR_EL1 },
 
 	/* VBAR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1100), CRm(0b0000), Op2(0b000),
@@ -376,7 +386,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 
 	/* CONTEXTIDR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b001),
-	  access_vm_reg, reset_val, CONTEXTIDR_EL1, 0 },
+	  access_handled_at_el2, reset_val, CONTEXTIDR_EL1, 0 },
 	/* TPIDR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b100),
 	  NULL, reset_unknown, TPIDR_EL1 },
-- 
1.8.3.2



* [RFC/RFT PATCH 2/3] arm64: KVM: mangle MAIR register to prevent uncached guest mappings
  2015-02-19 10:54 ` Ard Biesheuvel
@ 2015-02-19 10:54   ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 10:54 UTC (permalink / raw)
  To: lersek, christoffer.dall, marc.zyngier, linux-arm-kernel, peter.maydell
  Cc: pbonzini, Ard Biesheuvel, kvmarm, kvm

Mangle the memory attribute register values at each write to MAIR_EL1
so that regions that the guest intends to map as device or uncached are
in fact mapped as cached instead. This avoids incoherency issues when
the guest bypasses the caches to access memory that the host has mapped
as cached.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/kvm/sys_regs.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 1e170eab6603..bde2b49a7cd8 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -110,6 +110,39 @@ static bool access_vm_reg(struct kvm_vcpu *vcpu,
 	return true;
 }
 
+static bool access_mair(struct kvm_vcpu *vcpu,
+			const struct sys_reg_params *p,
+			const struct sys_reg_desc *r)
+{
+	unsigned long val, mask;
+
+	BUG_ON(!p->is_write);
+
+	val = *vcpu_reg(vcpu, p->Rt);
+
+	if (!p->is_aarch32) {
+		/*
+		 * Mangle val so that all device and uncached attributes are
+		 * replaced with cached attributes.
+		 * For each attribute, check whether any of bit 7, bit 5 or bit
+		 * 4 are set. If not, it is a device or outer non-cacheable
+		 * mapping and we override it with inner, outer write-through,
+		 * read+write-allocate (0xbb).
+		 * TODO: handle outer cacheable inner non-cacheable
+		 */
+		mask = ~(val >> 7 | val >> 5 | val >> 4) & 0x0101010101010101UL;
+		val = (val & ~(mask * 0xff)) | (mask * 0xbb);
+
+		vcpu_sys_reg(vcpu, r->reg) = val;
+	} else {
+		if (!p->is_32bit)
+			vcpu_cp15_64_high(vcpu, r->reg) = val >> 32;
+		vcpu_cp15_64_low(vcpu, r->reg) = val & 0xffffffffUL;
+	}
+
+	return true;
+}
+
 static bool trap_raz_wi(struct kvm_vcpu *vcpu,
 			const struct sys_reg_params *p,
 			const struct sys_reg_desc *r)
@@ -371,7 +404,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 
 	/* MAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0010), Op2(0b000),
-	  access_vm_reg, reset_unknown, MAIR_EL1 },
+	  access_mair, reset_unknown, MAIR_EL1 },
 	/* AMAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0011), Op2(0b000),
 	  access_handled_at_el2, reset_amair_el1, AMAIR_EL1 },
-- 
1.8.3.2


* [RFC/RFT PATCH 3/3] arm64: KVM: keep trapping of VM sysreg writes enabled
  2015-02-19 10:54 ` Ard Biesheuvel
@ 2015-02-19 10:54   ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 10:54 UTC (permalink / raw)
  To: lersek, christoffer.dall, marc.zyngier, linux-arm-kernel, peter.maydell
  Cc: pbonzini, Ard Biesheuvel, kvmarm, kvm

---
 arch/arm/kvm/mmu.c               | 2 +-
 arch/arm64/include/asm/kvm_arm.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 136662547ca6..fa8ec55220ea 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -1530,7 +1530,7 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
 		stage2_flush_vm(vcpu->kvm);
 
 	/* Caches are now on, stop trapping VM ops (until a S/W op) */
-	if (now_enabled)
+	if (0)//now_enabled)
 		vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) & ~HCR_TVM);
 
 	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index 8afb863f5a9e..437e1ec17539 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -75,7 +75,7 @@
  * FMO:		Override CPSR.F and enable signaling with VF
  * SWIO:	Turn set/way invalidates into set/way clean+invalidate
  */
-#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
+#define HCR_GUEST_FLAGS (HCR_TSC | /* HCR_TSW | */ HCR_TWE | HCR_TWI | HCR_VM | \
 			 HCR_TVM | HCR_BSU_IS | HCR_FB | HCR_TAC | \
 			 HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
-- 
1.8.3.2


* Re: [RFC/RFT PATCH 3/3] arm64: KVM: keep trapping of VM sysreg writes enabled
  2015-02-19 10:54   ` Ard Biesheuvel
@ 2015-02-19 13:40     ` Marc Zyngier
  -1 siblings, 0 replies; 110+ messages in thread
From: Marc Zyngier @ 2015-02-19 13:40 UTC (permalink / raw)
  To: Ard Biesheuvel, lersek, christoffer.dall, linux-arm-kernel,
	peter.maydell
  Cc: pbonzini, kvmarm, kvm

On 19/02/15 10:54, Ard Biesheuvel wrote:
> ---
>  arch/arm/kvm/mmu.c               | 2 +-
>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 136662547ca6..fa8ec55220ea 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1530,7 +1530,7 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
>  		stage2_flush_vm(vcpu->kvm);
>  
>  	/* Caches are now on, stop trapping VM ops (until a S/W op) */
> -	if (now_enabled)
> +	if (0)//now_enabled)
>  		vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) & ~HCR_TVM);
>  
>  	trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 8afb863f5a9e..437e1ec17539 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -75,7 +75,7 @@
>   * FMO:		Override CPSR.F and enable signaling with VF
>   * SWIO:	Turn set/way invalidates into set/way clean+invalidate
>   */
> -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
> +#define HCR_GUEST_FLAGS (HCR_TSC | /* HCR_TSW | */ HCR_TWE | HCR_TWI | HCR_VM | \

Why do we stop trapping S/W ops here? We can't let the guest issue those
without doing anything, as that will break anything that expects the
data to make it to memory. Think of the 32-bit kernel decompressor, for
example.

	M.
-- 
Jazz is not dead. It just smells funny...


* Re: [RFC/RFT PATCH 3/3] arm64: KVM: keep trapping of VM sysreg writes enabled
  2015-02-19 13:40     ` Marc Zyngier
@ 2015-02-19 13:44       ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 13:44 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: lersek, christoffer.dall, linux-arm-kernel, peter.maydell, kvm,
	kvmarm, agraf, pbonzini

On 19 February 2015 at 13:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 19/02/15 10:54, Ard Biesheuvel wrote:
>> ---
>>  arch/arm/kvm/mmu.c               | 2 +-
>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 136662547ca6..fa8ec55220ea 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -1530,7 +1530,7 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
>>               stage2_flush_vm(vcpu->kvm);
>>
>>       /* Caches are now on, stop trapping VM ops (until a S/W op) */
>> -     if (now_enabled)
>> +     if (0)//now_enabled)
>>               vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) & ~HCR_TVM);
>>
>>       trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index 8afb863f5a9e..437e1ec17539 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -75,7 +75,7 @@
>>   * FMO:              Override CPSR.F and enable signaling with VF
>>   * SWIO:     Turn set/way invalidates into set/way clean+invalidate
>>   */
>> -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>> +#define HCR_GUEST_FLAGS (HCR_TSC | /* HCR_TSW | */ HCR_TWE | HCR_TWI | HCR_VM | \
>
> Why do we stop trapping S/W ops here? We can't let the guest issue those
> without doing anything, as this will break anything that expects the
> data to make it to memory. Think of the 32bit kernel decompressor, for
> example.
>

TBH patch #3 is just a q'n'd hack to ensure that the TVM bit remains
set in HCR. I was assuming that cleaning the entire cache on mmu
enable/disable would be sufficient to quantify the performance impact
and check whether patch #2 works as advertised.

I was wondering: isn't calling stage2_flush_vm() for each set of each
way very costly?

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 10:54 ` Ard Biesheuvel
@ 2015-02-19 14:50   ` Alexander Graf
  -1 siblings, 0 replies; 110+ messages in thread
From: Alexander Graf @ 2015-02-19 14:50 UTC (permalink / raw)
  To: Ard Biesheuvel, lersek, christoffer.dall, marc.zyngier,
	linux-arm-kernel, peter.maydell
  Cc: kvm, kvmarm, pbonzini



On 19.02.15 11:54, Ard Biesheuvel wrote:
> This is a 0th order approximation of how we could potentially force the guest
> to avoid uncached mappings, at least from the moment the MMU is on. (Before
> that, all of memory is implicitly classified as Device-nGnRnE)
> 
> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
> with cached ones. This way, there is no need to mangle any guest page tables.

Would you mind giving a brief explanation of what this does? What
happens to actually assigned devices that need to be mapped as uncached?
What happens to DMA from such devices when the guest assumes that it's
accessing RAM uncached and then triggers DMA?


Alex

> 
> The downside is that, to do this correctly, we need to always trap writes to
> the VM sysreg group, which includes registers that the guest may write to very
> often. To reduce the associated performance hit, patch #1 introduces a fast path
> for EL2 to perform trivial sysreg writes on behalf of the guest, without the
> need for a full world switch to the host and back.
> 
> The main purpose of these patches is to quantify the performance hit, and
> verify whether the MAIR_EL1 handling works correctly. 
> 
> Ard Biesheuvel (3):
>   arm64: KVM: handle some sysreg writes in EL2
>   arm64: KVM: mangle MAIR register to prevent uncached guest mappings
>   arm64: KVM: keep trapping of VM sysreg writes enabled
> 
>  arch/arm/kvm/mmu.c               |   2 +-
>  arch/arm64/include/asm/kvm_arm.h |   2 +-
>  arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
>  4 files changed, 156 insertions(+), 12 deletions(-)
> 

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 14:50   ` Alexander Graf
@ 2015-02-19 14:56     ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 14:56 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Laszlo Ersek, Christoffer Dall, Marc Zyngier, linux-arm-kernel,
	Peter Maydell, KVM devel mailing list, kvmarm, Paolo Bonzini

On 19 February 2015 at 14:50, Alexander Graf <agraf@suse.de> wrote:
>
>
> On 19.02.15 11:54, Ard Biesheuvel wrote:
>> This is a 0th order approximation of how we could potentially force the guest
>> to avoid uncached mappings, at least from the moment the MMU is on. (Before
>> that, all of memory is implicitly classified as Device-nGnRnE)
>>
>> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
>> with cached ones. This way, there is no need to mangle any guest page tables.
>
> Would you mind giving a brief explanation of what this does? What
> happens to actually assigned devices that need to be mapped as uncached?
> What happens to DMA from such devices when the guest assumes that it's
> accessing RAM uncached and then triggers DMA?
>

On ARM, stage 2 mappings that are more strict will supersede stage 1
mappings, so the idea is to use cached mappings exclusively for stage
1 so that the host is fully in control of the actual memory attributes
by setting the attributes at stage 2. This also makes sense because
the host will ultimately know better whether some range that the guest
thinks is a device is actually a device or just emulated (no stage 2
mapping), backed by host memory (such as the NOR flash read case) or
backed by a passthrough device.
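
A minimal sketch of the MAIR_EL1 rewrite that patch #2 describes could
look like this. It is illustrative only: the set of encodings treated as
"uncached" and the substituted attribute (0xBB, Normal Write-Through,
Read+Write-Allocate, matching the write-through choice mentioned later
in the thread) are assumptions based on the cover letter, not the actual
patch code.

```c
#include <assert.h>
#include <stdint.h>

#define MAIR_ATTR_WT_RWA	0xBBull	/* Normal WT, RA+WA, both nibbles */

/* Device encodings have a zero high nibble; Normal Non-Cacheable is
 * 0b0100 in both nibbles (0x44). */
static int attr_is_uncached(uint8_t attr)
{
	if ((attr & 0xf0) == 0)
		return 1;
	return attr == 0x44;
}

/*
 * MAIR_EL1 holds eight 8-bit attribute fields. Replace every field
 * that describes Device or Normal Non-Cacheable memory with a cached
 * encoding, so guest stage 1 mappings can never be less cacheable
 * than the host intends.
 */
static uint64_t mangle_mair(uint64_t mair)
{
	unsigned int i;

	for (i = 0; i < 8; i++) {
		uint8_t attr = (mair >> (i * 8)) & 0xff;

		if (attr_is_uncached(attr)) {
			mair &= ~(0xffull << (i * 8));
			mair |= MAIR_ATTR_WT_RWA << (i * 8);
		}
	}
	return mair;
}
```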

-- 
Ard.


>>
>> The downside is that, to do this correctly, we need to always trap writes to
>> the VM sysreg group, which includes registers that the guest may write to very
>> often. To reduce the associated performance hit, patch #1 introduces a fast path
>> for EL2 to perform trivial sysreg writes on behalf of the guest, without the
>> need for a full world switch to the host and back.
>>
>> The main purpose of these patches is to quantify the performance hit, and
>> verify whether the MAIR_EL1 handling works correctly.
>>
>> Ard Biesheuvel (3):
>>   arm64: KVM: handle some sysreg writes in EL2
>>   arm64: KVM: mangle MAIR register to prevent uncached guest mappings
>>   arm64: KVM: keep trapping of VM sysreg writes enabled
>>
>>  arch/arm/kvm/mmu.c               |   2 +-
>>  arch/arm64/include/asm/kvm_arm.h |   2 +-
>>  arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
>>  arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
>>  4 files changed, 156 insertions(+), 12 deletions(-)
>>

* Re: [RFC/RFT PATCH 3/3] arm64: KVM: keep trapping of VM sysreg writes enabled
  2015-02-19 13:44       ` Ard Biesheuvel
@ 2015-02-19 15:19         ` Marc Zyngier
  -1 siblings, 0 replies; 110+ messages in thread
From: Marc Zyngier @ 2015-02-19 15:19 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: lersek, christoffer.dall, linux-arm-kernel, peter.maydell, kvm,
	kvmarm, agraf, pbonzini

On 19/02/15 13:44, Ard Biesheuvel wrote:
> On 19 February 2015 at 13:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On 19/02/15 10:54, Ard Biesheuvel wrote:
>>> ---
>>>  arch/arm/kvm/mmu.c               | 2 +-
>>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index 136662547ca6..fa8ec55220ea 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> @@ -1530,7 +1530,7 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
>>>               stage2_flush_vm(vcpu->kvm);
>>>
>>>       /* Caches are now on, stop trapping VM ops (until a S/W op) */
>>> -     if (now_enabled)
>>> +     if (0)//now_enabled)
>>>               vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) & ~HCR_TVM);
>>>
>>>       trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
>>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>>> index 8afb863f5a9e..437e1ec17539 100644
>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>> @@ -75,7 +75,7 @@
>>>   * FMO:              Override CPSR.F and enable signaling with VF
>>>   * SWIO:     Turn set/way invalidates into set/way clean+invalidate
>>>   */
>>> -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>>> +#define HCR_GUEST_FLAGS (HCR_TSC | /* HCR_TSW | */ HCR_TWE | HCR_TWI | HCR_VM | \
>>
>> Why do we stop trapping S/W ops here? We can't let the guest issue those
>> without doing anything, as this will break anything that expects the
>> data to make it to memory. Think of the 32bit kernel decompressor, for
>> example.
>>
> 
> TBH patch #3 is just a q'n'd hack to ensure that the TVM bit remains
> set in HCR. I was assuming that cleaning the entire cache on mmu
> enable/disable would be sufficient to quantify the performance impact
> and check whether patch #2 works as advertised.

OK.

> I was wondering: isn't calling stage2_flush_vm() for each set of each
> way very costly?

It's only called once, when TVM is not set. We then set TVM to make sure
that this doesn't happen anymore, until we stop trapping.

Of course, with your new approach, this doesn't work anymore and we'd
need to find another approach.
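
The flow Marc describes can be modeled as a tiny state machine. This is
a toy sketch of the interplay between the set/way handler and
kvm_toggle_cache(); the struct and function names here are illustrative,
only the HCR_TVM bit value is architectural.

```c
#include <assert.h>

#define HCR_TVM		(1u << 26)	/* trap writes to VM registers */

struct toy_vcpu {
	unsigned int hcr;
	unsigned int flushes;	/* stage2_flush_vm() invocations */
};

/*
 * The first trapped set/way op pays for one full stage 2 flush, then
 * sets TVM so that subsequent ops trap without flushing again.
 */
static void toy_set_way_op(struct toy_vcpu *v)
{
	if (!(v->hcr & HCR_TVM)) {
		v->flushes++;		/* stage2_flush_vm(vcpu->kvm) */
		v->hcr |= HCR_TVM;	/* keep trapping VM ops */
	}
}

/* Caches back on: stop trapping (the line patch #3 comments out). */
static void toy_cache_enabled(struct toy_vcpu *v)
{
	v->hcr &= ~HCR_TVM;
}
```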

	M.
-- 
Jazz is not dead. It just smells funny...

* Re: [RFC/RFT PATCH 3/3] arm64: KVM: keep trapping of VM sysreg writes enabled
  2015-02-19 15:19         ` Marc Zyngier
@ 2015-02-19 15:22           ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 15:22 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvm, pbonzini, lersek, kvmarm, linux-arm-kernel

On 19 February 2015 at 15:19, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 19/02/15 13:44, Ard Biesheuvel wrote:
>> On 19 February 2015 at 13:40, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>> On 19/02/15 10:54, Ard Biesheuvel wrote:
>>>> ---
>>>>  arch/arm/kvm/mmu.c               | 2 +-
>>>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>> index 136662547ca6..fa8ec55220ea 100644
>>>> --- a/arch/arm/kvm/mmu.c
>>>> +++ b/arch/arm/kvm/mmu.c
>>>> @@ -1530,7 +1530,7 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
>>>>               stage2_flush_vm(vcpu->kvm);
>>>>
>>>>       /* Caches are now on, stop trapping VM ops (until a S/W op) */
>>>> -     if (now_enabled)
>>>> +     if (0)//now_enabled)
>>>>               vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) & ~HCR_TVM);
>>>>
>>>>       trace_kvm_toggle_cache(*vcpu_pc(vcpu), was_enabled, now_enabled);
>>>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>>>> index 8afb863f5a9e..437e1ec17539 100644
>>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>>> @@ -75,7 +75,7 @@
>>>>   * FMO:              Override CPSR.F and enable signaling with VF
>>>>   * SWIO:     Turn set/way invalidates into set/way clean+invalidate
>>>>   */
>>>> -#define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>>>> +#define HCR_GUEST_FLAGS (HCR_TSC | /* HCR_TSW | */ HCR_TWE | HCR_TWI | HCR_VM | \
>>>
>>> Why do we stop trapping S/W ops here? We can't let the guest issue those
>>> without doing anything, as this will break anything that expects the
>>> data to make it to memory. Think of the 32bit kernel decompressor, for
>>> example.
>>>
>>
>> TBH patch #3 is just a q'n'd hack to ensure that the TVM bit remains
>> set in HCR. I was assuming that cleaning the entire cache on mmu
>> enable/disable would be sufficient to quantify the performance impact
>> and check whether patch #2 works as advertised.
>
> OK.
>
>> I was wondering: isn't calling stage2_flush_vm() for each set of each
>> way very costly?
>
> It's only called once, when TVM is not set. We then set TVM to make sure
> that this doesn't happen anymore, until we stop trapping.
>

Ah, right, of course.

> Of course, with your new approach, this doesn't work anymore and we'd
> need to find another approach.
>

Well, *if* this approach is feasible in the first place, I'm sure we
can find another bit to set to keep track of whether to perform the
s/w operations. For now, I just ripped it out because I was afraid it
might skew performance measurements done with the TVM kept set.

-- 
Ard.

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 14:56     ` Ard Biesheuvel
@ 2015-02-19 15:27       ` Alexander Graf
  -1 siblings, 0 replies; 110+ messages in thread
From: Alexander Graf @ 2015-02-19 15:27 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Laszlo Ersek, Christoffer Dall, Marc Zyngier, linux-arm-kernel,
	Peter Maydell, KVM devel mailing list, kvmarm, Paolo Bonzini



On 19.02.15 15:56, Ard Biesheuvel wrote:
> On 19 February 2015 at 14:50, Alexander Graf <agraf@suse.de> wrote:
>>
>>
>> On 19.02.15 11:54, Ard Biesheuvel wrote:
>>> This is a 0th order approximation of how we could potentially force the guest
>>> to avoid uncached mappings, at least from the moment the MMU is on. (Before
>>> that, all of memory is implicitly classified as Device-nGnRnE)
>>>
>>> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
>>> with cached ones. This way, there is no need to mangle any guest page tables.
>>
>> Would you mind giving a brief explanation of what this does? What
>> happens to actually assigned devices that need to be mapped as uncached?
>> What happens to DMA from such devices when the guest assumes that it's
>> accessing RAM uncached and then triggers DMA?
>>
> 
> On ARM, stage 2 mappings that are more strict will supersede stage 1
> mappings, so the idea is to use cached mappings exclusively for stage
> 1 so that the host is fully in control of the actual memory attributes
> by setting the attributes at stage 2. This also makes sense because
> the host will ultimately know better whether some range that the guest
> thinks is a device is actually a device or just emulated (no stage 2
> mapping), backed by host memory (such as the NOR flash read case) or
> backed by a passthrough device.

Ok, so that means if the guest maps RAM as uncached, it will actually
end up as cached memory. Now if the guest programs a passed-through
device to DMA to that RAM, the DMA will conflict with the cache.

I don't know whether it's a big deal, but it's the scenario that came
up before when I discussed the approach above with people.


Alex

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 15:27       ` Alexander Graf
@ 2015-02-19 15:31         ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 15:31 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Laszlo Ersek, Christoffer Dall, Marc Zyngier, linux-arm-kernel,
	Peter Maydell, KVM devel mailing list, kvmarm, Paolo Bonzini

On 19 February 2015 at 15:27, Alexander Graf <agraf@suse.de> wrote:
>
>
> On 19.02.15 15:56, Ard Biesheuvel wrote:
>> On 19 February 2015 at 14:50, Alexander Graf <agraf@suse.de> wrote:
>>>
>>>
>>> On 19.02.15 11:54, Ard Biesheuvel wrote:
>>>> This is a 0th order approximation of how we could potentially force the guest
>>>> to avoid uncached mappings, at least from the moment the MMU is on. (Before
>>>> that, all of memory is implicitly classified as Device-nGnRnE)
>>>>
>>>> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
>>>> with cached ones. This way, there is no need to mangle any guest page tables.
>>>
>>> Would you mind giving a brief explanation of what this does? What
>>> happens to actually assigned devices that need to be mapped as uncached?
>>> What happens to DMA from such devices when the guest assumes that it's
>>> accessing RAM uncached and then triggers DMA?
>>>
>>
>> On ARM, stage 2 mappings that are more strict will supersede stage 1
>> mappings, so the idea is to use cached mappings exclusively for stage
>> 1 so that the host is fully in control of the actual memory attributes
>> by setting the attributes at stage 2. This also makes sense because
>> the host will ultimately know better whether some range that the guest
>> thinks is a device is actually a device or just emulated (no stage 2
>> mapping), backed by host memory (such as the NOR flash read case) or
>> backed by a passthrough device.
>
> Ok, so that means if the guest maps RAM as uncached, it will actually
> end up as cached memory. Now if the guest programs a passed-through
> device to DMA to that RAM, the DMA will conflict with the cache.
>
> I don't know whether it's a big deal, but it's the scenario that came up
> with the approach above before when I talked to people about it.
>

Well, I am using write-through read+write allocate, which hopefully
means that the actual RAM is kept in sync with the cache, but I must
confess I am a bit out of my depth here with the fine print in the ARM
ARM.

^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 10:54 ` Ard Biesheuvel
@ 2015-02-19 16:57   ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-02-19 16:57 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: kvm, marc.zyngier, pbonzini, lersek, kvmarm, linux-arm-kernel

On Thu, Feb 19, 2015 at 10:54:43AM +0000, Ard Biesheuvel wrote:
> This is a 0th order approximation of how we could potentially force the guest
> to avoid uncached mappings, at least from the moment the MMU is on. (Before
> that, all of memory is implicitly classified as Device-nGnRnE)
> 
> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
> with cached ones. This way, there is no need to mangle any guest page tables.
> 
> The downside is that, to do this correctly, we need to always trap writes to
> the VM sysreg group, which includes registers that the guest may write to very
> often. To reduce the associated performance hit, patch #1 introduces a fast path
> for EL2 to perform trivial sysreg writes on behalf of the guest, without the
> need for a full world switch to the host and back.
> 
> The main purpose of these patches is to quantify the performance hit, and
> verify whether the MAIR_EL1 handling works correctly. 
> 
> Ard Biesheuvel (3):
>   arm64: KVM: handle some sysreg writes in EL2
>   arm64: KVM: mangle MAIR register to prevent uncached guest mappings
>   arm64: KVM: keep trapping of VM sysreg writes enabled

Hi Ard,

I took this series for a test drive. Unfortunately I have bad news and worse
news. First, a description of the test: simply boot a guest; once at the login
prompt, log in, and then shut down with 'poweroff'. The guest boots through
AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map
pci mmio as cached' kludge. This test allows us to check for corrupt vram on
the graphical console, plus it completes a boot/shutdown cycle, allowing us to
count the sysreg traps of that cycle.

So, the bad news

Before this series we trapped 50 times on sysreg writes with the test
described above. With this series we trap 62873 times. But fewer than
20 of those required going to EL1.

(I don't have an exact number for how many times it went to EL1 because
 access_mair() doesn't have a trace point.)
(I got the 62873 number by testing a 3rd kernel build that only had patch
 3/3 applied to the base, and counting kvm_toggle_cache events.)
(The number 50 is the number of kvm_toggle_cache events *without* 3/3
 applied.)

I consider this bad news because, even considering it only goes to EL2,
it goes a ton more than it used to. I realize patch 3/3 isn't the final
plan for enabling traps though.

And, now the worse news

The vram corruption persists with this patch series.

drew


> 
>  arch/arm/kvm/mmu.c               |   2 +-
>  arch/arm64/include/asm/kvm_arm.h |   2 +-
>  arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
>  4 files changed, 156 insertions(+), 12 deletions(-)
> 
> -- 
> 1.8.3.2
> 
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 16:57   ` Andrew Jones
@ 2015-02-19 17:19     ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 17:19 UTC (permalink / raw)
  To: Andrew Jones
  Cc: KVM devel mailing list, Marc Zyngier, Paolo Bonzini,
	Laszlo Ersek, kvmarm, linux-arm-kernel

On 19 February 2015 at 16:57, Andrew Jones <drjones@redhat.com> wrote:
> On Thu, Feb 19, 2015 at 10:54:43AM +0000, Ard Biesheuvel wrote:
>> This is a 0th order approximation of how we could potentially force the guest
>> to avoid uncached mappings, at least from the moment the MMU is on. (Before
>> that, all of memory is implicitly classified as Device-nGnRnE)
>>
>> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
>> with cached ones. This way, there is no need to mangle any guest page tables.
>>
>> The downside is that, to do this correctly, we need to always trap writes to
>> the VM sysreg group, which includes registers that the guest may write to very
>> often. To reduce the associated performance hit, patch #1 introduces a fast path
>> for EL2 to perform trivial sysreg writes on behalf of the guest, without the
>> need for a full world switch to the host and back.
>>
>> The main purpose of these patches is to quantify the performance hit, and
>> verify whether the MAIR_EL1 handling works correctly.
>>
>> Ard Biesheuvel (3):
>>   arm64: KVM: handle some sysreg writes in EL2
>>   arm64: KVM: mangle MAIR register to prevent uncached guest mappings
>>   arm64: KVM: keep trapping of VM sysreg writes enabled
>
> Hi Ard,
>
> I took this series for test drive. Unfortunately I have bad news and worse
> news. First, a description of the test; simply boot a guest, once at login,
> login, and then shutdown with 'poweroff'. The guest boots through AAVMF using
> a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio
> as cached' kludge. This test allows us to check for corrupt vram on the
> graphical console, plus it completes a boot/shutdown cycle allowing us to
> count sysreg traps of the boot/shutdown cycle.
>

Thanks a lot for giving this a spin right away!

> So, the bad news
>
> Before this series we trapped 50 times on sysreg writes with the test
> described above. With this series we trap 62873 times. But, less than
> 20 required going to EL1.
>

OK, this is very useful information. We still don't know what the
penalty is of all those traps, but that's quite a big number indeed.

> (I don't have an exact number for how many times it went to EL1 because
>  access_mair() doesn't have a trace point.)
> (I got the 62873 number by testing a 3rd kernel build that only had patch
>  3/3 applied to the base, and counting kvm_toggle_cache events.)
> (The number 50 is the number of kvm_toggle_cache events *without* 3/3
>  applied.)
>
> I consider this bad news because, even considering it only goes to EL2,
> it goes a ton more than it used to. I realize patch 3/3 isn't the final
> plan for enabling traps though.
>
> And, now the worse news
>
> The vram corruption persists with this patch series.
>

OK, so the primary difference is that I am not substituting write-back
mappings, as Laszlo does in his patch.
If you have any energy left, would you mind having another go, but using
0xff (not 0xbb) for the MAIR values in patch #2?

>>
>>  arch/arm/kvm/mmu.c               |   2 +-
>>  arch/arm64/include/asm/kvm_arm.h |   2 +-
>>  arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
>>  arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
>>  4 files changed, 156 insertions(+), 12 deletions(-)
>>
>> --
>> 1.8.3.2
>>
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm@lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 17:19     ` Ard Biesheuvel
@ 2015-02-19 17:55       ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-02-19 17:55 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, Paolo Bonzini,
	Laszlo Ersek, kvmarm, linux-arm-kernel

On Thu, Feb 19, 2015 at 05:19:35PM +0000, Ard Biesheuvel wrote:
> On 19 February 2015 at 16:57, Andrew Jones <drjones@redhat.com> wrote:
> > On Thu, Feb 19, 2015 at 10:54:43AM +0000, Ard Biesheuvel wrote:
> >> This is a 0th order approximation of how we could potentially force the guest
> >> to avoid uncached mappings, at least from the moment the MMU is on. (Before
> >> that, all of memory is implicitly classified as Device-nGnRnE)
> >>
> >> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
> >> with cached ones. This way, there is no need to mangle any guest page tables.
> >>
> >> The downside is that, to do this correctly, we need to always trap writes to
> >> the VM sysreg group, which includes registers that the guest may write to very
> >> often. To reduce the associated performance hit, patch #1 introduces a fast path
> >> for EL2 to perform trivial sysreg writes on behalf of the guest, without the
> >> need for a full world switch to the host and back.
> >>
> >> The main purpose of these patches is to quantify the performance hit, and
> >> verify whether the MAIR_EL1 handling works correctly.
> >>
> >> Ard Biesheuvel (3):
> >>   arm64: KVM: handle some sysreg writes in EL2
> >>   arm64: KVM: mangle MAIR register to prevent uncached guest mappings
> >>   arm64: KVM: keep trapping of VM sysreg writes enabled
> >
> > Hi Ard,
> >
> > I took this series for test drive. Unfortunately I have bad news and worse
> > news. First, a description of the test; simply boot a guest, once at login,
> > login, and then shutdown with 'poweroff'. The guest boots through AAVMF using
> > a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio
> > as cached' kludge. This test allows us to check for corrupt vram on the
> > graphical console, plus it completes a boot/shutdown cycle allowing us to
> > count sysreg traps of the boot/shutdown cycle.
> >
> 
> Thanks a lot for giving this a spin right away!
> 
> > So, the bad news
> >
> > Before this series we trapped 50 times on sysreg writes with the test
> > described above. With this series we trap 62873 times. But, less than
> > 20 required going to EL1.
> >
> 
> OK, this is very useful information. We still don't know what the
> penalty is of all those traps, but that's quite a big number indeed.
> 
> > (I don't have an exact number for how many times it went to EL1 because
> >  access_mair() doesn't have a trace point.)
> > (I got the 62873 number by testing a 3rd kernel build that only had patch
> >  3/3 applied to the base, and counting kvm_toggle_cache events.)
> > (The number 50 is the number of kvm_toggle_cache events *without* 3/3
> >  applied.)
> >
> > I consider this bad news because, even considering it only goes to EL2,
> > it goes a ton more than it used to. I realize patch 3/3 isn't the final
> > plan for enabling traps though.
> >
> > And, now the worse news
> >
> > The vram corruption persists with this patch series.
> >
> 
> OK, so the primary difference is that I am not substituting for write
> back mappings, as Laszlo is doing in his patch.
> If you have energy left, would you mind having another go but use 0xff
> (not 0xbb) for the MAIR values in patch #2?

Yup, a bit of energy left, and, yup, 0xff fixes it.

Thanks,
drew

> 
> >>
> >>  arch/arm/kvm/mmu.c               |   2 +-
> >>  arch/arm64/include/asm/kvm_arm.h |   2 +-
> >>  arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
> >>  arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
> >>  4 files changed, 156 insertions(+), 12 deletions(-)
> >>
> >> --
> >> 1.8.3.2
> >>
> >> _______________________________________________
> >> kvmarm mailing list
> >> kvmarm@lists.cs.columbia.edu
> >> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 17:55       ` Andrew Jones
@ 2015-02-19 17:57         ` Paolo Bonzini
  -1 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-02-19 17:57 UTC (permalink / raw)
  To: Andrew Jones, Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, Laszlo Ersek, kvmarm,
	linux-arm-kernel



On 19/02/2015 18:55, Andrew Jones wrote:
>> > > (I don't have an exact number for how many times it went to EL1 because
>> > >  access_mair() doesn't have a trace point.)
>> > > (I got the 62873 number by testing a 3rd kernel build that only had patch
>> > >  3/3 applied to the base, and counting kvm_toggle_cache events.)
>> > > (The number 50 is the number of kvm_toggle_cache events *without* 3/3
>> > >  applied.)
>> > >
>> > > I consider this bad news because, even considering it only goes to EL2,
>> > > it goes a ton more than it used to. I realize patch 3/3 isn't the final
>> > > plan for enabling traps though.

If a full guest boots, can you try timing a kernel compile?

Paolo


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 17:55       ` Andrew Jones
@ 2015-02-19 18:44         ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 18:44 UTC (permalink / raw)
  To: Andrew Jones
  Cc: KVM devel mailing list, Marc Zyngier, Paolo Bonzini,
	Laszlo Ersek, kvmarm, linux-arm-kernel


> On 19 feb. 2015, at 17:55, Andrew Jones <drjones@redhat.com> wrote:
> 
>> On Thu, Feb 19, 2015 at 05:19:35PM +0000, Ard Biesheuvel wrote:
>>> On 19 February 2015 at 16:57, Andrew Jones <drjones@redhat.com> wrote:
>>>> On Thu, Feb 19, 2015 at 10:54:43AM +0000, Ard Biesheuvel wrote:
>>>> This is a 0th order approximation of how we could potentially force the guest
>>>> to avoid uncached mappings, at least from the moment the MMU is on. (Before
>>>> that, all of memory is implicitly classified as Device-nGnRnE)
>>>> 
>>>> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
>>>> with cached ones. This way, there is no need to mangle any guest page tables.
>>>> 
>>>> The downside is that, to do this correctly, we need to always trap writes to
>>>> the VM sysreg group, which includes registers that the guest may write to very
>>>> often. To reduce the associated performance hit, patch #1 introduces a fast path
>>>> for EL2 to perform trivial sysreg writes on behalf of the guest, without the
>>>> need for a full world switch to the host and back.
>>>> 
>>>> The main purpose of these patches is to quantify the performance hit, and
>>>> verify whether the MAIR_EL1 handling works correctly.
>>>> 
>>>> Ard Biesheuvel (3):
>>>>  arm64: KVM: handle some sysreg writes in EL2
>>>>  arm64: KVM: mangle MAIR register to prevent uncached guest mappings
>>>>  arm64: KVM: keep trapping of VM sysreg writes enabled
>>> 
>>> Hi Ard,
>>> 
>>> I took this series for test drive. Unfortunately I have bad news and worse
>>> news. First, a description of the test; simply boot a guest, once at login,
>>> login, and then shutdown with 'poweroff'. The guest boots through AAVMF using
>>> a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio
>>> as cached' kludge. This test allows us to check for corrupt vram on the
>>> graphical console, plus it completes a boot/shutdown cycle allowing us to
>>> count sysreg traps of the boot/shutdown cycle.
>> 
>> Thanks a lot for giving this a spin right away!
>> 
>>> So, the bad news
>>> 
>>> Before this series we trapped 50 times on sysreg writes with the test
>>> described above. With this series we trap 62873 times. But, less than
>>> 20 required going to EL1.
>> 
>> OK, this is very useful information. We still don't know what the
>> penalty is of all those traps, but that's quite a big number indeed.
>> 
>>> (I don't have an exact number for how many times it went to EL1 because
>>> access_mair() doesn't have a trace point.)
>>> (I got the 62873 number by testing a 3rd kernel build that only had patch
>>> 3/3 applied to the base, and counting kvm_toggle_cache events.)
>>> (The number 50 is the number of kvm_toggle_cache events *without* 3/3
>>> applied.)
>>> 
>>> I consider this bad news because, even considering it only goes to EL2,
>>> it goes a ton more than it used to. I realize patch 3/3 isn't the final
>>> plan for enabling traps though.
>>> 
>>> And, now the worse news
>>> 
>>> The vram corruption persists with this patch series.
>> 
>> OK, so the primary difference is that I am not substituting for write
>> back mappings, as Laszlo is doing in his patch.
>> If you have energy left, would you mind having another go but use 0xff
>> (not 0xbb) for the MAIR values in patch #2?
> 
> Yup, a bit energy left, and, yup, 0xff fixes it

OK, so that means we'd need to map as write-back cacheable by default, and restrict it as necessary at stage 2.

thanks


> Thanks,
> drew
> 
>> 
>>>> 
>>>> arch/arm/kvm/mmu.c               |   2 +-
>>>> arch/arm64/include/asm/kvm_arm.h |   2 +-
>>>> arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
>>>> arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
>>>> 4 files changed, 156 insertions(+), 12 deletions(-)
>>>> 
>>>> --
>>>> 1.8.3.2
>>>> 
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> kvmarm@lists.cs.columbia.edu
>>>> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


* [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
@ 2015-02-19 18:44         ` Ard Biesheuvel
  0 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-19 18:44 UTC (permalink / raw)
  To: linux-arm-kernel


> On 19 feb. 2015, at 17:55, Andrew Jones <drjones@redhat.com> wrote:
> 
>> On Thu, Feb 19, 2015 at 05:19:35PM +0000, Ard Biesheuvel wrote:
>>> On 19 February 2015 at 16:57, Andrew Jones <drjones@redhat.com> wrote:
>>>> On Thu, Feb 19, 2015 at 10:54:43AM +0000, Ard Biesheuvel wrote:
>>>> This is a 0th order approximation of how we could potentially force the guest
>>>> to avoid uncached mappings, at least from the moment the MMU is on. (Before
>>>> that, all of memory is implicitly classified as Device-nGnRnE)
>>>> 
>>>> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
>>>> with cached ones. This way, there is no need to mangle any guest page tables.
>>>> 
>>>> The downside is that, to do this correctly, we need to always trap writes to
>>>> the VM sysreg group, which includes registers that the guest may write to very
>>>> often. To reduce the associated performance hit, patch #1 introduces a fast path
>>>> for EL2 to perform trivial sysreg writes on behalf of the guest, without the
>>>> need for a full world switch to the host and back.
>>>> 
>>>> The main purpose of these patches is to quantify the performance hit, and
>>>> verify whether the MAIR_EL1 handling works correctly.
>>>> 
>>>> Ard Biesheuvel (3):
>>>>  arm64: KVM: handle some sysreg writes in EL2
>>>>  arm64: KVM: mangle MAIR register to prevent uncached guest mappings
>>>>  arm64: KVM: keep trapping of VM sysreg writes enabled
>>> 
>>> Hi Ard,
>>> 
>>> I took this series for test drive. Unfortunately I have bad news and worse
>>> news. First, a description of the test; simply boot a guest, once at login,
>>> login, and then shutdown with 'poweroff'. The guest boots through AAVMF using
>>> a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio
>>> as cached' kludge. This test allows us to check for corrupt vram on the
>>> graphical console, plus it completes a boot/shutdown cycle allowing us to
>>> count sysreg traps of the boot/shutdown cycle.
>> 
>> Thanks a lot for giving this a spin right away!
>> 
>>> So, the bad news
>>> 
>>> Before this series we trapped 50 times on sysreg writes with the test
>>> described above. With this series we trap 62873 times. But, less than
>>> 20 required going to EL1.
>> 
>> OK, this is very useful information. We still don't know what the
>> penalty is of all those traps, but that's quite a big number indeed.
>> 
>>> (I don't have an exact number for how many times it went to EL1 because
>>> access_mair() doesn't have a trace point.)
>>> (I got the 62873 number by testing a 3rd kernel build that only had patch
>>> 3/3 applied to the base, and counting kvm_toggle_cache events.)
>>> (The number 50 is the number of kvm_toggle_cache events *without* 3/3
>>> applied.)
>>> 
>>> I consider this bad news because, even considering it only goes to EL2,
>>> it goes a ton more than it used to. I realize patch 3/3 isn't the final
>>> plan for enabling traps though.
>>> 
>>> And, now the worse news
>>> 
>>> The vram corruption persists with this patch series.
>> 
>> OK, so the primary difference is that I am not substituting for write
>> back mappings, as Laszlo is doing in his patch.
>> If you have energy left, would you mind having another go but use 0xff
>> (not 0xbb) for the MAIR values in patch #2?
> 
> Yup, a bit energy left, and, yup, 0xff fixes it

OK, so that means we'd need to map as write-back cacheable by default, and restrict it as necessary at stage 2.

thanks


> Thanks,
> drew
> 
>> 
>>>> 
>>>> arch/arm/kvm/mmu.c               |   2 +-
>>>> arch/arm64/include/asm/kvm_arm.h |   2 +-
>>>> arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
>>>> arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
>>>> 4 files changed, 156 insertions(+), 12 deletions(-)
>>>> 
>>>> --
>>>> 1.8.3.2
>>>> 
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> kvmarm@lists.cs.columbia.edu
>>>> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 17:57         ` Paolo Bonzini
@ 2015-02-20 14:29           ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-02-20 14:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Laszlo Ersek, kvmarm, linux-arm-kernel

On Thu, Feb 19, 2015 at 06:57:24PM +0100, Paolo Bonzini wrote:
> 
> 
> On 19/02/2015 18:55, Andrew Jones wrote:
> >> > > (I don't have an exact number for how many times it went to EL1 because
> >> > >  access_mair() doesn't have a trace point.)
> >> > > (I got the 62873 number by testing a 3rd kernel build that only had patch
> >> > >  3/3 applied to the base, and counting kvm_toggle_cache events.)
> >> > > (The number 50 is the number of kvm_toggle_cache events *without* 3/3
> >> > >  applied.)
> >> > >
> >> > > I consider this bad news because, even considering it only goes to EL2,
> >> > > it goes a ton more than it used to. I realize patch 3/3 isn't the final
> >> > > plan for enabling traps though.
> 
> If a full guest boots, can you try timing a kernel compile?
>

Guests boot. I used an 8 vcpu, 14G memory guest; compiled the kernel 4
times inside the guest for each host kernel; base and mair. I dropped
the time from the first run of each set, and captured the other 3.
Command line used below. Time is from the
  Elapsed (wall clock) time (h:mm:ss or m:ss):
output of /usr/bin/time - the host's wall clock.

  /usr/bin/time --verbose ssh $VM 'cd kernel && make -s clean && make -s -j8'

Results:
base: 3:06.11 3:07.00 3:10.93
mair: 3:08.47 3:06.75 3:04.76

So it looks like the three orders of magnitude greater number of traps
(only to EL2) doesn't impact kernel compiles.

Then I thought I'd be able to quickly measure the number of cycles
a trap to EL2 takes with this kvm-unit-tests test

int main(void)
{
	unsigned long start, end;
	unsigned int sctlr;

	asm volatile(
	"	mrs %0, sctlr_el1\n"
	"	msr pmcr_el0, %1\n"
	: "=&r" (sctlr) : "r" (5));

	asm volatile(
	"	mrs %0, pmccntr_el0\n"
	"	msr sctlr_el1, %2\n"
	"	mrs %1, pmccntr_el0\n"
	: "=&r" (start), "=&r" (end) : "r" (sctlr));

	printf("%llx\n", end - start);
	return 0;
}

after applying this patch to kvm

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index bb91b6fc63861..5de39d740aa58 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -770,7 +770,7 @@
 
 	mrs	x2, mdcr_el2
 	and	x2, x2, #MDCR_EL2_HPMN_MASK
-	orr	x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
+//	orr	x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
 	orr	x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
 
 	// Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap

But I get zero for the cycle count. Not sure what I'm missing.

drew

^ permalink raw reply related	[flat|nested] 110+ messages in thread


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-20 14:29           ` Andrew Jones
@ 2015-02-20 14:37             ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-20 14:37 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Paolo Bonzini, Laszlo Ersek, Christoffer Dall, Marc Zyngier,
	linux-arm-kernel, Peter Maydell, kvmarm, KVM devel mailing list

On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
> On Thu, Feb 19, 2015 at 06:57:24PM +0100, Paolo Bonzini wrote:
>>
>>
>> On 19/02/2015 18:55, Andrew Jones wrote:
>> >> > > (I don't have an exact number for how many times it went to EL1 because
>> >> > >  access_mair() doesn't have a trace point.)
>> >> > > (I got the 62873 number by testing a 3rd kernel build that only had patch
>> >> > >  3/3 applied to the base, and counting kvm_toggle_cache events.)
>> >> > > (The number 50 is the number of kvm_toggle_cache events *without* 3/3
>> >> > >  applied.)
>> >> > >
>> >> > > I consider this bad news because, even considering it only goes to EL2,
>> >> > > it goes a ton more than it used to. I realize patch 3/3 isn't the final
>> >> > > plan for enabling traps though.
>>
>> If a full guest boots, can you try timing a kernel compile?
>>
>
> Guests boot. I used an 8 vcpu, 14G memory guest; compiled the kernel 4
> times inside the guest for each host kernel; base and mair. I dropped
> the time from the first run of each set, and captured the other 3.
> Command line used below. Time is from the
>   Elapsed (wall clock) time (h:mm:ss or m:ss):
> output of /usr/bin/time - the host's wall clock.
>
>   /usr/bin/time --verbose ssh $VM 'cd kernel && make -s clean && make -s -j8'
>
> Results:
> base: 3:06.11 3:07.00 3:10.93
> mair: 3:08.47 3:06.75 3:04.76
>
> So looks like the 3 orders of magnitude greater number of traps
> (only to el2) don't impact kernel compiles.
>

OK, good! That was what I was hoping for, obviously.

> Then I thought I'd be able to quick measure the number of cycles
> a trap to el2 takes with this kvm-unit-tests test
>
> int main(void)
> {
>         unsigned long start, end;
>         unsigned int sctlr;
>
>         asm volatile(
>         "       mrs %0, sctlr_el1\n"
>         "       msr pmcr_el0, %1\n"
>         : "=&r" (sctlr) : "r" (5));
>
>         asm volatile(
>         "       mrs %0, pmccntr_el0\n"
>         "       msr sctlr_el1, %2\n"
>         "       mrs %1, pmccntr_el0\n"
>         : "=&r" (start), "=&r" (end) : "r" (sctlr));
>
>         printf("%llx\n", end - start);
>         return 0;
> }
>
> after applying this patch to kvm
>
> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> index bb91b6fc63861..5de39d740aa58 100644
> --- a/arch/arm64/kvm/hyp.S
> +++ b/arch/arm64/kvm/hyp.S
> @@ -770,7 +770,7 @@
>
>         mrs     x2, mdcr_el2
>         and     x2, x2, #MDCR_EL2_HPMN_MASK
> -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
>         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
>
>         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
>
> But I get zero for the cycle count. Not sure what I'm missing.
>

No clue tbh. Does the counter work as expected in the host?

-- 
Ard.

^ permalink raw reply	[flat|nested] 110+ messages in thread


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-20 14:37             ` Ard Biesheuvel
@ 2015-02-20 15:36               ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-02-20 15:36 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, Paolo Bonzini,
	Laszlo Ersek, kvmarm, linux-arm-kernel

On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
> On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
> > So looks like the 3 orders of magnitude greater number of traps
> > (only to el2) don't impact kernel compiles.
> >
> 
> OK, good! That was what I was hoping for, obviously.
> 
> > Then I thought I'd be able to quick measure the number of cycles
> > a trap to el2 takes with this kvm-unit-tests test
> >
> > int main(void)
> > {
> >         unsigned long start, end;
> >         unsigned int sctlr;
> >
> >         asm volatile(
> >         "       mrs %0, sctlr_el1\n"
> >         "       msr pmcr_el0, %1\n"
> >         : "=&r" (sctlr) : "r" (5));
> >
> >         asm volatile(
> >         "       mrs %0, pmccntr_el0\n"
> >         "       msr sctlr_el1, %2\n"
> >         "       mrs %1, pmccntr_el0\n"
> >         : "=&r" (start), "=&r" (end) : "r" (sctlr));
> >
> >         printf("%llx\n", end - start);
> >         return 0;
> > }
> >
> > after applying this patch to kvm
> >
> > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> > index bb91b6fc63861..5de39d740aa58 100644
> > --- a/arch/arm64/kvm/hyp.S
> > +++ b/arch/arm64/kvm/hyp.S
> > @@ -770,7 +770,7 @@
> >
> >         mrs     x2, mdcr_el2
> >         and     x2, x2, #MDCR_EL2_HPMN_MASK
> > -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> > +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> >         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
> >
> >         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
> >
> > But I get zero for the cycle count. Not sure what I'm missing.
> >
> 
> No clue tbh. Does the counter work as expected in the host?
>

Guess not. I dropped the test into a module_init and inserted
it on the host. Always get zero for pmccntr_el0 reads. Or, if
I set it to something non-zero with a write, then I always get
that back - no increments. pmcr_el0 looks OK... I had forgotten
to set bit 31 of pmcntenset_el0, but doing that still doesn't
help. Anyway, I assume the problem is me. I'll keep looking to
see what I'm missing.

drew

^ permalink raw reply	[flat|nested] 110+ messages in thread


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-20 15:36               ` Andrew Jones
@ 2015-02-24 14:55                 ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-02-24 14:55 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, Paolo Bonzini,
	Laszlo Ersek, kvmarm, linux-arm-kernel

[-- Attachment #1: Type: text/plain, Size: 3011 bytes --]

On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
> > On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
> > > So looks like the 3 orders of magnitude greater number of traps
> > > (only to el2) don't impact kernel compiles.
> > >
> > 
> > OK, good! That was what I was hoping for, obviously.
> > 
> > > Then I thought I'd be able to quick measure the number of cycles
> > > a trap to el2 takes with this kvm-unit-tests test
> > >
> > > int main(void)
> > > {
> > >         unsigned long start, end;
> > >         unsigned int sctlr;
> > >
> > >         asm volatile(
> > >         "       mrs %0, sctlr_el1\n"
> > >         "       msr pmcr_el0, %1\n"
> > >         : "=&r" (sctlr) : "r" (5));
> > >
> > >         asm volatile(
> > >         "       mrs %0, pmccntr_el0\n"
> > >         "       msr sctlr_el1, %2\n"
> > >         "       mrs %1, pmccntr_el0\n"
> > >         : "=&r" (start), "=&r" (end) : "r" (sctlr));
> > >
> > >         printf("%llx\n", end - start);
> > >         return 0;
> > > }
> > >
> > > after applying this patch to kvm
> > >
> > > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> > > index bb91b6fc63861..5de39d740aa58 100644
> > > --- a/arch/arm64/kvm/hyp.S
> > > +++ b/arch/arm64/kvm/hyp.S
> > > @@ -770,7 +770,7 @@
> > >
> > >         mrs     x2, mdcr_el2
> > >         and     x2, x2, #MDCR_EL2_HPMN_MASK
> > > -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> > > +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> > >         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
> > >
> > >         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
> > >
> > > But I get zero for the cycle count. Not sure what I'm missing.
> > >
> > 
> > No clue tbh. Does the counter work as expected in the host?
> >
> 
> Guess not. I dropped the test into a module_init and inserted
> it on the host. Always get zero for pmccntr_el0 reads. Or, if
> I set it to something non-zero with a write, then I always get
> that back - no increments. pmcr_el0 looks OK... I had forgotten
> to set bit 31 of pmcntenset_el0, but doing that still doesn't
> help. Anyway, I assume the problem is me. I'll keep looking to
> see what I'm missing.
>

I returned to this and see that the problem was indeed me. I needed yet
another enable bit set (the filter register needed to be instructed to
count cycles while in el2). I've attached the code for the curious.
The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
running on a host without this patch series (after TVM traps have been
disabled), I get a pretty consistent 40.

I checked how many vm-sysreg traps we do during the kernel compile
benchmark. It's 124924. So it's a bit strange that we don't see the
benchmark taking 10 to 20 seconds longer on average. I should probably
double check my runs. In any case, while I like the approach of this
series, the overhead is looking non-negligible.

drew

[-- Attachment #2: trapcycles.c --]
[-- Type: text/plain, Size: 798 bytes --]

#include <libcflat.h>

static void prep_cc(void)
{
	asm volatile(
	"	msr pmovsclr_el0, %0\n"	/* clear the cycle counter overflow flag */
	"	msr pmccfiltr_el0, %1\n"	/* NSH: count cycles in EL2 as well */
	"	msr pmcntenset_el0, %2\n"	/* enable the cycle counter */
	"	msr pmcr_el0, %3\n"	/* LC | C (counter reset) | E (enable) */
	"	isb\n"
	:
	: "r" (1 << 31), "r" (1 << 27), "r" (1 << 31),
	  "r" (1 << 6 | 1 << 2 | 1 << 0));
}

int main(void)
{
	unsigned long start, end;
	unsigned int sctlr;
	int i, zeros = 0;

	asm volatile("mrs %0, sctlr_el1" : "=&r" (sctlr));
	prep_cc();

	for (i = 0; i < 100000; ++i) {
		asm volatile(
		"	mrs %0, pmccntr_el0\n"
		"	msr sctlr_el1, %2\n"
		"	mrs %1, pmccntr_el0\n"
		"	isb\n"
		: "=&r" (start), "=&r" (end) : "r" (sctlr));

		if ((i % 10) == 0)
			printf("\n");

		printf(" %d", end - start);

		if ((end - start) == 0) {
			++zeros;
			prep_cc();
		}
	}

	printf("\nnum zero counts = %d\n", zeros);
	return 0;
}


^ permalink raw reply	[flat|nested] 110+ messages in thread


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-24 14:55                 ` Andrew Jones
@ 2015-02-24 17:47                   ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-02-24 17:47 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Paolo Bonzini, Laszlo Ersek, Christoffer Dall, Marc Zyngier,
	linux-arm-kernel, Peter Maydell, kvmarm, KVM devel mailing list

On 24 February 2015 at 14:55, Andrew Jones <drjones@redhat.com> wrote:
> On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
>> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
>> > On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
>> > > So looks like the 3 orders of magnitude greater number of traps
>> > > (only to el2) don't impact kernel compiles.
>> > >
>> >
>> > OK, good! That was what I was hoping for, obviously.
>> >
>> > > Then I thought I'd be able to quick measure the number of cycles
>> > > a trap to el2 takes with this kvm-unit-tests test
>> > >
>> > > int main(void)
>> > > {
>> > >         unsigned long start, end;
>> > >         unsigned int sctlr;
>> > >
>> > >         asm volatile(
>> > >         "       mrs %0, sctlr_el1\n"
>> > >         "       msr pmcr_el0, %1\n"
>> > >         : "=&r" (sctlr) : "r" (5));
>> > >
>> > >         asm volatile(
>> > >         "       mrs %0, pmccntr_el0\n"
>> > >         "       msr sctlr_el1, %2\n"
>> > >         "       mrs %1, pmccntr_el0\n"
>> > >         : "=&r" (start), "=&r" (end) : "r" (sctlr));
>> > >
>> > >         printf("%llx\n", end - start);
>> > >         return 0;
>> > > }
>> > >
>> > > after applying this patch to kvm
>> > >
>> > > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
>> > > index bb91b6fc63861..5de39d740aa58 100644
>> > > --- a/arch/arm64/kvm/hyp.S
>> > > +++ b/arch/arm64/kvm/hyp.S
>> > > @@ -770,7 +770,7 @@
>> > >
>> > >         mrs     x2, mdcr_el2
>> > >         and     x2, x2, #MDCR_EL2_HPMN_MASK
>> > > -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
>> > > +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
>> > >         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
>> > >
>> > >         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
>> > >
>> > > But I get zero for the cycle count. Not sure what I'm missing.
>> > >
>> >
>> > No clue tbh. Does the counter work as expected in the host?
>> >
>>
>> Guess not. I dropped the test into a module_init and inserted
>> it on the host. Always get zero for pmccntr_el0 reads. Or, if
>> I set it to something non-zero with a write, then I always get
>> that back - no increments. pmcr_el0 looks OK... I had forgotten
>> to set bit 31 of pmcntenset_el0, but doing that still doesn't
>> help. Anyway, I assume the problem is me. I'll keep looking to
>> see what I'm missing.
>>
>
> I returned to this and see that the problem was indeed me. I needed yet
> another enable bit set (the filter register needed to be instructed to
> count cycles while in el2). I've attached the code for the curious.
> The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
> running on a host without this patch series (after TVM traps have been
> disabled), I get a pretty consistent 40.
>
> I checked how many vm-sysreg traps we do during the kernel compile
> benchmark. It's 124924. So it's a bit strange that we don't see the
> benchmark taking 10 to 20 seconds longer on average. I should probably
> double check my runs. In any case, while I like the approach of this
> series, the overhead is looking non-negligible.
>

Thanks a lot for producing these numbers. 125k x 7k == <1 billion
cycles == <1 second on a >1 GHz machine, I think?
Or am I missing something? How long does the actual compile take?

-- 
Ard.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-24 17:47                   ` Ard Biesheuvel
@ 2015-02-24 19:12                     ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-02-24 19:12 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, Paolo Bonzini,
	Laszlo Ersek, kvmarm, linux-arm-kernel

On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
> On 24 February 2015 at 14:55, Andrew Jones <drjones@redhat.com> wrote:
> > On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
> >> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
> >> > On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
> >> > > So looks like the 3 orders of magnitude greater number of traps
> >> > > (only to el2) don't impact kernel compiles.
> >> > >
> >> >
> >> > OK, good! That was what I was hoping for, obviously.
> >> >
> >> > > Then I thought I'd be able to quick measure the number of cycles
> >> > > a trap to el2 takes with this kvm-unit-tests test
> >> > >
> >> > > int main(void)
> >> > > {
> >> > >         unsigned long start, end;
> >> > >         unsigned int sctlr;
> >> > >
> >> > >         asm volatile(
> >> > >         "       mrs %0, sctlr_el1\n"
> >> > >         "       msr pmcr_el0, %1\n"
> >> > >         : "=&r" (sctlr) : "r" (5));
> >> > >
> >> > >         asm volatile(
> >> > >         "       mrs %0, pmccntr_el0\n"
> >> > >         "       msr sctlr_el1, %2\n"
> >> > >         "       mrs %1, pmccntr_el0\n"
> >> > >         : "=&r" (start), "=&r" (end) : "r" (sctlr));
> >> > >
> >> > >         printf("%llx\n", end - start);
> >> > >         return 0;
> >> > > }
> >> > >
> >> > > after applying this patch to kvm
> >> > >
> >> > > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> >> > > index bb91b6fc63861..5de39d740aa58 100644
> >> > > --- a/arch/arm64/kvm/hyp.S
> >> > > +++ b/arch/arm64/kvm/hyp.S
> >> > > @@ -770,7 +770,7 @@
> >> > >
> >> > >         mrs     x2, mdcr_el2
> >> > >         and     x2, x2, #MDCR_EL2_HPMN_MASK
> >> > > -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> >> > > +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> >> > >         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
> >> > >
> >> > >         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
> >> > >
> >> > > But I get zero for the cycle count. Not sure what I'm missing.
> >> > >
> >> >
> >> > No clue tbh. Does the counter work as expected in the host?
> >> >
> >>
> >> Guess not. I dropped the test into a module_init and inserted
> >> it on the host. Always get zero for pmccntr_el0 reads. Or, if
> >> I set it to something non-zero with a write, then I always get
> >> that back - no increments. pmcr_el0 looks OK... I had forgotten
> >> to set bit 31 of pmcntenset_el0, but doing that still doesn't
> >> help. Anyway, I assume the problem is me. I'll keep looking to
> >> see what I'm missing.
> >>
> >
> > I returned to this and see that the problem was indeed me. I needed yet
> > another enable bit set (the filter register needed to be instructed to
> > count cycles while in el2). I've attached the code for the curious.
> > The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
> > running on a host without this patch series (after TVM traps have been
> > disabled), I get a pretty consistent 40.
> >
> > I checked how many vm-sysreg traps we do during the kernel compile
> > benchmark. It's 124924. So it's a bit strange that we don't see the
> > benchmark taking 10 to 20 seconds longer on average. I should probably
> > double check my runs. In any case, while I like the approach of this
> > series, the overhead is looking non-negligible.
> >
> 
> Thanks a lot for producing these numbers. 125k x 7k == <1 billion
> cycles == <1 second on a >1 GHz machine, I think?
> Or am I missing something? How long does the actual compile take?
>

Wait, my fault. I dropped a pretty big divisor in my calculation. Don't
ask... I'll just go home and study one of my daughter's math books now...

So, I even have a 2.4 GHz machine, which explains why the benchmark times
are the same with and without this series (those times are provided earlier
in this thread; they're roughly 03:10). I'm glad you straightened me out. I
was second-guessing my benchmark results and considering redoing them.

Anyway, this series, at least wrt overhead, is looking good again.

Thanks,
drew

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-24 17:47                   ` Ard Biesheuvel
@ 2015-03-02 16:31                     ` Christoffer Dall
  -1 siblings, 0 replies; 110+ messages in thread
From: Christoffer Dall @ 2015-03-02 16:31 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, Paolo Bonzini,
	Laszlo Ersek, kvmarm, linux-arm-kernel

On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
> On 24 February 2015 at 14:55, Andrew Jones <drjones@redhat.com> wrote:
> > On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
> >> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
> >> > On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
> >> > > So looks like the 3 orders of magnitude greater number of traps
> >> > > (only to el2) don't impact kernel compiles.
> >> > >
> >> >
> >> > OK, good! That was what I was hoping for, obviously.
> >> >
> >> > > Then I thought I'd be able to quick measure the number of cycles
> >> > > a trap to el2 takes with this kvm-unit-tests test
> >> > >
> >> > > int main(void)
> >> > > {
> >> > >         unsigned long start, end;
> >> > >         unsigned int sctlr;
> >> > >
> >> > >         asm volatile(
> >> > >         "       mrs %0, sctlr_el1\n"
> >> > >         "       msr pmcr_el0, %1\n"
> >> > >         : "=&r" (sctlr) : "r" (5));
> >> > >
> >> > >         asm volatile(
> >> > >         "       mrs %0, pmccntr_el0\n"
> >> > >         "       msr sctlr_el1, %2\n"
> >> > >         "       mrs %1, pmccntr_el0\n"
> >> > >         : "=&r" (start), "=&r" (end) : "r" (sctlr));
> >> > >
> >> > >         printf("%llx\n", end - start);
> >> > >         return 0;
> >> > > }
> >> > >
> >> > > after applying this patch to kvm
> >> > >
> >> > > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> >> > > index bb91b6fc63861..5de39d740aa58 100644
> >> > > --- a/arch/arm64/kvm/hyp.S
> >> > > +++ b/arch/arm64/kvm/hyp.S
> >> > > @@ -770,7 +770,7 @@
> >> > >
> >> > >         mrs     x2, mdcr_el2
> >> > >         and     x2, x2, #MDCR_EL2_HPMN_MASK
> >> > > -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> >> > > +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> >> > >         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
> >> > >
> >> > >         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
> >> > >
> >> > > But I get zero for the cycle count. Not sure what I'm missing.
> >> > >
> >> >
> >> > No clue tbh. Does the counter work as expected in the host?
> >> >
> >>
> >> Guess not. I dropped the test into a module_init and inserted
> >> it on the host. Always get zero for pmccntr_el0 reads. Or, if
> >> I set it to something non-zero with a write, then I always get
> >> that back - no increments. pmcr_el0 looks OK... I had forgotten
> >> to set bit 31 of pmcntenset_el0, but doing that still doesn't
> >> help. Anyway, I assume the problem is me. I'll keep looking to
> >> see what I'm missing.
> >>
> >
> > I returned to this and see that the problem was indeed me. I needed yet
> > another enable bit set (the filter register needed to be instructed to
> > count cycles while in el2). I've attached the code for the curious.
> > The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
> > running on a host without this patch series (after TVM traps have been
> > disabled), I get a pretty consistent 40.
> >
> > I checked how many vm-sysreg traps we do during the kernel compile
> > benchmark. It's 124924. So it's a bit strange that we don't see the
> > benchmark taking 10 to 20 seconds longer on average. I should probably
> > double check my runs. In any case, while I like the approach of this
> > series, the overhead is looking non-negligible.
> >
> 
> Thanks a lot for producing these numbers. 125k x 7k == <1 billion
> cycles == <1 second on a >1 GHz machine, I think?
> Or am I missing something? How long does the actual compile take?
> 
I ran a sequence of benchmarks that I occasionally run (pbzip,
kernbench, and hackbench) and I also saw < 1% performance degradation,
so I think we can trust that somewhat.  (I can post the raw numbers when
I have ssh access to my Linux desktop - sending this from Somewhere Over
The Atlantic).

However, my concern with these patches are on two points:

1. It's not a fix-all.  We still have the case where the guest expects
the behavior of device memory (for strong ordering for example) on a RAM
region, which we now break.  Similarly, this doesn't support the
non-coherent DMA to RAM region case.

2. While the code is probably as nice as this kind of stuff gets, it
is non-trivial and extremely difficult to debug.  The counter-point here
is that we may end up handling other stuff at EL2 for performance reasons
in the future.

Mainly because of point 1 above, I am leaning towards thinking userspace
should do the invalidation when it knows it needs to, either through KVM
via a memslot flag or through some other syscall mechanism.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-02 16:31                     ` Christoffer Dall
@ 2015-03-02 16:47                       ` Paolo Bonzini
  -1 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-03-02 16:47 UTC (permalink / raw)
  To: Christoffer Dall, Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, Laszlo Ersek, kvmarm,
	linux-arm-kernel



On 02/03/2015 17:31, Christoffer Dall wrote:
> 2. While the code is probably as nice as this kind of stuff gets, it
> is non-trivial and extremely difficult to debug.  The counter-point here
> is that we may end up handling other stuff at EL2 for performanc reasons
> in the future.
> 
> Mainly because of point 1 above, I am leaning to thinking userspace
> should do the invalidation when it knows it needs to, either through KVM
> via a memslot flag or through some other syscall mechanism.

I'm okay with adding a KVM capability and ioctl that flushes the dcache
for a given gpa range.  However:

1) I'd like to have an implementation for QEMU and/or kvmtool before
accepting that ioctl.

2) I think the ioctl should work whatever the stage1 mapping is (e.g.
with and without Ard's patches, with and without Laszlo's OVMF patch, etc.).

Also, we may want to invalidate the cache for dirty pages before
returning the dirty bitmap, and probably should do that directly in
KVM_GET_DIRTY_LOG.

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-02 16:31                     ` Christoffer Dall
@ 2015-03-02 16:48                       ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-03-02 16:48 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Paolo Bonzini, Laszlo Ersek, kvmarm, linux-arm-kernel

On Mon, Mar 02, 2015 at 08:31:46AM -0800, Christoffer Dall wrote:
> On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
> > On 24 February 2015 at 14:55, Andrew Jones <drjones@redhat.com> wrote:
> > > On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
> > >> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
> > >> > On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
> > >> > > So looks like the 3 orders of magnitude greater number of traps
> > >> > > (only to el2) don't impact kernel compiles.
> > >> > >
> > >> >
> > >> > OK, good! That was what I was hoping for, obviously.
> > >> >
> > >> > > Then I thought I'd be able to quick measure the number of cycles
> > >> > > a trap to el2 takes with this kvm-unit-tests test
> > >> > >
> > >> > > int main(void)
> > >> > > {
> > >> > >         unsigned long start, end;
> > >> > >         unsigned int sctlr;
> > >> > >
> > >> > >         asm volatile(
> > >> > >         "       mrs %0, sctlr_el1\n"
> > >> > >         "       msr pmcr_el0, %1\n"
> > >> > >         : "=&r" (sctlr) : "r" (5));
> > >> > >
> > >> > >         asm volatile(
> > >> > >         "       mrs %0, pmccntr_el0\n"
> > >> > >         "       msr sctlr_el1, %2\n"
> > >> > >         "       mrs %1, pmccntr_el0\n"
> > >> > >         : "=&r" (start), "=&r" (end) : "r" (sctlr));
> > >> > >
> > >> > >         printf("%llx\n", end - start);
> > >> > >         return 0;
> > >> > > }
> > >> > >
> > >> > > after applying this patch to kvm
> > >> > >
> > >> > > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> > >> > > index bb91b6fc63861..5de39d740aa58 100644
> > >> > > --- a/arch/arm64/kvm/hyp.S
> > >> > > +++ b/arch/arm64/kvm/hyp.S
> > >> > > @@ -770,7 +770,7 @@
> > >> > >
> > >> > >         mrs     x2, mdcr_el2
> > >> > >         and     x2, x2, #MDCR_EL2_HPMN_MASK
> > >> > > -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> > >> > > +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> > >> > >         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
> > >> > >
> > >> > >         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
> > >> > >
> > >> > > But I get zero for the cycle count. Not sure what I'm missing.
> > >> > >
> > >> >
> > >> > No clue tbh. Does the counter work as expected in the host?
> > >> >
> > >>
> > >> Guess not. I dropped the test into a module_init and inserted
> > >> it on the host. Always get zero for pmccntr_el0 reads. Or, if
> > >> I set it to something non-zero with a write, then I always get
> > >> that back - no increments. pmcr_el0 looks OK... I had forgotten
> > >> to set bit 31 of pmcntenset_el0, but doing that still doesn't
> > >> help. Anyway, I assume the problem is me. I'll keep looking to
> > >> see what I'm missing.
> > >>
> > >
> > > I returned to this and see that the problem was indeed me. I needed yet
> > > another enable bit set (the filter register needed to be instructed to
> > > count cycles while in el2). I've attached the code for the curious.
> > > The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
> > > running on a host without this patch series (after TVM traps have been
> > > disabled), I get a pretty consistent 40.
> > >
> > > I checked how many vm-sysreg traps we do during the kernel compile
> > > benchmark. It's 124924. So it's a bit strange that we don't see the
> > > benchmark taking 10 to 20 seconds longer on average. I should probably
> > > double check my runs. In any case, while I like the approach of this
> > > series, the overhead is looking non-negligible.
> > >
> > 
> > Thanks a lot for producing these numbers. 125k x 7k == <1 billion
> > cycles == <1 second on a >1 GHz machine, I think?
> > Or am I missing something? How long does the actual compile take?
> > 
> I ran a sequence of benchmarks that I occasionally run (pbzip,
> kernbench, and hackbench) and I also saw < 1% performance degradation,
> so I think we can trust that somewhat.  (I can post the raw numbers when
> I have ssh access to my Linux desktop - sending this from Somewhere Over
> The Atlantic).
> 
> However, my concerns with these patches are on two points:
> 
> 1. It's not a fix-all.  We still have the case where the guest expects
> the behavior of device memory (for strong ordering, for example) on a RAM
> region, which we now break.  Similarly, this doesn't support the
> non-coherent DMA to RAM region case.
> 
> 2. While the code is probably as nice as this kind of stuff gets, it
> is non-trivial and extremely difficult to debug.  The counter-point here
> is that we may end up handling other stuff at EL2 for performance reasons
> in the future.
> 
> Mainly because of point 1 above, I am leaning toward thinking userspace
> should do the invalidation when it knows it needs to, either through KVM
> via a memslot flag or through some other syscall mechanism.

I've started down the memslot flag road by promoting KVM_MEMSLOT_INCOHERENT
to uapi/KVM_MEM_INCOHERENT, replacing the readonly memslot heuristic.
With a couple more changes it should work for all memory regions with
the 'incoherent' property. I'll make some changes to QEMU to test it all
out as well. Progress was slow last week due to too many higher priority
tasks, but I plan to return to it this week.
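
For reference, a rough sketch of what the promoted flag might look like from
userspace's side. The flag value, the next free bit after KVM_MEM_READONLY, and
the struct name are assumptions for illustration (the real struct is
kvm_userspace_memory_region in <linux/kvm.h>), not the final ABI:

```c
#include <stdint.h>

/* Existing uapi flag values; KVM_MEM_INCOHERENT's value is an assumption,
 * simply taking the next free bit after KVM_MEM_READONLY. */
#define KVM_MEM_LOG_DIRTY_PAGES (1u << 0)
#define KVM_MEM_READONLY        (1u << 1)
#define KVM_MEM_INCOHERENT      (1u << 2)   /* hypothetical new flag */

/* Field layout mirrors struct kvm_userspace_memory_region. */
struct memory_region_sketch {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
};

/* Userspace marks a region (e.g. a framebuffer) that it will access with
 * attributes possibly different from the guest's, and passes the result
 * to the usual KVM_SET_USER_MEMORY_REGION ioctl. */
static struct memory_region_sketch
make_incoherent_slot(uint32_t slot, uint64_t gpa, uint64_t size, uint64_t hva)
{
	struct memory_region_sketch r = {
		.slot            = slot,
		.flags           = KVM_MEM_INCOHERENT,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = hva,
	};
	return r;
}
```

The only userspace-visible change would be OR-ing the new flag in before the
ioctl; everything else stays as today.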

Thanks,
drew


> 
> Thanks,
> -Christoffer

^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-02 16:47                       ` Paolo Bonzini
@ 2015-03-02 16:55                         ` Laszlo Ersek
  -1 siblings, 0 replies; 110+ messages in thread
From: Laszlo Ersek @ 2015-03-02 16:55 UTC (permalink / raw)
  To: Paolo Bonzini, Christoffer Dall, Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, kvmarm, linux-arm-kernel

On 03/02/15 17:47, Paolo Bonzini wrote:
> 
> Also, we may want to invalidate the cache for dirty pages before
> returning the dirty bitmap, and probably should do that directly in
> KVM_GET_DIRTY_LOG.

"I agree."

If KVM_GET_DIRTY_LOG is supposed to be atomic fetch and clear (from
userspace's aspect), then the cache invalidation should be an atomic
part of it too (from the same aspect).
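
One way to picture that atomicity: KVM would atomically snapshot-and-clear each
bitmap word, and invalidate every page that was dirty in the snapshot before the
snapshot is copied out, so userspace never observes a "dirty but not yet
invalidated" page. A minimal sketch, per 64-page bitmap word (the helper name is
made up):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically fetch-and-clear one 64-bit word of the dirty bitmap and
 * report which page frames were dirty in it.  The caller cleans and
 * invalidates those pages *before* handing the snapshot word to
 * userspace, so the clear and the invalidation look like one step from
 * userspace's aspect.  Returns the number of dirty pages found. */
static size_t harvest_dirty_word(_Atomic uint64_t *word, size_t base_pfn,
				 size_t *pfns)
{
	uint64_t snap = atomic_exchange(word, 0);	/* atomic fetch-and-clear */
	size_t n = 0;

	for (size_t bit = 0; bit < 64; bit++)
		if (snap & (1ull << bit))
			pfns[n++] = base_pfn + bit;
	return n;
}
```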

(Sorry if I just said something incredibly stupid.)

Laszlo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-02 16:55                         ` Laszlo Ersek
@ 2015-03-02 17:05                           ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-03-02 17:05 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Paolo Bonzini, kvmarm, linux-arm-kernel

On Mon, Mar 02, 2015 at 05:55:44PM +0100, Laszlo Ersek wrote:
> On 03/02/15 17:47, Paolo Bonzini wrote:
> > 
> > Also, we may want to invalidate the cache for dirty pages before
> > returning the dirty bitmap, and probably should do that directly in
> > KVM_GET_DIRTY_LOG.
> 
> "I agree."
> 
> If KVM_GET_DIRTY_LOG is supposed to be atomic fetch and clear (from
> userspace's aspect), then the cache invalidation should be an atomic
> part of it too (from the same aspect).
> 
> (Sorry if I just said something incredibly stupid.)
>

With the path I'm headed down, all cache maintenance operations will
be done before exiting to userspace (and after returning). I was
actually already letting a feature creep into this PoC by setting
KVM_MEM_LOG_DIRTY_PAGES when we see KVM_MEM_INCOHERENT has been set,
and the region isn't readonly. The dirty log would then be used by
KVM internally to know exactly which pages need to be invalidated
before the exit.
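
The implied-flag rule described above fits in a few lines; the flag values are
illustrative assumptions, not final uapi numbers:

```c
#include <stdint.h>

#define KVM_MEM_LOG_DIRTY_PAGES (1u << 0)
#define KVM_MEM_READONLY        (1u << 1)
#define KVM_MEM_INCOHERENT      (1u << 2)   /* hypothetical uapi value */

/* An incoherent, writable slot implicitly enables dirty logging, so KVM
 * knows exactly which pages need to be invalidated before exiting to
 * userspace.  A readonly slot can't be dirtied by the guest, so nothing
 * extra is implied there. */
static uint32_t effective_slot_flags(uint32_t flags)
{
	if ((flags & KVM_MEM_INCOHERENT) && !(flags & KVM_MEM_READONLY))
		flags |= KVM_MEM_LOG_DIRTY_PAGES;
	return flags;
}
```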

drew

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-02 16:31                     ` Christoffer Dall
@ 2015-03-03  2:20                       ` Mario Smarduch
  -1 siblings, 0 replies; 110+ messages in thread
From: Mario Smarduch @ 2015-03-03  2:20 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Paolo Bonzini, Laszlo Ersek, kvmarm, linux-arm-kernel

Hi Christoffer,

I don't understand how the CPU can handle the different cache attributes
used by QEMU and the guest; won't you run into the B2.9 checklist?
Wouldn't cache evictions or cleans wipe out guest updates to the same
cache line(s)?

- Mario


On 03/02/2015 08:31 AM, Christoffer Dall wrote:
> On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
>> On 24 February 2015 at 14:55, Andrew Jones <drjones@redhat.com> wrote:
>>> On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
>>>> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
>>>>> On 20 February 2015 at 14:29, Andrew Jones <drjones@redhat.com> wrote:
>>>>>> So looks like the 3 orders of magnitude greater number of traps
>>>>>> (only to el2) don't impact kernel compiles.
>>>>>>
>>>>>
>>>>> OK, good! That was what I was hoping for, obviously.
>>>>>
>>>>>> Then I thought I'd be able to quick measure the number of cycles
>>>>>> a trap to el2 takes with this kvm-unit-tests test
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>         unsigned long start, end;
>>>>>>         unsigned int sctlr;
>>>>>>
>>>>>>         asm volatile(
>>>>>>         "       mrs %0, sctlr_el1\n"
>>>>>>         "       msr pmcr_el0, %1\n"
>>>>>>         : "=&r" (sctlr) : "r" (5));
>>>>>>
>>>>>>         asm volatile(
>>>>>>         "       mrs %0, pmccntr_el0\n"
>>>>>>         "       msr sctlr_el1, %2\n"
>>>>>>         "       mrs %1, pmccntr_el0\n"
>>>>>>         : "=&r" (start), "=&r" (end) : "r" (sctlr));
>>>>>>
>>>>>>         printf("%llx\n", end - start);
>>>>>>         return 0;
>>>>>> }
>>>>>>
>>>>>> after applying this patch to kvm
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
>>>>>> index bb91b6fc63861..5de39d740aa58 100644
>>>>>> --- a/arch/arm64/kvm/hyp.S
>>>>>> +++ b/arch/arm64/kvm/hyp.S
>>>>>> @@ -770,7 +770,7 @@
>>>>>>
>>>>>>         mrs     x2, mdcr_el2
>>>>>>         and     x2, x2, #MDCR_EL2_HPMN_MASK
>>>>>> -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
>>>>>> +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
>>>>>>         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
>>>>>>
>>>>>>         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
>>>>>>
>>>>>> But I get zero for the cycle count. Not sure what I'm missing.
>>>>>>
>>>>>
>>>>> No clue tbh. Does the counter work as expected in the host?
>>>>>
>>>>
>>>> Guess not. I dropped the test into a module_init and inserted
>>>> it on the host. Always get zero for pmccntr_el0 reads. Or, if
>>>> I set it to something non-zero with a write, then I always get
>>>> that back - no increments. pmcr_el0 looks OK... I had forgotten
>>>> to set bit 31 of pmcntenset_el0, but doing that still doesn't
>>>> help. Anyway, I assume the problem is me. I'll keep looking to
>>>> see what I'm missing.
>>>>
>>>
>>> I returned to this and see that the problem was indeed me. I needed yet
>>> another enable bit set (the filter register needed to be instructed to
>>> count cycles while in el2). I've attached the code for the curious.
>>> The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
>>> running on a host without this patch series (after TVM traps have been
>>> disabled), I get a pretty consistent 40.
>>>
>>> I checked how many vm-sysreg traps we do during the kernel compile
>>> benchmark. It's 124924. So it's a bit strange that we don't see the
>>> benchmark taking 10 to 20 seconds longer on average. I should probably
>>> double check my runs. In any case, while I like the approach of this
>>> series, the overhead is looking non-negligible.
>>>
>>
>> Thanks a lot for producing these numbers. 125k x 7k == <1 billion
>> cycles == <1 second on a >1 GHz machine, I think?
>> Or am I missing something? How long does the actual compile take?
>>
> I ran a sequence of benchmarks that I occasionally run (pbzip,
> kernbench, and hackbench) and I also saw < 1% performance degradation,
> so I think we can trust that somewhat.  (I can post the raw numbers when
> I have ssh access to my Linux desktop - sending this from Somewhere Over
> The Atlantic).
> 
> However, my concerns with these patches are on two points:
> 
> 1. It's not a fix-all.  We still have the case where the guest expects
> the behavior of device memory (for strong ordering, for example) on a RAM
> region, which we now break.  Similarly, this doesn't support the
> non-coherent DMA to RAM region case.
> 
> 2. While the code is probably as nice as this kind of stuff gets, it
> is non-trivial and extremely difficult to debug.  The counter-point here
> is that we may end up handling other stuff at EL2 for performance reasons
> in the future.
> 
> Mainly because of point 1 above, I am leaning toward thinking userspace
> should do the invalidation when it knows it needs to, either through KVM
> via a memslot flag or through some other syscall mechanism.
> 
> Thanks,
> -Christoffer
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 10:54 ` Ard Biesheuvel
@ 2015-03-03 17:34   ` Alexander Graf
  -1 siblings, 0 replies; 110+ messages in thread
From: Alexander Graf @ 2015-03-03 17:34 UTC (permalink / raw)
  To: Ard Biesheuvel, lersek, christoffer.dall, marc.zyngier,
	linux-arm-kernel, peter.maydell
  Cc: kvm, kvmarm, pbonzini

On 02/19/2015 11:54 AM, Ard Biesheuvel wrote:
> This is a 0th order approximation of how we could potentially force the guest
> to avoid uncached mappings, at least from the moment the MMU is on. (Before
> that, all of memory is implicitly classified as Device-nGnRnE)
>
> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
> with cached ones. This way, there is no need to mangle any guest page tables.
>
> The downside is that, to do this correctly, we need to always trap writes to
> the VM sysreg group, which includes registers that the guest may write to very
> often. To reduce the associated performance hit, patch #1 introduces a fast path
> for EL2 to perform trivial sysreg writes on behalf of the guest, without the
> need for a full world switch to the host and back.
>
> The main purpose of these patches is to quantify the performance hit, and
> verify whether the MAIR_EL1 handling works correctly.

I gave this a quick spin on a VM running with QEMU.

   * VGA output is still distorted; I get random junk black lines
interspersed in the output
   * When I add -device nec-usb-xhci -device usb-kbd, the VM doesn't even
boot up

With TCG, both bits work fine.


Alex

>
> Ard Biesheuvel (3):
>    arm64: KVM: handle some sysreg writes in EL2
>    arm64: KVM: mangle MAIR register to prevent uncached guest mappings
>    arm64: KVM: keep trapping of VM sysreg writes enabled
>
>   arch/arm/kvm/mmu.c               |   2 +-
>   arch/arm64/include/asm/kvm_arm.h |   2 +-
>   arch/arm64/kvm/hyp.S             | 101 +++++++++++++++++++++++++++++++++++++++
>   arch/arm64/kvm/sys_regs.c        |  63 ++++++++++++++++++++----
>   4 files changed, 156 insertions(+), 12 deletions(-)
>


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 1/3] arm64: KVM: handle some sysreg writes in EL2
  2015-02-19 10:54   ` Ard Biesheuvel
@ 2015-03-03 17:59     ` Mario Smarduch
  -1 siblings, 0 replies; 110+ messages in thread
From: Mario Smarduch @ 2015-03-03 17:59 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: kvm, marc.zyngier, pbonzini, lersek, kvmarm, linux-arm-kernel

Hi Ard,
  a couple of side effects would be that guest address translation may
return attributes in the PAR register that it would not expect.
Likewise for reads of the MAIR registers.
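
For reference, the rewrite patch #2 performs on a trapped MAIR_EL1 write can be
modelled roughly as below. The attribute encodings (upper nibble 0 for the
Device types, 0x44 for Normal Inner/Outer Non-cacheable, 0xff for Normal
Write-Back Read/Write-Allocate) follow the ARMv8 ARM; exactly which encodings
the actual patch rewrites is a guess here, and as noted above, a guest that
reads MAIR_EL1 back observes the mangled value:

```c
#include <stdint.h>

/* MAIR_EL1 holds eight 8-bit memory attribute fields.  Rewrite every
 * Device attribute (upper nibble 0) and Normal Non-cacheable (0x44) to
 * Normal Write-Back (0xff), so the guest cannot create uncached Normal
 * mappings via its MAIR indices.  Other attributes pass through. */
static uint64_t mangle_mair(uint64_t mair)
{
	for (int i = 0; i < 8; i++) {
		uint64_t attr = (mair >> (i * 8)) & 0xff;

		if (attr < 0x10 || attr == 0x44)
			mair = (mair & ~(0xffull << (i * 8)))
			     |  (0xffull << (i * 8));
	}
	return mair;
}
```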

- Mario

On 02/19/2015 02:54 AM, Ard Biesheuvel wrote:
> This adds handling to el1_trap() to perform some sysreg writes directly
> in EL2, without performing the full world switch to the host and back
> again. This is mainly for doing writes that don't need special handling,
> but where the register is part of the group that we need to trap for
> other reasons.
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/kvm/hyp.S      | 101 ++++++++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/sys_regs.c |  28 ++++++++-----
>  2 files changed, 120 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> index c3ca89c27c6b..e3af6840cb3f 100644
> --- a/arch/arm64/kvm/hyp.S
> +++ b/arch/arm64/kvm/hyp.S
> @@ -26,6 +26,7 @@
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_mmu.h>
> +#include <asm/sysreg.h>
>  
>  #define CPU_GP_REG_OFFSET(x)	(CPU_GP_REGS + x)
>  #define CPU_XREG_OFFSET(x)	CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
> @@ -887,6 +888,34 @@
>  1:
>  .endm
>  
> +/*
> + * Macro to conditionally perform a parametrised system register write. Note
> + * that we currently only support writing x3 to a system register in class
> + * Op0 == 3 and Op1 == 0, which is all we need at the moment.
> + */
> +.macro	cond_sysreg_write,op0,op1,crn,crm,op2,sreg,opreg,outlbl
> +	.ifnc	\op0,3    ; .err ; .endif
> +	.ifnc	\op1,0    ; .err ; .endif
> +	.ifnc	\opreg,x3 ; .err ; .endif
> +	cmp	\sreg, #((\crm) | ((\crn) << 4) | ((\op2) << 8))
> +	bne	9999f
> +	// doesn't work: msr_s sys_reg(\op0,\op1,\crn,\crm,\op2), \opreg
> +	.inst	0xd5180003|((\crn) << 12)|((\crm) << 8)|((\op2 << 5))
> +	b	\outlbl
> +9999:
> +.endm
> +
> +/*
> + * Pack CRn, CRm and Op2 into 11 adjacent low bits so we can use a single
> + * cmp instruction to compare it with a 12-bit immediate.
> + */
> +.macro	pack_sysreg_idx, outreg, inreg
> +	ubfm	\outreg, \inreg, #(17 - 8), #(17 + 2)	// Op2 -> bits 8 - 10
> +	bfm	\outreg, \inreg, #(10 - 4), #(10 + 3)	// CRn -> bits 4 - 7
> +	bfm	\outreg, \inreg, #(1 - 0), #(1 + 3)	// CRm -> bits 0 - 3
> +.endm
> +
> +
>  __save_sysregs:
>  	save_sysregs
>  	ret
> @@ -1178,6 +1207,15 @@ el1_trap:
>  	 * x1: ESR
>  	 * x2: ESR_EC
>  	 */
> +
> +	/*
> +	 * Find out if the exception we are about to pass to the host is a
> +	 * write to a system register, which we may prefer to handle in EL2.
> +	 */
> +	tst	x1, #1				// direction == write (0) ?
> +	ccmp	x2, #ESR_EL2_EC_SYS64, #0, eq	// is a sysreg access?
> +	b.eq	4f
> +
>  	cmp	x2, #ESR_EL2_EC_DABT
>  	mov	x0, #ESR_EL2_EC_IABT
>  	ccmp	x2, x0, #4, ne
> @@ -1239,6 +1277,69 @@ el1_trap:
>  
>  	eret
>  
> +4:	and	x2, x1, #(3 << 20)		// check for Op0 == 0b11
> +	cmp	x2, #(3 << 20)
> +	b.ne	1b
> +	ands	x2, x1, #(7 << 14)		// check for Op1 == 0b000
> +	b.ne	1b
> +
> +	/*
> +	 * If we end up here, we are about to perform a system register write
> +	 * with Op0 == 0b11 and Op1 == 0b000. Move the operand to x3 first, we
> +	 * will check later if we are actually going to handle this write in EL2
> +	 */
> +	adr	x0, 5f
> +	ubfx	x2, x1, #5, #5		// operand reg# in bits 9 .. 5
> +	add	x0, x0, x2, lsl #3
> +	br	x0
> +5:	ldr	x3, [sp, #16]		// x0 from the stack
> +	b	6f
> +	ldr	x3, [sp, #24]		// x1 from the stack
> +	b	6f
> +	ldr	x3, [sp]		// x2 from the stack
> +	b	6f
> +	ldr	x3, [sp, #8]		// x3 from the stack
> +	b	6f
> +	.irp	reg,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
> +	mov	x3, x\reg
> +	b	6f
> +	.endr
> +	mov	x3, xzr			// x31
> +
> +	/*
> +	 * Ok, so now we have the desired value in x3, let's write it into the
> +	 * sysreg if it's a register write we want to handle in EL2. Since these
> +	 * are tried in order, it makes sense to put the ones used most often at
> +	 * the top.
> +	 */
> +6:	pack_sysreg_idx		x2, x1
> +	cond_sysreg_write	3,0, 2,0,0,x2,x3,7f	// TTBR0_EL1
> +	cond_sysreg_write	3,0, 2,0,1,x2,x3,7f	// TTBR1_EL1
> +	cond_sysreg_write	3,0, 2,0,2,x2,x3,7f	// TCR_EL1
> +	cond_sysreg_write	3,0, 5,2,0,x2,x3,7f	// ESR_EL1
> +	cond_sysreg_write	3,0, 6,0,0,x2,x3,7f	// FAR_EL1
> +	cond_sysreg_write	3,0, 5,1,0,x2,x3,7f	// AFSR0_EL1
> +	cond_sysreg_write	3,0, 5,1,1,x2,x3,7f	// AFSR1_EL1
> +	cond_sysreg_write	3,0,10,3,0,x2,x3,7f	// AMAIR_EL1
> +	cond_sysreg_write	3,0,13,0,1,x2,x3,7f	// CONTEXTIDR_EL1
> +
> +	/*
> +	 * If we end up here, the write is to a register that we don't handle
> +	 * in EL2. Let the host handle it instead ...
> +	 */
> +	b	1b
> +
> +	/*
> +	 * We have handled the write. Increment the pc and return to the
> +	 * guest.
> +	 */
> +7:	mrs	x0, elr_el2
> +	add	x0, x0, #4
> +	msr	elr_el2, x0
> +	pop	x2, x3
> +	pop	x0, x1
> +	eret
> +
>  el1_irq:
>  	push	x0, x1
>  	push	x2, x3
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index f31e8bb2bc5b..1e170eab6603 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -187,6 +187,16 @@ static bool trap_debug_regs(struct kvm_vcpu *vcpu,
>  	return true;
>  }
>  
> +static bool access_handled_at_el2(struct kvm_vcpu *vcpu,
> +				  const struct sys_reg_params *params,
> +				  const struct sys_reg_desc *r)
> +{
> +	kvm_debug("sys_reg write at %lx should have been handled in EL2\n",
> +		  *vcpu_pc(vcpu));
> +	print_sys_reg_instr(params);
> +	return false;
> +}
> +
>  static void reset_amair_el1(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>  {
>  	u64 amair;
> @@ -328,26 +338,26 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  	  NULL, reset_val, CPACR_EL1, 0 },
>  	/* TTBR0_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b000),
> -	  access_vm_reg, reset_unknown, TTBR0_EL1 },
> +	  access_handled_at_el2, reset_unknown, TTBR0_EL1 },
>  	/* TTBR1_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b001),
> -	  access_vm_reg, reset_unknown, TTBR1_EL1 },
> +	  access_handled_at_el2, reset_unknown, TTBR1_EL1 },
>  	/* TCR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b010),
> -	  access_vm_reg, reset_val, TCR_EL1, 0 },
> +	  access_handled_at_el2, reset_val, TCR_EL1, 0 },
>  
>  	/* AFSR0_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b000),
> -	  access_vm_reg, reset_unknown, AFSR0_EL1 },
> +	  access_handled_at_el2, reset_unknown, AFSR0_EL1 },
>  	/* AFSR1_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b001),
> -	  access_vm_reg, reset_unknown, AFSR1_EL1 },
> +	  access_handled_at_el2, reset_unknown, AFSR1_EL1 },
>  	/* ESR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0010), Op2(0b000),
> -	  access_vm_reg, reset_unknown, ESR_EL1 },
> +	  access_handled_at_el2, reset_unknown, ESR_EL1 },
>  	/* FAR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0110), CRm(0b0000), Op2(0b000),
> -	  access_vm_reg, reset_unknown, FAR_EL1 },
> +	  access_handled_at_el2, reset_unknown, FAR_EL1 },
>  	/* PAR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0111), CRm(0b0100), Op2(0b000),
>  	  NULL, reset_unknown, PAR_EL1 },
> @@ -364,7 +374,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  	  access_vm_reg, reset_unknown, MAIR_EL1 },
>  	/* AMAIR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0011), Op2(0b000),
> -	  access_vm_reg, reset_amair_el1, AMAIR_EL1 },
> +	  access_handled_at_el2, reset_amair_el1, AMAIR_EL1 },
>  
>  	/* VBAR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b1100), CRm(0b0000), Op2(0b000),
> @@ -376,7 +386,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  
>  	/* CONTEXTIDR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b001),
> -	  access_vm_reg, reset_val, CONTEXTIDR_EL1, 0 },
> +	  access_handled_at_el2, reset_val, CONTEXTIDR_EL1, 0 },
>  	/* TPIDR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b100),
>  	  NULL, reset_unknown, TPIDR_EL1 },
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC/RFT PATCH 1/3] arm64: KVM: handle some sysreg writes in EL2
@ 2015-03-03 17:59     ` Mario Smarduch
  0 siblings, 0 replies; 110+ messages in thread
From: Mario Smarduch @ 2015-03-03 17:59 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Ard,
  a couple of side effects: a guest address translation may
return attributes in the PAR register that it would not expect.
Likewise for reads of the MAIR registers.

- Mario

On 02/19/2015 02:54 AM, Ard Biesheuvel wrote:
> This adds handling to el1_trap() to perform some sysreg writes directly
> in EL2, without performing the full world switch to the host and back
> again. This is mainly for doing writes that don't need special handling,
> but where the register is part of the group that we need to trap for
> other reasons.
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/kvm/hyp.S      | 101 ++++++++++++++++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/sys_regs.c |  28 ++++++++-----
>  2 files changed, 120 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> index c3ca89c27c6b..e3af6840cb3f 100644
> --- a/arch/arm64/kvm/hyp.S
> +++ b/arch/arm64/kvm/hyp.S
> @@ -26,6 +26,7 @@
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_mmu.h>
> +#include <asm/sysreg.h>
>  
>  #define CPU_GP_REG_OFFSET(x)	(CPU_GP_REGS + x)
>  #define CPU_XREG_OFFSET(x)	CPU_GP_REG_OFFSET(CPU_USER_PT_REGS + 8*x)
> @@ -887,6 +888,34 @@
>  1:
>  .endm
>  
> +/*
> + * Macro to conditionally perform a parametrised system register write. Note
> + * that we currently only support writing x3 to a system register in class
> + * Op0 == 3 and Op1 == 0, which is all we need at the moment.
> + */
> +.macro	cond_sysreg_write,op0,op1,crn,crm,op2,sreg,opreg,outlbl
> +	.ifnc	\op0,3    ; .err ; .endif
> +	.ifnc	\op1,0    ; .err ; .endif
> +	.ifnc	\opreg,x3 ; .err ; .endif
> +	cmp	\sreg, #((\crm) | ((\crn) << 4) | ((\op2) << 8))
> +	bne	9999f
> +	// doesn't work: msr_s sys_reg(\op0,\op1,\crn,\crm,\op2), \opreg
> +	.inst	0xd5180003|((\crn) << 12)|((\crm) << 8)|((\op2 << 5))
> +	b	\outlbl
> +9999:
> +.endm
> +
> +/*
> + * Pack CRn, CRm and Op2 into 11 adjacent low bits so we can use a single
> + * cmp instruction to compare it with a 12-bit immediate.
> + */
> +.macro	pack_sysreg_idx, outreg, inreg
> +	ubfm	\outreg, \inreg, #(17 - 8), #(17 + 2)	// Op2 -> bits 8 - 10
> +	bfm	\outreg, \inreg, #(10 - 4), #(10 + 3)	// CRn -> bits 4 - 7
> +	bfm	\outreg, \inreg, #(1 - 0), #(1 + 3)	// CRm -> bits 0 - 3
> +.endm
> +
> +
>  __save_sysregs:
>  	save_sysregs
>  	ret
> @@ -1178,6 +1207,15 @@ el1_trap:
>  	 * x1: ESR
>  	 * x2: ESR_EC
>  	 */
> +
> +	/*
> +	 * Find out if the exception we are about to pass to the host is a
> +	 * write to a system register, which we may prefer to handle in EL2.
> +	 */
> +	tst	x1, #1				// direction == write (0) ?
> +	ccmp	x2, #ESR_EL2_EC_SYS64, #0, eq	// is a sysreg access?
> +	b.eq	4f
> +
>  	cmp	x2, #ESR_EL2_EC_DABT
>  	mov	x0, #ESR_EL2_EC_IABT
>  	ccmp	x2, x0, #4, ne
> @@ -1239,6 +1277,69 @@ el1_trap:
>  
>  	eret
>  
> +4:	and	x2, x1, #(3 << 20)		// check for Op0 == 0b11
> +	cmp	x2, #(3 << 20)
> +	b.ne	1b
> +	ands	x2, x1, #(7 << 14)		// check for Op1 == 0b000
> +	b.ne	1b
> +
> +	/*
> +	 * If we end up here, we are about to perform a system register write
> +	 * with Op0 == 0b11 and Op1 == 0b000. Move the operand to x3 first, we
> +	 * will check later if we are actually going to handle this write in EL2
> +	 */
> +	adr	x0, 5f
> +	ubfx	x2, x1, #5, #5		// operand reg# in bits 9 .. 5
> +	add	x0, x0, x2, lsl #3
> +	br	x0
> +5:	ldr	x3, [sp, #16]		// x0 from the stack
> +	b	6f
> +	ldr	x3, [sp, #24]		// x1 from the stack
> +	b	6f
> +	ldr	x3, [sp]		// x2 from the stack
> +	b	6f
> +	ldr	x3, [sp, #8]		// x3 from the stack
> +	b	6f
> +	.irp	reg,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
> +	mov	x3, x\reg
> +	b	6f
> +	.endr
> +	mov	x3, xzr			// x31
> +
> +	/*
> +	 * Ok, so now we have the desired value in x3, let's write it into the
> +	 * sysreg if it's a register write we want to handle in EL2. Since these
> +	 * are tried in order, it makes sense to put the ones used most often at
> +	 * the top.
> +	 */
> +6:	pack_sysreg_idx		x2, x1
> +	cond_sysreg_write	3,0, 2,0,0,x2,x3,7f	// TTBR0_EL1
> +	cond_sysreg_write	3,0, 2,0,1,x2,x3,7f	// TTBR1_EL1
> +	cond_sysreg_write	3,0, 2,0,2,x2,x3,7f	// TCR_EL1
> +	cond_sysreg_write	3,0, 5,2,0,x2,x3,7f	// ESR_EL1
> +	cond_sysreg_write	3,0, 6,0,0,x2,x3,7f	// FAR_EL1
> +	cond_sysreg_write	3,0, 5,1,0,x2,x3,7f	// AFSR0_EL1
> +	cond_sysreg_write	3,0, 5,1,1,x2,x3,7f	// AFSR1_EL1
> +	cond_sysreg_write	3,0,10,3,0,x2,x3,7f	// AMAIR_EL1
> +	cond_sysreg_write	3,0,13,0,1,x2,x3,7f	// CONTEXTIDR_EL1
> +
> +	/*
> +	 * If we end up here, the write is to a register that we don't handle
> +	 * in EL2. Let the host handle it instead ...
> +	 */
> +	b	1b
> +
> +	/*
> +	 * We have handled the write. Increment the pc and return to the
> +	 * guest.
> +	 */
> +7:	mrs	x0, elr_el2
> +	add	x0, x0, #4
> +	msr	elr_el2, x0
> +	pop	x2, x3
> +	pop	x0, x1
> +	eret
> +
>  el1_irq:
>  	push	x0, x1
>  	push	x2, x3
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index f31e8bb2bc5b..1e170eab6603 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -187,6 +187,16 @@ static bool trap_debug_regs(struct kvm_vcpu *vcpu,
>  	return true;
>  }
>  
> +static bool access_handled_at_el2(struct kvm_vcpu *vcpu,
> +				  const struct sys_reg_params *params,
> +				  const struct sys_reg_desc *r)
> +{
> +	kvm_debug("sys_reg write at %lx should have been handled in EL2\n",
> +		  *vcpu_pc(vcpu));
> +	print_sys_reg_instr(params);
> +	return false;
> +}
> +
>  static void reset_amair_el1(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>  {
>  	u64 amair;
> @@ -328,26 +338,26 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  	  NULL, reset_val, CPACR_EL1, 0 },
>  	/* TTBR0_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b000),
> -	  access_vm_reg, reset_unknown, TTBR0_EL1 },
> +	  access_handled_at_el2, reset_unknown, TTBR0_EL1 },
>  	/* TTBR1_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b001),
> -	  access_vm_reg, reset_unknown, TTBR1_EL1 },
> +	  access_handled_at_el2, reset_unknown, TTBR1_EL1 },
>  	/* TCR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b010),
> -	  access_vm_reg, reset_val, TCR_EL1, 0 },
> +	  access_handled_at_el2, reset_val, TCR_EL1, 0 },
>  
>  	/* AFSR0_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b000),
> -	  access_vm_reg, reset_unknown, AFSR0_EL1 },
> +	  access_handled_at_el2, reset_unknown, AFSR0_EL1 },
>  	/* AFSR1_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b001),
> -	  access_vm_reg, reset_unknown, AFSR1_EL1 },
> +	  access_handled_at_el2, reset_unknown, AFSR1_EL1 },
>  	/* ESR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0010), Op2(0b000),
> -	  access_vm_reg, reset_unknown, ESR_EL1 },
> +	  access_handled_at_el2, reset_unknown, ESR_EL1 },
>  	/* FAR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0110), CRm(0b0000), Op2(0b000),
> -	  access_vm_reg, reset_unknown, FAR_EL1 },
> +	  access_handled_at_el2, reset_unknown, FAR_EL1 },
>  	/* PAR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b0111), CRm(0b0100), Op2(0b000),
>  	  NULL, reset_unknown, PAR_EL1 },
> @@ -364,7 +374,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  	  access_vm_reg, reset_unknown, MAIR_EL1 },
>  	/* AMAIR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0011), Op2(0b000),
> -	  access_vm_reg, reset_amair_el1, AMAIR_EL1 },
> +	  access_handled_at_el2, reset_amair_el1, AMAIR_EL1 },
>  
>  	/* VBAR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b1100), CRm(0b0000), Op2(0b000),
> @@ -376,7 +386,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>  
>  	/* CONTEXTIDR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b001),
> -	  access_vm_reg, reset_val, CONTEXTIDR_EL1, 0 },
> +	  access_handled_at_el2, reset_val, CONTEXTIDR_EL1, 0 },
>  	/* TPIDR_EL1 */
>  	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b100),
>  	  NULL, reset_unknown, TPIDR_EL1 },
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-03 17:34   ` Alexander Graf
@ 2015-03-03 18:13     ` Laszlo Ersek
  -1 siblings, 0 replies; 110+ messages in thread
From: Laszlo Ersek @ 2015-03-03 18:13 UTC (permalink / raw)
  To: Alexander Graf, Ard Biesheuvel, christoffer.dall, marc.zyngier,
	linux-arm-kernel, peter.maydell
  Cc: pbonzini, kvmarm, kvm

On 03/03/15 18:34, Alexander Graf wrote:
> On 02/19/2015 11:54 AM, Ard Biesheuvel wrote:
>> This is a 0th order approximation of how we could potentially force
>> the guest
>> to avoid uncached mappings, at least from the moment the MMU is on.
>> (Before
>> that, all of memory is implicitly classified as Device-nGnRnE)
>>
>> The idea (patch #2) is to trap writes to MAIR_EL1, and replace
>> uncached mappings
>> with cached ones. This way, there is no need to mangle any guest page
>> tables.
>>
>> The downside is that, to do this correctly, we need to always trap
>> writes to
>> the VM sysreg group, which includes registers that the guest may write
>> to very
>> often. To reduce the associated performance hit, patch #1 introduces a
>> fast path
>> for EL2 to perform trivial sysreg writes on behalf of the guest,
>> without the
>> need for a full world switch to the host and back.
>>
>> The main purpose of these patches is to quantify the performance hit, and
>> verify whether the MAIR_EL1 handling works correctly.
> 
> I gave this a quick spin on a VM running with QEMU.
> 
>   * VGA output is still distorted, I get random junk black lines in the
> output in between
>   * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even
> boot up

Do you also have the dirty page tracking patches in your host kernel? I
needed both (and got them via Drew's backport, thanks) and then both VGA
and USB started working fine.

Without the MAIR patches, I got cache-line size "random" corruptions in
the VGA display (16 pixel wide small segments). Without dirty page
tracking, big chunks (sometimes even almost the entire screen) were blank.

Regarding USB, unless you have both of the patchsets in the host kernel,
the guest will indeed crash early during boot. Gerd confirmed for me
that "usb controller (all uhci/ehci/xhci) pci regions see both read
(status bits) and write (control bits) access". So if there's any
corruption in there, on read, that looks like a malfunctioning piece of
hw for the guest kernel, and in this case it happens to crash.

> With TCG, both bits work fine.

Yep.

Thanks
Laszlo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-02-19 10:54 ` Ard Biesheuvel
@ 2015-03-03 18:32   ` Catalin Marinas
  -1 siblings, 0 replies; 110+ messages in thread
From: Catalin Marinas @ 2015-03-03 18:32 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: lersek, christoffer.dall, marc.zyngier, linux-arm-kernel,
	peter.maydell, pbonzini, kvmarm, kvm, agraf

On Thu, Feb 19, 2015 at 10:54:43AM +0000, Ard Biesheuvel wrote:
> This is a 0th order approximation of how we could potentially force the guest
> to avoid uncached mappings, at least from the moment the MMU is on. (Before
> that, all of memory is implicitly classified as Device-nGnRnE)

That's just for data accesses. IIRC instructions are cacheable on ARMv8
(though I think without allocation in the unified caches).

> The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings
> with cached ones. This way, there is no need to mangle any guest page tables.

There is another big downside to this - breaking the guest assumptions
about the (non-)cacheability of its mappings. It also only works for
guests that use MAIR_EL1 (LPAE).

We have two main cases where the guest and host cacheability do not
match:

1. During boot, as you said, when the MMU is off. What we have done in
   the guest kernel is to invalidate the data ranges that it writes with
   the MMU off in case there were any speculatively loaded cache lines
   via the cacheable mappings (in the host). We don't have any nice
   solution in the host here and MAIR_EL1 tweaking does not work

2. Guest explicitly creating a non-cacheable mapping (MMU enabled). Here
   we have two sub-cases:
   a) guest-only accesses to such mapping. The guest would need to
      perform cache maintenance as required if it ever accesses such
      memory via cacheable mappings (we do this already, see the
      streaming DMA API)
   b) memory shared with the host: e.g Qemu emulating DMA (frame buffer
      etc.)

This 2.b case is not any different than the OS dealing with a
(non-)coherent DMA-capable device. If the device is coherent, the
DMA buffer in the guest must be coherent as well, otherwise
non-coherent. Imagine a real VGA device that always snoops CPU caches.
You would not create a non-cacheable frame buffer mapping since the
device cannot see the updates and only read stale cache entries.

We don't (can't) have a safe set of DMA ops that would work in both
cases. So if Qemu cannot use a non-cacheable mapping or cannot perform
cache maintenance, the only solution is to tell the guest that such
virtual device is cache _coherent_. This also gives you better
performance overall anyway.

-- 
Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-03 18:13     ` Laszlo Ersek
@ 2015-03-03 20:58       ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-03-03 20:58 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: kvm, Ard Biesheuvel, marc.zyngier, pbonzini, kvmarm, linux-arm-kernel

On Tue, Mar 03, 2015 at 07:13:48PM +0100, Laszlo Ersek wrote:
> On 03/03/15 18:34, Alexander Graf wrote:
> > On 02/19/2015 11:54 AM, Ard Biesheuvel wrote:
> >> This is a 0th order approximation of how we could potentially force
> >> the guest
> >> to avoid uncached mappings, at least from the moment the MMU is on.
> >> (Before
> >> that, all of memory is implicitly classified as Device-nGnRnE)
> >>
> >> The idea (patch #2) is to trap writes to MAIR_EL1, and replace
> >> uncached mappings
> >> with cached ones. This way, there is no need to mangle any guest page
> >> tables.
> >>
> >> The downside is that, to do this correctly, we need to always trap
> >> writes to
> >> the VM sysreg group, which includes registers that the guest may write
> >> to very
> >> often. To reduce the associated performance hit, patch #1 introduces a
> >> fast path
> >> for EL2 to perform trivial sysreg writes on behalf of the guest,
> >> without the
> >> need for a full world switch to the host and back.
> >>
> >> The main purpose of these patches is to quantify the performance hit, and
> >> verify whether the MAIR_EL1 handling works correctly.
> > 
> > I gave this a quick spin on a VM running with QEMU.
> > 
> >   * VGA output is still distorted, I get random junk black lines in the
> > output in between
> >   * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even
> > boot up
> 
> Do you also have the dirty page tracking patches in your host kernel? I
> needed both (and got them via Drew's backport, thanks) and then both VGA
> and USB started working fine.

Assuming you have the dirty page tracking already, then you're probably
missing the fixup to patch 2/3, s/0xbb/0xff/

> 
> Without the MAIR patches, I got cache-line size "random" corruptions in
> the VGA display (16 pixel wide small segments). Without dirty page
> tracking, big chunks (sometimes even almost the entire screen) was blank.
> 
> Regarding USB, unless you have both of the patchsets in the host kernel,
> the guest will indeed crash early during boot. Gerd confirmed for me
> that "usb controller (all uhci/ehci/xhci) pci regions see both read
> (status bits) and write (control bits) access". So if there's any
> corruption in there, on read, that looks like a malfunctioning piece of
> hw for the guest kernel, and in this case it happens to crash.
> 
> > With TCG, both bits work fine.
> 
> Yep.
> 
> Thanks
> Laszlo
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-03  2:20                       ` Mario Smarduch
@ 2015-03-04 11:35                         ` Catalin Marinas
  -1 siblings, 0 replies; 110+ messages in thread
From: Catalin Marinas @ 2015-03-04 11:35 UTC (permalink / raw)
  To: Mario Smarduch
  Cc: Christoffer Dall, KVM devel mailing list, Ard Biesheuvel,
	Marc Zyngier, Paolo Bonzini, Laszlo Ersek, kvmarm,
	linux-arm-kernel

(please try to avoid top-posting)

On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote:
> On 03/02/2015 08:31 AM, Christoffer Dall wrote:
> > However, my concern with these patches are on two points:
> > 
> > 1. It's not a fix-all.  We still have the case where the guest expects
> > the behavior of device memory (for strong ordering for example) on a RAM
> > region, which we now break.  Similarly this doesn't support the
> > non-coherent DMA to RAM region case.
> > 
> > 2. While the code is probably as nice as this kind of stuff gets, it
> > is non-trivial and extremely difficult to debug.  The counter-point here
> > is that we may end up handling other stuff at EL2 for performance reasons
> > in the future.
> > 
> > Mainly because of point 1 above, I am leaning to thinking userspace
> > should do the invalidation when it knows it needs to, either through KVM
> > via a memslot flag or through some other syscall mechanism.

I expressed my concerns as well, I'm definitely against merging this
series.

> I don't understand how can the CPU handle different cache attributes
> used by QEMU and Guest won't you run into B2.9 checklist? Wouldn't
> cache evictions or cleans wipe out guest updates to same cache
> line(s)?

"Clean+invalidate" is a safe operation even if the guest accesses the
memory in a cacheable way. But if the guest can update the cache lines,
Qemu should avoid cache maintenance from a performance perspective.

The guest is either told that the DMA is coherent (via DT properties) or
Qemu deals with (non-)coherency itself. The latter is fully in line with
the B2.9 chapter in the ARM ARM, more precisely point 5:

  If the mismatched attributes for a memory location all assign the same
  shareability attribute to the location, any loss of uniprocessor
  semantics or coherency within a shareability domain can be avoided by
  use of software cache management.

... it continues with what kind of cache maintenance is required,
together with:

  A clean and invalidate instruction can be used instead of a clean
  instruction, or instead of an invalidate instruction.

-- 
Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 11:35                         ` Catalin Marinas
@ 2015-03-04 11:50                           ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-03-04 11:50 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: KVM devel mailing list, Marc Zyngier, linux-arm-kernel,
	Paolo Bonzini, Laszlo Ersek, kvmarm

On 4 March 2015 at 12:35, Catalin Marinas <catalin.marinas@arm.com> wrote:
> (please try to avoid top-posting)
>
> On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote:
>> On 03/02/2015 08:31 AM, Christoffer Dall wrote:
>> > However, my concern with these patches are on two points:
>> >
>> > 1. It's not a fix-all.  We still have the case where the guest expects
>> > the behavior of device memory (for strong ordering for example) on a RAM
>> > region, which we now break.  Similarly this doesn't support the
>> > non-coherent DMA to RAM region case.
>> >
>> > 2. While the code is probably as nice as this kind of stuff gets, it
>> > is non-trivial and extremely difficult to debug.  The counter-point here
>> > is that we may end up handling other stuff at EL2 for performance reasons
>> > in the future.
>> >
>> > Mainly because of point 1 above, I am leaning to thinking userspace
>> > should do the invalidation when it knows it needs to, either through KVM
>> > via a memslot flag or through some other syscall mechanism.
>
> I expressed my concerns as well, I'm definitely against merging this
> series.
>

Don't worry, that was never the intention, at least not as-is :-)

I think we have established that the performance hit is not the
problem but the correctness is.

I do have a remaining question, though: my original [non-working]
approach was to replace uncached mappings with write-through
read-allocate write-allocate, which I expected would keep the caches
in sync with main memory, but apparently I am misunderstanding
something here. (This is the reason for s/0xbb/0xff/ in patch #2 to
get it to work: it replaces WT/RA/WA with WB/RA/WA)

Is there no way to use write-through caching here?

>> I don't understand how the CPU can handle different cache attributes
>> used by QEMU and the guest; won't you run into the B2.9 checklist? Wouldn't
>> cache evictions or cleans wipe out guest updates to the same cache
>> line(s)?
>
> "Clean+invalidate" is a safe operation even if the guest accesses the
> memory in a cacheable way. But if the guest can update the cache lines,
> Qemu should avoid cache maintenance from a performance perspective.
>
> The guest is either told that the DMA is coherent (via DT properties) or
> Qemu deals with (non-)coherency itself. The latter is fully in line with
> the B2.9 chapter in the ARM ARM, more precisely point 5:
>
>   If the mismatched attributes for a memory location all assign the same
>   shareability attribute to the location, any loss of uniprocessor
>   semantics or coherency within a shareability domain can be avoided by
>   use of software cache management.
>
> ... it continues with what kind of cache maintenance is required,
> together with:
>
>   A clean and invalidate instruction can be used instead of a clean
>   instruction, or instead of an invalidate instruction.
>
> --
> Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 11:50                           ` Ard Biesheuvel
@ 2015-03-04 12:29                             ` Catalin Marinas
  -1 siblings, 0 replies; 110+ messages in thread
From: Catalin Marinas @ 2015-03-04 12:29 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, linux-arm-kernel,
	Paolo Bonzini, Laszlo Ersek, kvmarm, Christoffer Dall,
	Mario Smarduch

On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote:
> On 4 March 2015 at 12:35, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote:
> >> On 03/02/2015 08:31 AM, Christoffer Dall wrote:
> >> > However, my concern with these patches are on two points:
> >> >
> >> > 1. It's not a fix-all.  We still have the case where the guest expects
> >> > the behavior of device memory (for strong ordering for example) on a RAM
> >> > region, which we now break.  Similarly this doesn't support the
> >> > non-coherent DMA to RAM region case.
> >> >
> >> > 2. While the code is probably as nice as this kind of stuff gets, it
> >> > is non-trivial and extremely difficult to debug.  The counter-point here
> >> > is that we may end up handling other stuff at EL2 for performance reasons
> >> > in the future.
> >> >
> >> > Mainly because of point 1 above, I am leaning to thinking userspace
> >> > should do the invalidation when it knows it needs to, either through KVM
> >> > via a memslot flag or through some other syscall mechanism.
> >
> > I expressed my concerns as well, I'm definitely against merging this
> > series.
> 
> Don't worry, that was never the intention, at least not as-is :-)

I wasn't worried, just wanted to make my position clearer ;).

> I think we have established that the performance hit is not the
> problem but the correctness is.

I haven't looked at the performance figures but has anyone assessed the
hit caused by doing cache maintenance in Qemu vs cacheable guest
accesses (and no maintenance)?

> I do have a remaining question, though: my original [non-working]
> approach was to replace uncached mappings with write-through
> read-allocate write-allocate,

Does it make sense to have write-through and write-allocate at the same
time? The write-allocate hint would probably be ignored as write-through
writes do not generate linefills.

> which I expected would keep the caches
> in sync with main memory, but apparently I am misunderstanding
> something here. (This is the reason for s/0xbb/0xff/ in patch #2 to
> get it to work: it replaces WT/RA/WA with WB/RA/WA)
> 
> Is there no way to use write-through caching here?

Write-through is considered non-cacheable from a write perspective when
it does not hit in the cache. AFAIK, it should still be able to hit
existing cache lines and evict. The ARM ARM states that cache cleaning
to _PoU_ is not required for coherency when the writes are to
write-through memory but I have to dig further into the PoC because
that's what we care about here.

What platform did you test it on? I can't tell what the behaviour of
system caches is. I know they intercept explicit cache maintenance by VA
but not sure what happens to write-through writes when they hit in the
system cache (are they evicted to RAM or not?). If such write-through
writes are only evicted to the point-of-unification, they won't work
since non-cacheable accesses go all the way to PoC.

I need to do more reading through the ARM ARM, it should be hidden
somewhere ;).

-- 
Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 12:29                             ` Catalin Marinas
@ 2015-03-04 12:43                               ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-03-04 12:43 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: KVM devel mailing list, Marc Zyngier, linux-arm-kernel,
	Paolo Bonzini, Laszlo Ersek, kvmarm, Christoffer Dall,
	Mario Smarduch

On 4 March 2015 at 13:29, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote:
>> On 4 March 2015 at 12:35, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote:
>> >> On 03/02/2015 08:31 AM, Christoffer Dall wrote:
>> >> > However, my concern with these patches are on two points:
>> >> >
>> >> > 1. It's not a fix-all.  We still have the case where the guest expects
>> >> > the behavior of device memory (for strong ordering for example) on a RAM
> >> >> > region, which we now break.  Similarly this doesn't support the
>> >> > non-coherent DMA to RAM region case.
>> >> >
>> >> > 2. While the code is probably as nice as this kind of stuff gets, it
>> >> > is non-trivial and extremely difficult to debug.  The counter-point here
> >> >> > is that we may end up handling other stuff at EL2 for performance reasons
>> >> > in the future.
>> >> >
>> >> > Mainly because of point 1 above, I am leaning to thinking userspace
>> >> > should do the invalidation when it knows it needs to, either through KVM
>> >> > via a memslot flag or through some other syscall mechanism.
>> >
>> > I expressed my concerns as well, I'm definitely against merging this
>> > series.
>>
>> Don't worry, that was never the intention, at least not as-is :-)
>
> I wasn't worried, just wanted to make my position clearer ;).
>
>> I think we have established that the performance hit is not the
>> problem but the correctness is.
>
> I haven't looked at the performance figures but has anyone assessed the
> hit caused by doing cache maintenance in Qemu vs cacheable guest
> accesses (and no maintenance)?
>

No, I don't think so. The performance hit I am referring to is the
performance hit caused by leaving the trapping of VM system register
writes enabled all the time, so that writes to MAIR_EL1 are always
caught. This is why patch #1 implements some of the sysreg write
handling in EL2.

>> I do have a remaining question, though: my original [non-working]
>> approach was to replace uncached mappings with write-through
>> read-allocate write-allocate,
>
> Does it make sense to have write-through and write-allocate at the same
> time? The write-allocate hint would probably be ignored as write-through
> writes do not generate linefills.
>

OK, that answers my question then. The availability of a
write-allocate setting on write-through attributes suggested to me
that writes would go to both the cache and main memory, so that the
write-back cached attribute the host is using for the same memory
would not result in it reading stale data.

>> which I expected would keep the caches
>> in sync with main memory, but apparently I am misunderstanding
>> something here. (This is the reason for s/0xbb/0xff/ in patch #2 to
>> get it to work: it replaces WT/RA/WA with WB/RA/WA)
>>
>> Is there no way to use write-through caching here?
>
> Write-through is considered non-cacheable from a write perspective when
> it does not hit in the cache. AFAIK, it should still be able to hit
> existing cache lines and evict. The ARM ARM states that cache cleaning
> to _PoU_ is not required for coherency when the writes are to
> write-through memory but I have to dig further into the PoC because
> that's what we care about here.
>
> What platform did you test it on? I can't tell what the behaviour of
> system caches is. I know they intercept explicit cache maintenance by VA
> but not sure what happens to write-through writes when they hit in the
> system cache (are they evicted to RAM or not?). If such write-through
> writes are only evicted to the point-of-unification, they won't work
> since non-cacheable accesses go all the way to PoC.
>

This was tested on APM, by Drew and Laszlo (thanks, guys).

I have recently received a Seattle myself, but I haven't had time yet
to test these patches myself.

> I need to do more reading through the ARM ARM,

If you say so :-)

> it should be hidden
> somewhere ;).
>

-- 
Ard.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 12:43                               ` Ard Biesheuvel
@ 2015-03-04 14:12                                 ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-03-04 14:12 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: KVM devel mailing list, Marc Zyngier, Catalin Marinas,
	Paolo Bonzini, Laszlo Ersek, kvmarm, linux-arm-kernel

On Wed, Mar 04, 2015 at 01:43:02PM +0100, Ard Biesheuvel wrote:
> On 4 March 2015 at 13:29, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote:
> >> On 4 March 2015 at 12:35, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> > On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote:
> >> >> On 03/02/2015 08:31 AM, Christoffer Dall wrote:
> >> >> > However, my concern with these patches are on two points:
> >> >> >
> >> >> > 1. It's not a fix-all.  We still have the case where the guest expects
> >> >> > the behavior of device memory (for strong ordering for example) on a RAM
> >> >> > region, which we now break.  Similarly this doesn't support the
> >> >> > non-coherent DMA to RAM region case.
> >> >> >
> >> >> > 2. While the code is probably as nice as this kind of stuff gets, it
> >> >> > is non-trivial and extremely difficult to debug.  The counter-point here
> >> >> > is that we may end up handling other stuff at EL2 for performance reasons
> >> >> > in the future.
> >> >> >
> >> >> > Mainly because of point 1 above, I am leaning to thinking userspace
> >> >> > should do the invalidation when it knows it needs to, either through KVM
> >> >> > via a memslot flag or through some other syscall mechanism.
> >> >
> >> > I expressed my concerns as well, I'm definitely against merging this
> >> > series.
> >>
> >> Don't worry, that was never the intention, at least not as-is :-)
> >
> > I wasn't worried, just wanted to make my position clearer ;).
> >
> >> I think we have established that the performance hit is not the
> >> problem but the correctness is.
> >
> > I haven't looked at the performance figures but has anyone assessed the
> > hit caused by doing cache maintenance in Qemu vs cacheable guest
> > accesses (and no maintenance)?
> >

I'm working on a PoC of a QEMU/KVM cache maintenance approach now.
Hopefully I'll send it out this evening. Tomorrow at the latest.
Getting numbers of that approach vs. a guest's use of cached memory
for devices would take a decent amount of additional work, so won't
be part of that post. I'm actually not sure we should care about
the numbers for a guest using normal mem attributes for device
memory - other than out of curiosity. For correctness this issue
really needs to be solved 100% host-side. We can't rely on a
guest to do different/weird things, just because it's a guest.
Ideally guests don't even know that they're guests. (Even if we
describe the memory as cache-able to the guest, I don't think we
can rely on the guest believing us.)

drew

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 14:12                                 ` Andrew Jones
@ 2015-03-04 14:29                                   ` Catalin Marinas
  -1 siblings, 0 replies; 110+ messages in thread
From: Catalin Marinas @ 2015-03-04 14:29 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Ard Biesheuvel, KVM devel mailing list, Marc Zyngier,
	Paolo Bonzini, Laszlo Ersek, kvmarm, linux-arm-kernel

On Wed, Mar 04, 2015 at 03:12:12PM +0100, Andrew Jones wrote:
> On Wed, Mar 04, 2015 at 01:43:02PM +0100, Ard Biesheuvel wrote:
> > On 4 March 2015 at 13:29, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote:
> > >> I think we have established that the performance hit is not the
> > >> problem but the correctness is.
> > >
> > > I haven't looked at the performance figures but has anyone assessed the
> > > hit caused by doing cache maintenance in Qemu vs cacheable guest
> > > accesses (and no maintenance)?
> 
> I'm working on a PoC of a QEMU/KVM cache maintenance approach now.
> Hopefully I'll send it out this evening. Tomorrow at the latest.
> Getting numbers of that approach vs. a guest's use of cached memory
> for devices would take a decent amount of additional work, so won't
> be part of that post.

OK.

> I'm actually not sure we should care about
> the numbers for a guest using normal mem attributes for device
> memory - other than out of curiosity. For correctness this issue
> really needs to be solved 100% host-side. We can't rely on a
> guest to do different/weird things, just because it's a guest.
> Ideally guests don't even know that they're guests. (Even if we
> describe the memory as cache-able to the guest, I don't think we
> can rely on the guest believing us.)

I disagree it is 100% a host-side issue. It is a host-side issue _if_
the host tells the guest that the (virtual) device is non-coherent (or,
more precisely, it does not explicitly tell the guest that the device is
coherent). If the guest thinks the (virtual) device is non-coherent
because of information passed by the host, I fully agree that the host
needs to manage the cache coherency.

However, the host could also pass a "dma-coherent" property in the DT
given to the guest and avoid any form of cache maintenance. If the guest
does not honour such coherency property, it's a guest problem and it
needs fixing in the guest. This isn't any different from a real physical
device behaviour.

(there are counter-arguments for the latter as well, like emulating old
platforms that never had coherency, but from a performance/production
perspective I strongly recommend that guests are passed the
"dma-coherent" property for such virtual devices)
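To make the suggestion concrete, a guest DT node for a virtual device would
just carry the boolean property. The fragment below is purely illustrative
(the node name, addresses and interrupt number are invented, not taken from
any real board or Qemu machine model):

```dts
/* Hypothetical virtio-mmio node in a guest DT; reg and interrupts
 * values are invented for illustration only. */
virtio_block@a000000 {
	compatible = "virtio,mmio";
	reg = <0x0 0x0a000000 0x0 0x200>;
	interrupts = <0 16 1>;
	dma-coherent;	/* guest may skip cache maintenance for DMA */
};
```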

-- 
Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 14:29                                   ` Catalin Marinas
@ 2015-03-04 14:34                                     ` Peter Maydell
  -1 siblings, 0 replies; 110+ messages in thread
From: Peter Maydell @ 2015-03-04 14:34 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Jones, KVM devel mailing list, Ard Biesheuvel,
	Marc Zyngier, Paolo Bonzini, Laszlo Ersek, kvmarm,
	linux-arm-kernel

On 4 March 2015 at 23:29, Catalin Marinas <catalin.marinas@arm.com> wrote:
> I disagree it is 100% a host-side issue. It is a host-side issue _if_
> the host tells the guest that the (virtual) device is non-coherent (or,
> more precisely, it does not explicitly tell the guest that the device is
> coherent). If the guest thinks the (virtual) device is non-coherent
> because of information passed by the host, I fully agree that the host
> needs to manage the cache coherency.
>
> However, the host could also pass a "dma-coherent" property in the DT
> given to the guest and avoid any form of cache maintenance. If the guest
> does not honour such coherency property, it's a guest problem and it
> needs fixing in the guest. This isn't any different from a real physical
> device behaviour.

Right, and we should do that for things like virtio, because we want
the performance. But we also have devices (like vga framebuffers)
which shouldn't be handled as cacheable, so we need to be able to
deal with both situations.

-- PMM

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 14:29                                   ` Catalin Marinas
@ 2015-03-04 17:03                                     ` Paolo Bonzini
  -1 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-03-04 17:03 UTC (permalink / raw)
  To: Catalin Marinas, Andrew Jones
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Laszlo Ersek, kvmarm, linux-arm-kernel



On 04/03/2015 15:29, Catalin Marinas wrote:
> I disagree it is 100% a host-side issue. It is a host-side issue _if_
> the host tells the guest that the (virtual) device is non-coherent (or,
> more precisely, it does not explicitly tell the guest that the device is
> coherent). If the guest thinks the (virtual) device is non-coherent
> because of information passed by the host, I fully agree that the host
> needs to manage the cache coherency.
> 
> However, the host could also pass a "dma-coherent" property in the DT
> given to the guest and avoid any form of cache maintenance. If the guest
> does not honour such coherency property, it's a guest problem and it
> needs fixing in the guest. This isn't any different from a real physical
> device behaviour.

Can you add that property to the device tree for PCI devices too?

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 17:03                                     ` Paolo Bonzini
@ 2015-03-04 17:28                                       ` Catalin Marinas
  -1 siblings, 0 replies; 110+ messages in thread
From: Catalin Marinas @ 2015-03-04 17:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andrew Jones, KVM devel mailing list, Ard Biesheuvel,
	Marc Zyngier, Laszlo Ersek, kvmarm, linux-arm-kernel

On Wed, Mar 04, 2015 at 06:03:11PM +0100, Paolo Bonzini wrote:
> 
> 
> On 04/03/2015 15:29, Catalin Marinas wrote:
> > I disagree it is 100% a host-side issue. It is a host-side issue _if_
> > the host tells the guest that the (virtual) device is non-coherent (or,
> > more precisely, it does not explicitly tell the guest that the device is
> > coherent). If the guest thinks the (virtual) device is non-coherent
> > because of information passed by the host, I fully agree that the host
> > needs to manage the cache coherency.
> > 
> > However, the host could also pass a "dma-coherent" property in the DT
> > given to the guest and avoid any form of cache maintenance. If the guest
> > does not honour such coherency property, it's a guest problem and it
> > needs fixing in the guest. This isn't any different from a real physical
> > device behaviour.
> 
> Can you add that property to the device tree for PCI devices too?

Yes but not with mainline yet:

http://thread.gmane.org/gmane.linux.kernel.iommu/8935

We can add the property at the PCI host bridge level (with the drawback
that it covers all the PCI devices), like here:

Documentation/devicetree/bindings/pci/xgene-pci.txt

-- 
Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 17:28                                       ` Catalin Marinas
@ 2015-03-05 10:12                                         ` Paolo Bonzini
  -1 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-03-05 10:12 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Laszlo Ersek, kvmarm, linux-arm-kernel



On 04/03/2015 18:28, Catalin Marinas wrote:
>> > 
>> > Can you add that property to the device tree for PCI devices too?
> Yes but not with mainline yet:
> 
> http://thread.gmane.org/gmane.linux.kernel.iommu/8935
> 
> We can add the property at the PCI host bridge level (with the drawback
> that it covers all the PCI devices), like here:

Even covering all PCI devices is not enough if we want to support device
assignment of PCI host devices.  I'd rather not spend effort on a
solution that we know will only half-work a few months down the road.

Paolo

> Documentation/devicetree/bindings/pci/xgene-pci.txt

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-05 10:12                                         ` Paolo Bonzini
@ 2015-03-05 11:04                                           ` Catalin Marinas
  -1 siblings, 0 replies; 110+ messages in thread
From: Catalin Marinas @ 2015-03-05 11:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Laszlo Ersek, kvmarm, linux-arm-kernel

On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote:
> On 04/03/2015 18:28, Catalin Marinas wrote:
> >> > Can you add that property to the device tree for PCI devices too?
> >
> > Yes but not with mainline yet:
> > 
> > http://thread.gmane.org/gmane.linux.kernel.iommu/8935
> > 
> > We can add the property at the PCI host bridge level (with the drawback
> > that it covers all the PCI devices), like here:
> 
> Even covering all PCI devices is not enough if we want to support device
> assignment of PCI host devices. 

Can we not have another PCI bridge node in the DT for the host device
assignments?

> I'd rather not spend effort on a solution that we know will only
> half-work a few months down the road.

Which of the solutions are you referring to? On the Applied Micro
boards, for example, the PCIe is coherent. If you do device assignment,
the guest must be aware of the coherency property of the physical device
and behave accordingly, there isn't much the host can do about it.

-- 
Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-05 11:04                                           ` Catalin Marinas
@ 2015-03-05 11:52                                             ` Peter Maydell
  -1 siblings, 0 replies; 110+ messages in thread
From: Peter Maydell @ 2015-03-05 11:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Paolo Bonzini, KVM devel mailing list, Ard Biesheuvel,
	Marc Zyngier, Laszlo Ersek, kvmarm, linux-arm-kernel

On 5 March 2015 at 20:04, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote:
>> On 04/03/2015 18:28, Catalin Marinas wrote:
>> >> > Can you add that property to the device tree for PCI devices too?
>> >
>> > Yes but not with mainline yet:
>> >
>> > http://thread.gmane.org/gmane.linux.kernel.iommu/8935
>> >
>> > We can add the property at the PCI host bridge level (with the drawback
>> > that it covers all the PCI devices), like here:
>>
>> Even covering all PCI devices is not enough if we want to support device
>> assignment of PCI host devices.
>
> Can we not have another PCI bridge node in the DT for the host device
> assignments?

I'd hate to have to do that. PCI should be entirely probeable
given that we tell the guest where the host bridge is, that's
one of its advantages.

-- PMM

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-05 11:52                                             ` Peter Maydell
@ 2015-03-05 12:03                                               ` Catalin Marinas
  -1 siblings, 0 replies; 110+ messages in thread
From: Catalin Marinas @ 2015-03-05 12:03 UTC (permalink / raw)
  To: Peter Maydell
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Paolo Bonzini, Laszlo Ersek, kvmarm, linux-arm-kernel

On Thu, Mar 05, 2015 at 08:52:49PM +0900, Peter Maydell wrote:
> On 5 March 2015 at 20:04, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote:
> >> On 04/03/2015 18:28, Catalin Marinas wrote:
> >> >> > Can you add that property to the device tree for PCI devices too?
> >> >
> >> > Yes but not with mainline yet:
> >> >
> >> > http://thread.gmane.org/gmane.linux.kernel.iommu/8935
> >> >
> >> > We can add the property at the PCI host bridge level (with the drawback
> >> > that it covers all the PCI devices), like here:
> >>
> >> Even covering all PCI devices is not enough if we want to support device
> >> assignment of PCI host devices.
> >
> > Can we not have another PCI bridge node in the DT for the host device
> > assignments?
> 
> I'd hate to have to do that. PCI should be entirely probeable
> given that we tell the guest where the host bridge is, that's
> one of its advantages.

I didn't say a DT node per device; the DT doesn't know what PCI devices
are available (otherwise it would defeat the idea of probing). But we
need to tell the OS where the host bridge is via DT.

So the guest would be told about two host bridges: one for real devices
and another for virtual devices. These can have different coherency
properties.

-- 
Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-05 12:03                                               ` Catalin Marinas
@ 2015-03-05 12:26                                                 ` Paolo Bonzini
  -1 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-03-05 12:26 UTC (permalink / raw)
  To: Catalin Marinas, Peter Maydell
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Laszlo Ersek, kvmarm, linux-arm-kernel



On 05/03/2015 13:03, Catalin Marinas wrote:
>> > 
>> > I'd hate to have to do that. PCI should be entirely probeable
>> > given that we tell the guest where the host bridge is, that's
>> > one of its advantages.
> I didn't say a DT node per device, the DT doesn't know what PCI devices
> are available (otherwise it defeats the idea of probing). But we need to
> tell the OS where the host bridge is via DT.
> 
> So the guest would be told about two host bridges: one for real devices
> and another for virtual devices. These can have different coherency
> properties.

Yeah, and it would suck that the user needs to know the difference
between the coherency properties of the host bridges.

It would especially suck if the user has a cluster with different
machines, some of them coherent and others non-coherent, and then has to
debug why the same configuration works on some machines and not on others.


To avoid replying in two different places, which of the solutions look
to me like something that half-works?  Pretty much all of them, because
in the end it is just a processor misfeature.  For example, Intel
virtualization extensions let the hypervisor override stage1 translation
_if necessary_.  AMD doesn't, but has some other quirky things that let
you achieve the same effect.

In particular, I am not even sure that this is about bus coherency,
because this problem does not happen when the device is doing bus master
DMA.  Working around coherency for bus master DMA would be easy.

The problem arises with MMIO areas that the guest can reasonably expect
to be uncacheable, but that are optimized by the host so that they end
up backed by cacheable RAM.  It's perfectly reasonable that the same
device needs cacheable mapping with one userspace, and works with
uncacheable mapping with another userspace that doesn't optimize the
MMIO area to RAM.

Currently the VGA framebuffer is the main case where this happens, and I
don't expect many more.  Because this is not bus master DMA, it's hard
to find a QEMU API that can be hooked to invalidate the cache.  QEMU is
just reading from an array of chars.

In practice, the VGA framebuffer has an optimization that uses dirty
page tracking, so we could piggyback on the ioctls that return which
pages are dirty.  It turns out that piggybacking on those ioctls also
should fix the case of migrating a guest while the MMU is disabled.

But it's a hack, and it may not work for other devices.

We could use _DSD to export the device tree property separately for each
device, but that wouldn't work for hotplugged devices.

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-05 12:26                                                 ` Paolo Bonzini
@ 2015-03-05 14:58                                                   ` Catalin Marinas
  -1 siblings, 0 replies; 110+ messages in thread
From: Catalin Marinas @ 2015-03-05 14:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Maydell, KVM devel mailing list, Ard Biesheuvel,
	Marc Zyngier, Laszlo Ersek, kvmarm, linux-arm-kernel

On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote:
> On 05/03/2015 13:03, Catalin Marinas wrote:
> >> > I'd hate to have to do that. PCI should be entirely probeable
> >> > given that we tell the guest where the host bridge is, that's
> >> > one of its advantages.
> > I didn't say a DT node per device, the DT doesn't know what PCI devices
> > are available (otherwise it defeats the idea of probing). But we need to
> > tell the OS where the host bridge is via DT.
> > 
> > So the guest would be told about two host bridges: one for real devices
> > and another for virtual devices. These can have different coherency
> > properties.
> 
> Yeah, and it would suck that the user needs to know the difference
> between the coherency proprties of the host bridges.

The host needs to know about this, unless we assume full coherency on
all the platforms. Arguably, Qemu needs to know as well if it is the one
generating the DT for the guest (or at least passing some snippets from
the host DT).

> It would especially suck if the user has a cluster with different
> machines, some of them coherent and others non-coherent, and then has to
> debug why the same configuration works on some machines and not on others.

That's a problem indeed, especially with guest migration. But I don't
think we have any sane solution here for the bus master DMA.

> To avoid replying in two different places, which of the solutions look
> to me like something that half-works?  Pretty much all of them, because
> in the end it is just a processor misfeature.  For example, Intel
> virtualization extensions let the hypervisor override stage1 translation
> _if necessary_.  AMD doesn't, but has some other quirky things that let
> you achieve the same effect..

ARM can override them as well, but only by making them stricter. Otherwise,
on a weakly ordered architecture, it's not always safe (let's say the
guest thinks it accesses Strongly Ordered memory and avoids barriers for
flag updates but the host "upgrades" it to Cacheable which breaks the
memory order).

If we want the host to enforce guest memory mapping attributes via stage
2, we could do it the other way around: get the guests to always assume
full cache coherency, generating Normal Cacheable mappings, but use the
stage 2 attributes restriction in the host to make such mappings
non-cacheable when needed (it works this way on ARM but not in the other
direction to relax the attributes).

> In particular, I am not even sure that this is about bus coherency,
> because this problem does not happen when the device is doing bus master
> DMA.  Working around coherency for bus master DMA would be easy.

My previous emails on the "dma-coherent" property were only about bus
master DMA (which would cause the correct selection of the DMA API ops
in the guest).

But even for bus master DMA, the guest OS still needs to be aware of the
(virtual) device DMA capabilities (cache coherent or not). You may be
able to work around it in the host (stage 2, explicit cache flushing or
SMMU attributes) if the guest assumes non-coherency, but it's not really
efficient (nor nice to implement).

> The problem arises with MMIO areas that the guest can reasonably expect
> to be uncacheable, but that are optimized by the host so that they end
> up backed by cacheable RAM.  It's perfectly reasonable that the same
> device needs cacheable mapping with one userspace, and works with
> uncacheable mapping with another userspace that doesn't optimize the
> MMIO area to RAM.

Unless the guest allocates the framebuffer itself (e.g.
dma_alloc_coherent), we can't control the cacheability via
"dma-coherent" properties as it refers to bus master DMA.

So for MMIO with the buffer allocated by the host (Qemu), the only
solution I see on ARM is for the host to ensure coherency, either via
explicit cache maintenance (new KVM API) or by changing the memory
attributes used by Qemu to access such virtual MMIO.

Basically Qemu is acting as a bus master when reading the framebuffer it
allocated but the guest considers it a slave access and we don't have a
way to tell the guest that such accesses should be cacheable, nor can we
upgrade them via architecture features.

> Currently the VGA framebuffer is the main case where this happens, and I
> don't expect many more.  Because this is not bus master DMA, it's hard
> to find a QEMU API that can be hooked to invalidate the cache.  QEMU is
> just reading from an array of chars.

I now understand the problem better. I was under the impression that the
guest allocates the framebuffer itself and tells Qemu where it is (like
in amba-clcd.c for example).

> In practice, the VGA framebuffer has an optimization that uses dirty
> page tracking, so we could piggyback on the ioctls that return which
> pages are dirty.  It turns out that piggybacking on those ioctls also
> should fix the case of migrating a guest while the MMU is disabled.

Yes, Qemu would need to invalidate the cache before reading a dirty
framebuffer page.

As I said above, an API that allows non-cacheable mappings for the VGA
framebuffer in Qemu would also solve the problem. I'm not sure what KVM
provides here (or whether we can add such an API).

> We could use _DSD to export the device tree property separately for each
> device, but that wouldn't work for hotplugged devices.

This would only work for bus master DMA, so it doesn't solve the VGA
framebuffer issue.

-- 
Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-05 14:58                                                   ` Catalin Marinas
@ 2015-03-05 17:43                                                     ` Paolo Bonzini
  -1 siblings, 0 replies; 110+ messages in thread
From: Paolo Bonzini @ 2015-03-05 17:43 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Laszlo Ersek, kvmarm, linux-arm-kernel



On 05/03/2015 15:58, Catalin Marinas wrote:
>> It would especially suck if the user has a cluster with different
>> machines, some of them coherent and others non-coherent, and then has to
>> debug why the same configuration works on some machines and not on others.
> 
> That's a problem indeed, especially with guest migration. But I don't
> think we have any sane solution here for the bus master DMA.

I do not oppose doing cache management in QEMU for bus master DMA
(though if the solution you outlined below works it would be great).

> ARM can override them as well, but only by making them stricter. Otherwise,
> on a weakly ordered architecture, it's not always safe (let's say the
> guest thinks it accesses Strongly Ordered memory and avoids barriers for
> flag updates but the host "upgrades" it to Cacheable which breaks the
> memory order).

The same can happen on x86 though, even if it's rarer.  You still need a
barrier between stores and loads.

> If we want the host to enforce guest memory mapping attributes via stage
> 2, we could do it the other way around: get the guests to always assume
> full cache coherency, generating Normal Cacheable mappings, but use the
> stage 2 attributes restriction in the host to make such mappings
> non-cacheable when needed (it works this way on ARM but not in the other
> direction to relax the attributes).

That sounds like a plan for device assignment.  But it still would not
solve the problem of the MMIO framebuffer, right?

>> The problem arises with MMIO areas that the guest can reasonably expect
>> to be uncacheable, but that are optimized by the host so that they end
>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>> device needs cacheable mapping with one userspace, and works with
>> uncacheable mapping with another userspace that doesn't optimize the
>> MMIO area to RAM.
> 
> Unless the guest allocates the framebuffer itself (e.g.
> dma_alloc_coherent), we can't control the cacheability via
> "dma-coherent" properties as it refers to bus master DMA.

Okay, it's good to rule that out.  One less thing to think about. :)
Same for _DSD.

> So for MMIO with the buffer allocated by the host (Qemu), the only
> solution I see on ARM is for the host to ensure coherency, either via
> explicit cache maintenance (new KVM API) or by changing the memory
> attributes used by Qemu to access such virtual MMIO.
> 
> Basically Qemu is acting as a bus master when reading the framebuffer it
> allocated but the guest considers it a slave access and we don't have a
> way to tell the guest that such accesses should be cacheable, nor can we
> upgrade them via architecture features.

Yes, that's a way to put it.

>> In practice, the VGA framebuffer has an optimization that uses dirty
>> page tracking, so we could piggyback on the ioctls that return which
>> pages are dirty.  It turns out that piggybacking on those ioctls also
>> should fix the case of migrating a guest while the MMU is disabled.
> 
> Yes, Qemu would need to invalidate the cache before reading a dirty
> framebuffer page.
> 
> As I said above, an API that allows non-cacheable mappings for the VGA
> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
> provides here (or whether we can add such API).

Nothing for now; other architectures simply do not have the issue.

As long as it's just VGA, we can quirk it.  There's just a couple
vendor/device IDs to catch, and the guest can then use a cacheable mapping.

For a more generic solution, the API would be madvise(MADV_DONTCACHE).
It would be easy for QEMU to use it, but I am not too optimistic about
convincing the mm folks about it.  We can try.

Paolo

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-05 14:58                                                   ` Catalin Marinas
@ 2015-03-05 19:13                                                     ` Ard Biesheuvel
  -1 siblings, 0 replies; 110+ messages in thread
From: Ard Biesheuvel @ 2015-03-05 19:13 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: KVM devel mailing list, Marc Zyngier, Paolo Bonzini,
	Laszlo Ersek, kvmarm, linux-arm-kernel

On 5 March 2015 at 15:58, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote:
>> On 05/03/2015 13:03, Catalin Marinas wrote:
>> >> > I'd hate to have to do that. PCI should be entirely probeable
>> >> > given that we tell the guest where the host bridge is, that's
>> >> > one of its advantages.
>> > I didn't say a DT node per device, the DT doesn't know what PCI devices
>> > are available (otherwise it defeats the idea of probing). But we need to
>> > tell the OS where the host bridge is via DT.
>> >
>> > So the guest would be told about two host bridges: one for real devices
>> > and another for virtual devices. These can have different coherency
>> > properties.
>>
>> Yeah, and it would suck that the user needs to know the difference
>> between the coherency properties of the host bridges.
>
> The host needs to know about this, unless we assume full coherency on
> all the platforms. Arguably, Qemu needs to know as well if it is the one
> generating the DT for guest (or at least passing some snippets from the
> host DT).
>
>> It would especially suck if the user has a cluster with different
>> machines, some of them coherent and others non-coherent, and then has to
>> debug why the same configuration works on some machines and not on others.
>
> That's a problem indeed, especially with guest migration. But I don't
> think we have any sane solution here for the bus master DMA.
>
>> To avoid replying in two different places, which of the solutions look
>> to me like something that half-works?  Pretty much all of them, because
>> in the end it is just a processor misfeature.  For example, Intel
>> virtualization extensions let the hypervisor override stage1 translation
>> _if necessary_.  AMD doesn't, but has some other quirky things that let
> you achieve the same effect.
>
> ARM can override them as well, but only by making them stricter. Otherwise,
> on a weakly ordered architecture, it's not always safe (let's say the
> guest thinks it accesses Strongly Ordered memory and avoids barriers for
> flag updates but the host "upgrades" it to Cacheable which breaks the
> memory order).
>
> If we want the host to enforce guest memory mapping attributes via stage
> 2, we could do it the other way around: get the guests to always assume
> full cache coherency, generating Normal Cacheable mappings, but use the
> stage 2 attributes restriction in the host to make such mappings
> non-cacheable when needed (it works this way on ARM but not in the other
> direction to relax the attributes).
>

This was precisely the idea of the MAIR mangling patch: promote all
stage1 mappings to cacheable, and put the host in control by allowing
it to supersede them with device mappings in stage 2.
But it appears there are too many corner cases where this doesn't
quite work out.


>> In particular, I am not even sure that this is about bus coherency,
>> because this problem does not happen when the device is doing bus master
>> DMA.  Working around coherency for bus master DMA would be easy.
>
> My previous emails on the "dma-coherent" property were only about bus
> master DMA (which would cause the correct selection of the DMA API ops
> in the guest).
>
> But even for bus master DMA, guest OS still needs to be aware of the
> (virtual) device DMA capabilities (cache coherent or not). You may be
> able to work around it in the host (stage 2, explicit cache flushing or
> SMMU attributes) if the guest assumes non-coherency, but it's not really
> efficient (nor nice to implement).
>
>> The problem arises with MMIO areas that the guest can reasonably expect
>> to be uncacheable, but that are optimized by the host so that they end
>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>> device needs cacheable mapping with one userspace, and works with
>> uncacheable mapping with another userspace that doesn't optimize the
>> MMIO area to RAM.
>
> Unless the guest allocates the framebuffer itself (e.g.
> dma_alloc_coherent), we can't control the cacheability via
> "dma-coherent" properties as it refers to bus master DMA.
>
> So for MMIO with the buffer allocated by the host (Qemu), the only
> solution I see on ARM is for the host to ensure coherency, either via
> explicit cache maintenance (new KVM API) or by changing the memory
> attributes used by Qemu to access such virtual MMIO.
>
> Basically Qemu is acting as a bus master when reading the framebuffer it
> allocated but the guest considers it a slave access and we don't have a
> way to tell the guest that such accesses should be cacheable, nor can we
> upgrade them via architecture features.
>
>> Currently the VGA framebuffer is the main case where this happens, and I
>> don't expect many more.  Because this is not bus master DMA, it's hard
>> to find a QEMU API that can be hooked to invalidate the cache.  QEMU is
>> just reading from an array of chars.
>
> I now understand the problem better. I was under the impression that the
> guest allocates the framebuffer itself and tells Qemu where it is (like
> in amba-clcd.c for example).
>
>> In practice, the VGA framebuffer has an optimization that uses dirty
>> page tracking, so we could piggyback on the ioctls that return which
>> pages are dirty.  It turns out that piggybacking on those ioctls also
>> should fix the case of migrating a guest while the MMU is disabled.
>
> Yes, Qemu would need to invalidate the cache before reading a dirty
> framebuffer page.
>
> As I said above, an API that allows non-cacheable mappings for the VGA
> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
> provides here (or whether we can add such API).
>
>> We could use _DSD to export the device tree property separately for each
>> device, but that wouldn't work for hotplugged devices.
>
> This would only work for bus master DMA, so it doesn't solve the VGA
> framebuffer issue.
>
> --
> Catalin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-04 11:35                         ` Catalin Marinas
@ 2015-03-06 20:33                           ` Mario Smarduch
  -1 siblings, 0 replies; 110+ messages in thread
From: Mario Smarduch @ 2015-03-06 20:33 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Christoffer Dall, KVM devel mailing list, Ard Biesheuvel,
	Marc Zyngier, Paolo Bonzini, Laszlo Ersek, kvmarm,
	linux-arm-kernel

On 03/04/2015 03:35 AM, Catalin Marinas wrote:
> (please try to avoid top-posting)
> 
> On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote:
>> On 03/02/2015 08:31 AM, Christoffer Dall wrote:
>>> However, my concern with these patches are on two points:
>>>
>>> 1. It's not a fix-all.  We still have the case where the guest expects
>>> the behavior of device memory (for strong ordering for example) on a RAM
>>> region, which we now break.  Similarly this doesn't support the
>>> non-coherent DMA to RAM region case.
>>>
>>> 2. While the code is probably as nice as this kind of stuff gets, it
>>> is non-trivial and extremely difficult to debug.  The counter-point here
>>> is that we may end up handling other stuff at EL2 for performance reasons
>>> in the future.
>>>
>>> Mainly because of point 1 above, I am leaning towards thinking userspace
>>> should do the invalidation when it knows it needs to, either through KVM
>>> via a memslot flag or through some other syscall mechanism.
> 
> I expressed my concerns as well, I'm definitely against merging this
> series.
> 
>> I don't understand how the CPU can handle different cache attributes
>> used by QEMU and the guest; won't you run into the B2.9 checklist?
>> Wouldn't cache evictions or cleans wipe out guest updates to the same
>> cache line(s)?
> 
> "Clean+invalidate" is a safe operation even if the guest accesses the
> memory in a cacheable way. But if the guest can update the cache lines,
> Qemu should avoid cache maintenance from a performance perspective.
> 
> The guest is either told that the DMA is coherent (via DT properties) or
> Qemu deals with (non-)coherency itself. The latter is fully in line with
> the B2.9 chapter in the ARM ARM, more precisely point 5:
> 
>   If the mismatched attributes for a memory location all assign the same
>   shareability attribute to the location, any loss of uniprocessor
>   semantics or coherency within a shareability domain can be avoided by
>   use of software cache management.
> 
> ... it continues with what kind of cache maintenance is required,
> together with:
> 
>   A clean and invalidate instruction can be used instead of a clean
>   instruction, or instead of an invalidate instruction.
> 
Hi Catalin,
  sorry for the top posting. I'm struggling with QEMU cache
maintenance for devices whose registers are not cache line aligned
and that may be multi-function. For lack of a better example I
thought of the sp804, which supports two timers with registers
covered by one cache line. Wouldn't QEMU cache maintenance of one
device have the potential to corrupt the second device? The two
could be used by two guest threads in parallel.

I get bullets 2 and 3; I'm still working on the 1st one, it will take a while.

Thanks,
- Mario

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-05 17:43                                                     ` Paolo Bonzini
@ 2015-03-06 21:08                                                       ` Mario Smarduch
  -1 siblings, 0 replies; 110+ messages in thread
From: Mario Smarduch @ 2015-03-06 21:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Catalin Marinas, KVM devel mailing list, Ard Biesheuvel,
	Marc Zyngier, Laszlo Ersek, kvmarm, linux-arm-kernel

On 03/05/2015 09:43 AM, Paolo Bonzini wrote:
> 
> 
> On 05/03/2015 15:58, Catalin Marinas wrote:
>>> It would especially suck if the user has a cluster with different
>>> machines, some of them coherent and others non-coherent, and then has to
>>> debug why the same configuration works on some machines and not on others.
>>
>> That's a problem indeed, especially with guest migration. But I don't
>> think we have any sane solution here for the bus master DMA.
> 
> I do not oppose doing cache management in QEMU for bus master DMA
> (though if the solution you outlined below works it would be great).
> 
>> ARM can override them as well but only making them stricter. Otherwise,
>> on a weakly ordered architecture, it's not always safe (let's say the
>> guest thinks it accesses Strongly Ordered memory and avoids barriers for
>> flag updates but the host "upgrades" it to Cacheable which breaks the
>> memory order).
> 
> The same can happen on x86 though, even if it's rarer.  You still need a
> barrier between stores and loads.
> 
>> If we want the host to enforce guest memory mapping attributes via stage
>> 2, we could do it the other way around: get the guests to always assume
>> full cache coherency, generating Normal Cacheable mappings, but use the
>> stage 2 attributes restriction in the host to make such mappings
>> non-cacheable when needed (it works this way on ARM but not in the other
>> direction to relax the attributes).
> 
> That sounds like a plan for device assignment.  But it still would not
> solve the problem of the MMIO framebuffer, right?
> 
>>> The problem arises with MMIO areas that the guest can reasonably expect
>>> to be uncacheable, but that are optimized by the host so that they end
>>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>>> device needs cacheable mapping with one userspace, and works with
>>> uncacheable mapping with another userspace that doesn't optimize the
>>> MMIO area to RAM.
>>
>> Unless the guest allocates the framebuffer itself (e.g.
>> dma_alloc_coherent), we can't control the cacheability via
>> "dma-coherent" properties as it refers to bus master DMA.
> 
> Okay, it's good to rule that out.  One less thing to think about. :)
> Same for _DSD.
> 
>> So for MMIO with the buffer allocated by the host (Qemu), the only
>> solution I see on ARM is for the host to ensure coherency, either via
>> explicit cache maintenance (new KVM API) or by changing the memory
>> attributes used by Qemu to access such virtual MMIO.
>>
>> Basically Qemu is acting as a bus master when reading the framebuffer it
>> allocated but the guest considers it a slave access and we don't have a
>> way to tell the guest that such accesses should be cacheable, nor can we
>> upgrade them via architecture features.
> 
> Yes, that's a way to put it.
> 
>>> In practice, the VGA framebuffer has an optimization that uses dirty
>>> page tracking, so we could piggyback on the ioctls that return which
>>> pages are dirty.  It turns out that piggybacking on those ioctls also
>>> should fix the case of migrating a guest while the MMU is disabled.
>>
>> Yes, Qemu would need to invalidate the cache before reading a dirty
>> framebuffer page.
>>
>> As I said above, an API that allows non-cacheable mappings for the VGA
>> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
>> provides here (or whether we can add such API).
> 
> Nothing for now; other architectures simply do not have the issue.
> 
> As long as it's just VGA, we can quirk it.  There's just a couple
> vendor/device IDs to catch, and the guest can then use a cacheable mapping.
> 
> For a more generic solution, the API would be madvise(MADV_DONTCACHE).
> It would be easy for QEMU to use it, but I am not too optimistic about
> convincing the mm folks about it.  We can try.

Interested to see the outcome.

I was thinking of a very basic memory driver that can provide
an uncached memslot to QEMU: in its mmap() file operation it
would apply pgprot_uncached to the allocated pages, lock them,
flush the TLB and call remap_pfn_range().

Mario

> 
> Paolo
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> 


* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-06 21:08                                                       ` Mario Smarduch
@ 2015-03-09 14:26                                                         ` Andrew Jones
  -1 siblings, 0 replies; 110+ messages in thread
From: Andrew Jones @ 2015-03-09 14:26 UTC (permalink / raw)
  To: Mario Smarduch
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Catalin Marinas, Paolo Bonzini, Laszlo Ersek, kvmarm,
	linux-arm-kernel

On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote:
> On 03/05/2015 09:43 AM, Paolo Bonzini wrote:
> > 
> > 
> > On 05/03/2015 15:58, Catalin Marinas wrote:
> >>> It would especially suck if the user has a cluster with different
> >>> machines, some of them coherent and others non-coherent, and then has to
> >>> debug why the same configuration works on some machines and not on others.
> >>
> >> That's a problem indeed, especially with guest migration. But I don't
> >> think we have any sane solution here for the bus master DMA.
> > 
> > I do not oppose doing cache management in QEMU for bus master DMA
> > (though if the solution you outlined below works it would be great).
> > 
> >> ARM can override them as well but only making them stricter. Otherwise,
> >> on a weakly ordered architecture, it's not always safe (let's say the
> >> guest thinks it accesses Strongly Ordered memory and avoids barriers for
> >> flag updates but the host "upgrades" it to Cacheable which breaks the
> >> memory order).
> > 
> > The same can happen on x86 though, even if it's rarer.  You still need a
> > barrier between stores and loads.
> > 
> >> If we want the host to enforce guest memory mapping attributes via stage
> >> 2, we could do it the other way around: get the guests to always assume
> >> full cache coherency, generating Normal Cacheable mappings, but use the
> >> stage 2 attributes restriction in the host to make such mappings
> >> non-cacheable when needed (it works this way on ARM but not in the other
> >> direction to relax the attributes).
> > 
> > That sounds like a plan for device assignment.  But it still would not
> > solve the problem of the MMIO framebuffer, right?
> > 
> >>> The problem arises with MMIO areas that the guest can reasonably expect
> >>> to be uncacheable, but that are optimized by the host so that they end
> >>> up backed by cacheable RAM.  It's perfectly reasonable that the same
> >>> device needs cacheable mapping with one userspace, and works with
> >>> uncacheable mapping with another userspace that doesn't optimize the
> >>> MMIO area to RAM.
> >>
> >> Unless the guest allocates the framebuffer itself (e.g.
> >> dma_alloc_coherent), we can't control the cacheability via
> >> "dma-coherent" properties as it refers to bus master DMA.
> > 
> > Okay, it's good to rule that out.  One less thing to think about. :)
> > Same for _DSD.
> > 
> >> So for MMIO with the buffer allocated by the host (Qemu), the only
> >> solution I see on ARM is for the host to ensure coherency, either via
> >> explicit cache maintenance (new KVM API) or by changing the memory
> >> attributes used by Qemu to access such virtual MMIO.
> >>
> >> Basically Qemu is acting as a bus master when reading the framebuffer it
> >> allocated but the guest considers it a slave access and we don't have a
> >> way to tell the guest that such accesses should be cacheable, nor can we
> >> upgrade them via architecture features.
> > 
> > Yes, that's a way to put it.
> > 
> >>> In practice, the VGA framebuffer has an optimization that uses dirty
> >>> page tracking, so we could piggyback on the ioctls that return which
> >>> pages are dirty.  It turns out that piggybacking on those ioctls also
> >>> should fix the case of migrating a guest while the MMU is disabled.
> >>
> >> Yes, Qemu would need to invalidate the cache before reading a dirty
> >> framebuffer page.
> >>
> >> As I said above, an API that allows non-cacheable mappings for the VGA
> >> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
> >> provides here (or whether we can add such API).
> > 
> > Nothing for now; other architectures simply do not have the issue.
> > 
> > As long as it's just VGA, we can quirk it.  There's just a couple
> > vendor/device IDs to catch, and the guest can then use a cacheable mapping.
> > 
> > For a more generic solution, the API would be madvise(MADV_DONTCACHE).
> > It would be easy for QEMU to use it, but I am not too optimistic about
> > convincing the mm folks about it.  We can try.

I forgot to list this one in my summary of approaches[*]. It is a
nice, clean approach that avoids getting cache maintenance into
everything. However, besides the difficulty of getting it past the mm
people, it reduces performance for any userspace-userspace use or
sharing of the memory; only userspace-guest sharing requires cache
maintenance. Maybe that's not an important concern for the few
emulated devices that need it, though.

> 
> Interested to see the outcome.
> 
> I was thinking of a very basic memory driver that can provide
> an uncached memslot to QEMU - in mmap() file operation
> apply pgprot_uncached to allocated pages, lock them, flush TLB
> call remap_pfn_range().

I guess this is the same as the madvise approach, but with a driver.
KVM could take this approach itself when memslots are added/updated
with the INCOHERENT flag. Maybe worth some experimental patches to
find out?

I'm still thinking about experimenting with the ARM private syscalls
next though.

drew

[*] http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html

* Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
  2015-03-09 14:26                                                         ` Andrew Jones
@ 2015-03-09 15:33                                                           ` Mario Smarduch
  -1 siblings, 0 replies; 110+ messages in thread
From: Mario Smarduch @ 2015-03-09 15:33 UTC (permalink / raw)
  To: Andrew Jones
  Cc: KVM devel mailing list, Ard Biesheuvel, Marc Zyngier,
	Catalin Marinas, Paolo Bonzini, Laszlo Ersek, kvmarm,
	linux-arm-kernel

On 03/09/2015 07:26 AM, Andrew Jones wrote:
> On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote:
>> On 03/05/2015 09:43 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 05/03/2015 15:58, Catalin Marinas wrote:
>>>>> It would especially suck if the user has a cluster with different
>>>>> machines, some of them coherent and others non-coherent, and then has to
>>>>> debug why the same configuration works on some machines and not on others.
>>>>
>>>> That's a problem indeed, especially with guest migration. But I don't
>>>> think we have any sane solution here for the bus master DMA.
>>>
>>> I do not oppose doing cache management in QEMU for bus master DMA
>>> (though if the solution you outlined below works it would be great).
>>>
>>>> ARM can override them as well but only making them stricter. Otherwise,
>>>> on a weakly ordered architecture, it's not always safe (let's say the
>>>> guest thinks it accesses Strongly Ordered memory and avoids barriers for
>>>> flag updates but the host "upgrades" it to Cacheable which breaks the
>>>> memory order).
>>>
>>> The same can happen on x86 though, even if it's rarer.  You still need a
>>> barrier between stores and loads.
>>>
>>>> If we want the host to enforce guest memory mapping attributes via stage
>>>> 2, we could do it the other way around: get the guests to always assume
>>>> full cache coherency, generating Normal Cacheable mappings, but use the
>>>> stage 2 attributes restriction in the host to make such mappings
>>>> non-cacheable when needed (it works this way on ARM but not in the other
>>>> direction to relax the attributes).
>>>
>>> That sounds like a plan for device assignment.  But it still would not
>>> solve the problem of the MMIO framebuffer, right?
>>>
>>>>> The problem arises with MMIO areas that the guest can reasonably expect
>>>>> to be uncacheable, but that are optimized by the host so that they end
>>>>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>>>>> device needs cacheable mapping with one userspace, and works with
>>>>> uncacheable mapping with another userspace that doesn't optimize the
>>>>> MMIO area to RAM.
>>>>
>>>> Unless the guest allocates the framebuffer itself (e.g.
>>>> dma_alloc_coherent), we can't control the cacheability via
>>>> "dma-coherent" properties as it refers to bus master DMA.
>>>
>>> Okay, it's good to rule that out.  One less thing to think about. :)
>>> Same for _DSD.
>>>
>>>> So for MMIO with the buffer allocated by the host (Qemu), the only
>>>> solution I see on ARM is for the host to ensure coherency, either via
>>>> explicit cache maintenance (new KVM API) or by changing the memory
>>>> attributes used by Qemu to access such virtual MMIO.
>>>>
>>>> Basically Qemu is acting as a bus master when reading the framebuffer it
>>>> allocated but the guest considers it a slave access and we don't have a
>>>> way to tell the guest that such accesses should be cacheable, nor can we
>>>> upgrade them via architecture features.
>>>
>>> Yes, that's a way to put it.
>>>
>>>>> In practice, the VGA framebuffer has an optimization that uses dirty
>>>>> page tracking, so we could piggyback on the ioctls that return which
>>>>> pages are dirty.  It turns out that piggybacking on those ioctls also
>>>>> should fix the case of migrating a guest while the MMU is disabled.
>>>>
>>>> Yes, Qemu would need to invalidate the cache before reading a dirty
>>>> framebuffer page.
>>>>
>>>> As I said above, an API that allows non-cacheable mappings for the VGA
>>>> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
>>>> provides here (or whether we can add such API).
>>>
>>> Nothing for now; other architectures simply do not have the issue.
>>>
>>> As long as it's just VGA, we can quirk it.  There's just a couple
>>> vendor/device IDs to catch, and the guest can then use a cacheable mapping.
>>>
>>> For a more generic solution, the API would be madvise(MADV_DONTCACHE).
>>> It would be easy for QEMU to use it, but I am not too optimistic about
>>> convincing the mm folks about it.  We can try.
> 
> I forgot to list this one in my summary of approaches[*]. This is a
> nice, clean approach. Avoids getting cache maintenance into everything.
> However, besides the difficulty to get it past mm people, it reduces
> performance for any userspace-userspace uses/sharing of the memory.
> userspace-guest requires cache maintenance, but nothing else. Maybe
> that's not an important concern for the few emulated devices that need
> it though.
> 
>>
>> Interested to see the outcome.
>>
>> I was thinking of a very basic memory driver that can provide
>> an uncached memslot to QEMU - in mmap() file operation
>> apply pgprot_uncached to allocated pages, lock them, flush TLB
>> call remap_pfn_range().
> 
> I guess this is the same as the madvise approach, but with a driver.
> KVM could take this approach itself when memslots are added/updated
> with the INCOHERENT flag. Maybe worth some experimental patches to
> find out?

I would work on this but I'm tied up for the next 3 weeks.
If anyone is interested I can provide the base code; I used it
for memory passthrough, although testing may be time consuming.
I think the hurdle here is making sure the kernel doesn't touch
these pages for any reason, such as page migration; locking the
pages should tell the kernel to leave them alone. madvise() is the
desired solution but I suspect it might take a while to get in.
> 
> I'm still thinking about experimenting with the ARM private syscalls
> next though.

Hope it succeeds.
> 
> drew
> 
> [*] http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
@ 2015-03-09 15:33                                                           ` Mario Smarduch
  0 siblings, 0 replies; 110+ messages in thread
From: Mario Smarduch @ 2015-03-09 15:33 UTC (permalink / raw)
  To: linux-arm-kernel

On 03/09/2015 07:26 AM, Andrew Jones wrote:
> On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote:
>> On 03/05/2015 09:43 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 05/03/2015 15:58, Catalin Marinas wrote:
>>>>> It would especially suck if the user has a cluster with different
>>>>> machines, some of them coherent and others non-coherent, and then has to
>>>>> debug why the same configuration works on some machines and not on others.
>>>>
>>>> That's a problem indeed, especially with guest migration. But I don't
>>>> think we have any sane solution here for the bus master DMA.
>>>
>>> I do not oppose doing cache management in QEMU for bus master DMA
>>> (though if the solution you outlined below works it would be great).
>>>
>>>> ARM can override them as well, but only by making them stricter. Otherwise,
>>>> on a weakly ordered architecture it's not always safe (let's say the
>>>> guest thinks it accesses Strongly Ordered memory and avoids barriers for
>>>> flag updates, but the host "upgrades" it to Cacheable, which breaks the
>>>> memory order).
>>>
>>> The same can happen on x86 though, even if it's rarer.  You still need a
>>> barrier between stores and loads.
>>>
>>>> If we want the host to enforce guest memory mapping attributes via stage
>>>> 2, we could do it the other way around: get the guests to always assume
>>>> full cache coherency, generating Normal Cacheable mappings, but use the
>>>> stage 2 attributes restriction in the host to make such mappings
>>>> non-cacheable when needed (it works this way on ARM but not in the other
>>>> direction to relax the attributes).
>>>
>>> That sounds like a plan for device assignment.  But it still would not
>>> solve the problem of the MMIO framebuffer, right?
>>>
>>>>> The problem arises with MMIO areas that the guest can reasonably expect
>>>>> to be uncacheable, but that are optimized by the host so that they end
>>>>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>>>>> device needs cacheable mapping with one userspace, and works with
>>>>> uncacheable mapping with another userspace that doesn't optimize the
>>>>> MMIO area to RAM.
>>>>
>>>> Unless the guest allocates the framebuffer itself (e.g.
>>>> dma_alloc_coherent), we can't control the cacheability via
>>>> "dma-coherent" properties as it refers to bus master DMA.
>>>
>>> Okay, it's good to rule that out.  One less thing to think about. :)
>>> Same for _DSD.
>>>
>>>> So for MMIO with the buffer allocated by the host (Qemu), the only
>>>> solution I see on ARM is for the host to ensure coherency, either via
>>>> explicit cache maintenance (new KVM API) or by changing the memory
>>>> attributes used by Qemu to access such virtual MMIO.
>>>>
>>>> Basically Qemu is acting as a bus master when reading the framebuffer it
>>>> allocated but the guest considers it a slave access and we don't have a
>>>> way to tell the guest that such accesses should be cacheable, nor can we
>>>> upgrade them via architecture features.
>>>
>>> Yes, that's a way to put it.
>>>
>>>>> In practice, the VGA framebuffer has an optimization that uses dirty
>>>>> page tracking, so we could piggyback on the ioctls that return which
>>>>> pages are dirty.  It turns out that piggybacking on those ioctls also
>>>>> should fix the case of migrating a guest while the MMU is disabled.
>>>>
>>>> Yes, Qemu would need to invalidate the cache before reading a dirty
>>>> framebuffer page.
>>>>
>>>> As I said above, an API that allows non-cacheable mappings for the VGA
>>>> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
>>>> provides here (or whether we can add such API).
>>>
>>> Nothing for now; other architectures simply do not have the issue.
>>>
>>> As long as it's just VGA, we can quirk it.  There's just a couple
>>> vendor/device IDs to catch, and the guest can then use a cacheable mapping.
>>>
>>> For a more generic solution, the API would be madvise(MADV_DONTCACHE).
>>> It would be easy for QEMU to use it, but I am not too optimistic about
>>> convincing the mm folks about it.  We can try.
> 
> I forgot to list this one in my summary of approaches[*]. This is a
> nice, clean approach that avoids getting cache maintenance into everything.
> However, besides the difficulty of getting it past the mm people, it reduces
> performance for any userspace-to-userspace use/sharing of the memory;
> only userspace-to-guest sharing requires cache maintenance. Maybe
> that's not an important concern for the few emulated devices that need
> it though.
> 
>>
>> Interested to see the outcome.
>>
>> I was thinking of a very basic memory driver that can provide
>> an uncached memslot to QEMU: in the mmap() file operation,
>> apply pgprot_uncached to the allocated pages, lock them, flush
>> the TLB and call remap_pfn_range().
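A rough sketch of what such a driver's mmap handler could look like (illustrative kernel-code fragment only, not buildable as-is; `uncached_dev`, `base_pfn` and `nr_pages` are assumed driver state, and mainline spells the helper pgprot_noncached):

```c
/* Hypothetical character-device mmap handler: hand QEMU an uncached,
 * pinned mapping of pages the driver pre-allocated at load time. */
static int uncached_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct uncached_dev *dev = file->private_data;  /* assumed driver state */
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > dev->nr_pages << PAGE_SHIFT)
        return -EINVAL;

    /* Make the userspace mapping non-cacheable... */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    /* ...and mark the VMA so the kernel won't expand, dump or
     * otherwise touch the pages behind it. */
    vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;

    return remap_pfn_range(vma, vma->vm_start, dev->base_pfn,
                           size, vma->vm_page_prot);
}
```

Since the pages are remapped as PFNs and flagged VM_IO, they stay out of the normal page-reclaim and migration paths, which addresses the "kernel shouldn't touch these" concern below.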
> 
> I guess this is the same as the madvise approach, but with a driver.
> KVM could take this approach itself when memslots are added/updated
> with the INCOHERENT flag. Maybe worth some experimental patches to
> find out?

I would work on this but I'm tied up for the next 3 weeks.
If anyone is interested I can provide the base code; I used
it for memory passthrough, although testing may be time consuming.
I think the hurdle here is making sure the kernel doesn't remap
these pages for any reason (like page migration); locking the
pages should tell the kernel not to touch them. madvise() is the
desired solution, but I suspect it might take a while to get in.
> 
> I'm still thinking about experimenting with the ARM private syscalls
> next though.

Hope it succeeds.
> 
> drew
> 
> [*] http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

end of thread, other threads:[~2015-03-09 15:33 UTC | newest]

Thread overview: 110+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-19 10:54 [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings Ard Biesheuvel
2015-02-19 10:54 ` [RFC/RFT PATCH 1/3] arm64: KVM: handle some sysreg writes in EL2 Ard Biesheuvel
2015-03-03 17:59   ` Mario Smarduch
2015-02-19 10:54 ` [RFC/RFT PATCH 2/3] arm64: KVM: mangle MAIR register to prevent uncached guest mappings Ard Biesheuvel
2015-02-19 10:54 ` [RFC/RFT PATCH 3/3] arm64: KVM: keep trapping of VM sysreg writes enabled Ard Biesheuvel
2015-02-19 13:40   ` Marc Zyngier
2015-02-19 13:44     ` Ard Biesheuvel
2015-02-19 15:19       ` Marc Zyngier
2015-02-19 15:22         ` Ard Biesheuvel
2015-02-19 14:50 ` [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings Alexander Graf
2015-02-19 14:56   ` Ard Biesheuvel
2015-02-19 15:27     ` Alexander Graf
2015-02-19 15:31       ` Ard Biesheuvel
2015-02-19 16:57 ` Andrew Jones
2015-02-19 17:19   ` Ard Biesheuvel
2015-02-19 17:55     ` Andrew Jones
2015-02-19 17:57       ` Paolo Bonzini
2015-02-20 14:29         ` Andrew Jones
2015-02-20 14:37           ` Ard Biesheuvel
2015-02-20 15:36             ` Andrew Jones
2015-02-24 14:55               ` Andrew Jones
2015-02-24 17:47                 ` Ard Biesheuvel
2015-02-24 19:12                   ` Andrew Jones
2015-03-02 16:31                   ` Christoffer Dall
2015-03-02 16:47                     ` Paolo Bonzini
2015-03-02 16:55                       ` Laszlo Ersek
2015-03-02 17:05                         ` Andrew Jones
2015-03-02 16:48                     ` Andrew Jones
2015-03-03  2:20                     ` Mario Smarduch
2015-03-04 11:35                       ` Catalin Marinas
2015-03-04 11:50                         ` Ard Biesheuvel
2015-03-04 12:29                           ` Catalin Marinas
2015-03-04 12:43                             ` Ard Biesheuvel
2015-03-04 14:12                               ` Andrew Jones
2015-03-04 14:29                                 ` Catalin Marinas
2015-03-04 14:34                                   ` Peter Maydell
2015-03-04 17:03                                   ` Paolo Bonzini
2015-03-04 17:28                                     ` Catalin Marinas
2015-03-05 10:12                                       ` Paolo Bonzini
2015-03-05 11:04                                         ` Catalin Marinas
2015-03-05 11:52                                           ` Peter Maydell
2015-03-05 12:03                                             ` Catalin Marinas
2015-03-05 12:26                                               ` Paolo Bonzini
2015-03-05 14:58                                                 ` Catalin Marinas
2015-03-05 17:43                                                   ` Paolo Bonzini
2015-03-06 21:08                                                     ` Mario Smarduch
2015-03-09 14:26                                                       ` Andrew Jones
2015-03-09 15:33                                                         ` Mario Smarduch
2015-03-05 19:13                                                   ` Ard Biesheuvel
2015-03-06 20:33                         ` Mario Smarduch
2015-02-19 18:44       ` Ard Biesheuvel
2015-03-03 17:34 ` Alexander Graf
2015-03-03 18:13   ` Laszlo Ersek
2015-03-03 20:58     ` Andrew Jones
2015-03-03 18:32 ` Catalin Marinas
