* [PULL 00/14] ppc patch queue 2011-10-31
@ 2011-10-31  7:53 ` Alexander Graf
  0 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

Hi Avi / Marcelo,

This is my current patch queue for ppc. Please pull.

Alex


The following changes since commit b796a09c5d808f4013f27ad45953db604dac18fd:
  Marcelo Tosatti (1):
        Merge remote-tracking branch 'upstream/master' into kvm-devel

are available in the git repository at:

  git://github.com/agraf/linux-2.6.git kvm-ppc-next

Alexander Graf (7):
      KVM: PPC: Fix build failure with HV KVM and CBE
      Revert "KVM: PPC: Add support for explicit HIOR setting"
      KVM: PPC: Add generic single register ioctls
      KVM: PPC: Add support for explicit HIOR setting
      KVM: PPC: Whitespace fix for kvm.h
      KVM: Fix whitespace in kvm_para.h
      KVM: PPC: E500: Support hugetlbfs

Bharat Bhushan (1):
      PPC: Fix race in mtmsr paravirt implementation

Scott Wood (6):
      KVM: PPC: e500: don't translate gfn to pfn with preemption disabled
      KVM: PPC: e500: Eliminate preempt_disable in local_sid_destroy_all
      KVM: PPC: e500: clear up confusion between host and guest entries
      KVM: PPC: e500: MMU API
      KVM: PPC: e500: tlbsx: fix tlb0 esel
      KVM: PPC: e500: Don't hardcode PIR=0

 Documentation/virtual/kvm/api.txt     |  122 ++++++
 arch/powerpc/include/asm/kvm.h        |   49 ++-
 arch/powerpc/include/asm/kvm_book3s.h |    2 +-
 arch/powerpc/include/asm/kvm_e500.h   |   46 ++-
 arch/powerpc/include/asm/kvm_ppc.h    |    5 +
 arch/powerpc/include/asm/mmu-book3e.h |    1 +
 arch/powerpc/kernel/exceptions-64s.S  |    6 +-
 arch/powerpc/kernel/kvm_emul.S        |   10 +-
 arch/powerpc/kvm/book3s_pr.c          |   12 +-
 arch/powerpc/kvm/booke.c              |    4 +-
 arch/powerpc/kvm/e500.c               |    8 +-
 arch/powerpc/kvm/e500_emulate.c       |   12 +-
 arch/powerpc/kvm/e500_tlb.c           |  674 +++++++++++++++++++++++----------
 arch/powerpc/kvm/e500_tlb.h           |   55 +--
 arch/powerpc/kvm/powerpc.c            |   92 +++++
 include/linux/kvm.h                   |   50 +++
 include/linux/kvm_para.h              |    1 -
 17 files changed, 865 insertions(+), 284 deletions(-)


* [PATCH 01/14] KVM: PPC: e500: don't translate gfn to pfn with preemption disabled
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Scott Wood

From: Scott Wood <scottwood@freescale.com>

Delay allocation of the shadow pid until we're ready to disable
preemption and write the entry.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/kvm/e500_tlb.c |   36 +++++++++++++++++++++++-------------
 1 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 13c432e..22624a7 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -507,21 +507,16 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 	vcpu_e500->mas7 = 0;
 }
 
+/* TID must be supplied by the caller */
 static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 					   struct tlbe *gtlbe, int tsize,
 					   struct tlbe_priv *priv,
 					   u64 gvaddr, struct tlbe *stlbe)
 {
 	pfn_t pfn = priv->pfn;
-	unsigned int stid;
-
-	stid = kvmppc_e500_get_sid(vcpu_e500, get_tlb_ts(gtlbe),
-				   get_tlb_tid(gtlbe),
-				   get_cur_pr(&vcpu_e500->vcpu), 0);
 
 	/* Force TS=1 IPROT=0 for all guest mappings. */
-	stlbe->mas1 = MAS1_TSIZE(tsize)
-		| MAS1_TID(stid) | MAS1_TS | MAS1_VALID;
+	stlbe->mas1 = MAS1_TSIZE(tsize) | MAS1_TS | MAS1_VALID;
 	stlbe->mas2 = (gvaddr & MAS2_EPN)
 		| e500_shadow_mas2_attrib(gtlbe->mas2,
 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
@@ -816,6 +811,24 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 	return EMULATE_DONE;
 }
 
+/* sesel is index into the set, not the whole array */
+static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
+			struct tlbe *gtlbe,
+			struct tlbe *stlbe,
+			int stlbsel, int sesel)
+{
+	int stid;
+
+	preempt_disable();
+	stid = kvmppc_e500_get_sid(vcpu_e500, get_tlb_ts(gtlbe),
+				   get_tlb_tid(gtlbe),
+				   get_cur_pr(&vcpu_e500->vcpu), 0);
+
+	stlbe->mas1 |= MAS1_TID(stid);
+	write_host_tlbe(vcpu_e500, stlbsel, sesel, stlbe);
+	preempt_enable();
+}
+
 int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
@@ -845,7 +858,6 @@ int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 		u64 eaddr;
 		u64 raddr;
 
-		preempt_disable();
 		switch (tlbsel) {
 		case 0:
 			/* TLB0 */
@@ -874,8 +886,8 @@ int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 		default:
 			BUG();
 		}
-		write_host_tlbe(vcpu_e500, stlbsel, sesel, &stlbe);
-		preempt_enable();
+
+		write_stlbe(vcpu_e500, gtlbe, &stlbe, stlbsel, sesel);
 	}
 
 	kvmppc_set_exit_type(vcpu, EMULATED_TLBWE_EXITS);
@@ -937,7 +949,6 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
 
 	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
 
-	preempt_disable();
 	switch (tlbsel) {
 	case 0:
 		stlbsel = 0;
@@ -962,8 +973,7 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
 		break;
 	}
 
-	write_host_tlbe(vcpu_e500, stlbsel, sesel, &stlbe);
-	preempt_enable();
+	write_stlbe(vcpu_e500, gtlbe, &stlbe, stlbsel, sesel);
 }
 
 int kvmppc_e500_tlb_search(struct kvm_vcpu *vcpu,
-- 
1.6.0.2


* [PATCH 02/14] KVM: PPC: e500: Eliminate preempt_disable in local_sid_destroy_all
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Scott Wood

From: Scott Wood <scottwood@freescale.com>

The only place it makes sense to call this function already needs
to have preemption disabled.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/kvm/e500_tlb.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 22624a7..b976d80 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -116,13 +116,11 @@ static inline int local_sid_lookup(struct id *entry)
 	return -1;
 }
 
-/* Invalidate all id mappings on local core */
+/* Invalidate all id mappings on local core -- call with preempt disabled */
 static inline void local_sid_destroy_all(void)
 {
-	preempt_disable();
 	__get_cpu_var(pcpu_last_used_sid) = 0;
 	memset(&__get_cpu_var(pcpu_sids), 0, sizeof(__get_cpu_var(pcpu_sids)));
-	preempt_enable();
 }
 
 static void *kvmppc_e500_id_table_alloc(struct kvmppc_vcpu_e500 *vcpu_e500)
-- 
1.6.0.2



* [PATCH 03/14] KVM: PPC: e500: clear up confusion between host and guest entries
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Scott Wood

From: Scott Wood <scottwood@freescale.com>

Split out the portions of tlbe_priv that should be associated with host
entries into tlbe_ref.  Base victim selection on the number of hardware
entries, not guest entries.

For TLB1, where one guest entry can be mapped by multiple host entries,
we use the host tlbe_ref for tracking page references.  For the guest
TLB0 entries, we still track it with gtlb_priv, to avoid having to
retranslate if the entry is evicted from the host TLB but not the
guest TLB.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/include/asm/kvm_e500.h   |   24 +++-
 arch/powerpc/include/asm/mmu-book3e.h |    1 +
 arch/powerpc/kvm/e500_tlb.c           |  267 +++++++++++++++++++++++----------
 arch/powerpc/kvm/e500_tlb.h           |   17 --
 4 files changed, 213 insertions(+), 96 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_e500.h b/arch/powerpc/include/asm/kvm_e500.h
index adbfca9..a5197d8 100644
--- a/arch/powerpc/include/asm/kvm_e500.h
+++ b/arch/powerpc/include/asm/kvm_e500.h
@@ -32,13 +32,21 @@ struct tlbe{
 #define E500_TLB_VALID 1
 #define E500_TLB_DIRTY 2
 
-struct tlbe_priv {
+struct tlbe_ref {
 	pfn_t pfn;
 	unsigned int flags; /* E500_TLB_* */
 };
 
+struct tlbe_priv {
+	struct tlbe_ref ref; /* TLB0 only -- TLB1 uses tlb_refs */
+};
+
 struct vcpu_id_table;
 
+struct kvmppc_e500_tlb_params {
+	int entries, ways, sets;
+};
+
 struct kvmppc_vcpu_e500 {
 	/* Unmodified copy of the guest's TLB. */
 	struct tlbe *gtlb_arch[E500_TLB_NUM];
@@ -49,6 +57,20 @@ struct kvmppc_vcpu_e500 {
 	unsigned int gtlb_size[E500_TLB_NUM];
 	unsigned int gtlb_nv[E500_TLB_NUM];
 
+	/*
+	 * information associated with each host TLB entry --
+	 * TLB1 only for now.  If/when guest TLB1 entries can be
+	 * mapped with host TLB0, this will be used for that too.
+	 *
+	 * We don't want to use this for guest TLB0 because then we'd
+	 * have the overhead of doing the translation again even if
+	 * the entry is still in the guest TLB (e.g. we swapped out
+	 * and back, and our host TLB entries got evicted).
+	 */
+	struct tlbe_ref *tlb_refs[E500_TLB_NUM];
+
+	unsigned int host_tlb1_nv;
+
 	u32 host_pid[E500_PID_NUM];
 	u32 pid[E500_PID_NUM];
 	u32 svr;
diff --git a/arch/powerpc/include/asm/mmu-book3e.h b/arch/powerpc/include/asm/mmu-book3e.h
index 3ea0f9a..4c30de3 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -165,6 +165,7 @@
 #define TLBnCFG_MAXSIZE		0x000f0000	/* Maximum Page Size (v1.0) */
 #define TLBnCFG_MAXSIZE_SHIFT	16
 #define TLBnCFG_ASSOC		0xff000000	/* Associativity */
+#define TLBnCFG_ASSOC_SHIFT	24
 
 /* TLBnPS encoding */
 #define TLBnPS_4K		0x00000004
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index b976d80..59221bb 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -12,6 +12,7 @@
  * published by the Free Software Foundation.
  */
 
+#include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/string.h>
@@ -26,7 +27,7 @@
 #include "trace.h"
 #include "timing.h"
 
-#define to_htlb1_esel(esel) (tlb1_entry_num - (esel) - 1)
+#define to_htlb1_esel(esel) (host_tlb_params[1].entries - (esel) - 1)
 
 struct id {
 	unsigned long val;
@@ -63,7 +64,7 @@ static DEFINE_PER_CPU(struct pcpu_id_table, pcpu_sids);
  * The valid range of shadow ID is [1..255] */
 static DEFINE_PER_CPU(unsigned long, pcpu_last_used_sid);
 
-static unsigned int tlb1_entry_num;
+static struct kvmppc_e500_tlb_params host_tlb_params[E500_TLB_NUM];
 
 /*
  * Allocate a free shadow id and setup a valid sid mapping in given entry.
@@ -237,7 +238,7 @@ void kvmppc_dump_tlbs(struct kvm_vcpu *vcpu)
 	}
 }
 
-static inline unsigned int tlb0_get_next_victim(
+static inline unsigned int gtlb0_get_next_victim(
 		struct kvmppc_vcpu_e500 *vcpu_e500)
 {
 	unsigned int victim;
@@ -252,7 +253,7 @@ static inline unsigned int tlb0_get_next_victim(
 static inline unsigned int tlb1_max_shadow_size(void)
 {
 	/* reserve one entry for magic page */
-	return tlb1_entry_num - tlbcam_index - 1;
+	return host_tlb_params[1].entries - tlbcam_index - 1;
 }
 
 static inline int tlbe_is_writable(struct tlbe *tlbe)
@@ -302,13 +303,12 @@ static inline void __write_host_tlbe(struct tlbe *stlbe, uint32_t mas0)
 	local_irq_restore(flags);
 }
 
+/* esel is index into set, not whole array */
 static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 		int tlbsel, int esel, struct tlbe *stlbe)
 {
 	if (tlbsel == 0) {
-		__write_host_tlbe(stlbe,
-				  MAS0_TLBSEL(0) |
-				  MAS0_ESEL(esel & (KVM_E500_TLB0_WAY_NUM - 1)));
+		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(esel));
 	} else {
 		__write_host_tlbe(stlbe,
 				  MAS0_TLBSEL(1) |
@@ -355,8 +355,8 @@ void kvmppc_e500_tlb_put(struct kvm_vcpu *vcpu)
 {
 }
 
-static void kvmppc_e500_stlbe_invalidate(struct kvmppc_vcpu_e500 *vcpu_e500,
-					 int tlbsel, int esel)
+static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500,
+				int tlbsel, int esel)
 {
 	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
 	struct vcpu_id_table *idt = vcpu_e500->idt;
@@ -412,18 +412,53 @@ static void kvmppc_e500_stlbe_invalidate(struct kvmppc_vcpu_e500 *vcpu_e500,
 	preempt_enable();
 }
 
+static int tlb0_set_base(gva_t addr, int sets, int ways)
+{
+	int set_base;
+
+	set_base = (addr >> PAGE_SHIFT) & (sets - 1);
+	set_base *= ways;
+
+	return set_base;
+}
+
+static int gtlb0_set_base(struct kvmppc_vcpu_e500 *vcpu_e500, gva_t addr)
+{
+	int sets = KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
+
+	return tlb0_set_base(addr, sets, KVM_E500_TLB0_WAY_NUM);
+}
+
+static int htlb0_set_base(gva_t addr)
+{
+	return tlb0_set_base(addr, host_tlb_params[0].sets,
+			     host_tlb_params[0].ways);
+}
+
+static unsigned int get_tlb_esel(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel)
+{
+	unsigned int esel = get_tlb_esel_bit(vcpu_e500);
+
+	if (tlbsel == 0) {
+		esel &= KVM_E500_TLB0_WAY_NUM_MASK;
+		esel += gtlb0_set_base(vcpu_e500, vcpu_e500->mas2);
+	} else {
+		esel &= vcpu_e500->gtlb_size[tlbsel] - 1;
+	}
+
+	return esel;
+}
+
 /* Search the guest TLB for a matching entry. */
 static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 		gva_t eaddr, int tlbsel, unsigned int pid, int as)
 {
 	int size = vcpu_e500->gtlb_size[tlbsel];
-	int set_base;
+	unsigned int set_base;
 	int i;
 
 	if (tlbsel == 0) {
-		int mask = size / KVM_E500_TLB0_WAY_NUM - 1;
-		set_base = (eaddr >> PAGE_SHIFT) & mask;
-		set_base *= KVM_E500_TLB0_WAY_NUM;
+		set_base = gtlb0_set_base(vcpu_e500, eaddr);
 		size = KVM_E500_TLB0_WAY_NUM;
 	} else {
 		set_base = 0;
@@ -455,29 +490,55 @@ static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 	return -1;
 }
 
-static inline void kvmppc_e500_priv_setup(struct tlbe_priv *priv,
-					  struct tlbe *gtlbe,
-					  pfn_t pfn)
+static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
+					 struct tlbe *gtlbe,
+					 pfn_t pfn)
 {
-	priv->pfn = pfn;
-	priv->flags = E500_TLB_VALID;
+	ref->pfn = pfn;
+	ref->flags = E500_TLB_VALID;
 
 	if (tlbe_is_writable(gtlbe))
-		priv->flags |= E500_TLB_DIRTY;
+		ref->flags |= E500_TLB_DIRTY;
 }
 
-static inline void kvmppc_e500_priv_release(struct tlbe_priv *priv)
+static inline void kvmppc_e500_ref_release(struct tlbe_ref *ref)
 {
-	if (priv->flags & E500_TLB_VALID) {
-		if (priv->flags & E500_TLB_DIRTY)
-			kvm_release_pfn_dirty(priv->pfn);
+	if (ref->flags & E500_TLB_VALID) {
+		if (ref->flags & E500_TLB_DIRTY)
+			kvm_release_pfn_dirty(ref->pfn);
 		else
-			kvm_release_pfn_clean(priv->pfn);
+			kvm_release_pfn_clean(ref->pfn);
+
+		ref->flags = 0;
+	}
+}
+
+static void clear_tlb_privs(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+	int tlbsel = 0;
+	int i;
 
-		priv->flags = 0;
+	for (i = 0; i < vcpu_e500->gtlb_size[tlbsel]; i++) {
+		struct tlbe_ref *ref =
+			&vcpu_e500->gtlb_priv[tlbsel][i].ref;
+		kvmppc_e500_ref_release(ref);
 	}
 }
 
+static void clear_tlb_refs(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+	int stlbsel = 1;
+	int i;
+
+	for (i = 0; i < host_tlb_params[stlbsel].entries; i++) {
+		struct tlbe_ref *ref =
+			&vcpu_e500->tlb_refs[stlbsel][i];
+		kvmppc_e500_ref_release(ref);
+	}
+
+	clear_tlb_privs(vcpu_e500);
+}
+
 static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 		unsigned int eaddr, int as)
 {
@@ -487,7 +548,7 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 
 	/* since we only have two TLBs, only lower bit is used. */
 	tlbsel = (vcpu_e500->mas4 >> 28) & 0x1;
-	victim = (tlbsel == 0) ? tlb0_get_next_victim(vcpu_e500) : 0;
+	victim = (tlbsel == 0) ? gtlb0_get_next_victim(vcpu_e500) : 0;
 	pidsel = (vcpu_e500->mas4 >> 16) & 0xf;
 	tsized = (vcpu_e500->mas4 >> 7) & 0x1f;
 
@@ -508,10 +569,12 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 /* TID must be supplied by the caller */
 static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 					   struct tlbe *gtlbe, int tsize,
-					   struct tlbe_priv *priv,
+					   struct tlbe_ref *ref,
 					   u64 gvaddr, struct tlbe *stlbe)
 {
-	pfn_t pfn = priv->pfn;
+	pfn_t pfn = ref->pfn;
+
+	BUG_ON(!(ref->flags & E500_TLB_VALID));
 
 	/* Force TS=1 IPROT=0 for all guest mappings. */
 	stlbe->mas1 = MAS1_TSIZE(tsize) | MAS1_TS | MAS1_VALID;
@@ -524,16 +587,15 @@ static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 	stlbe->mas7 = (pfn >> (32 - PAGE_SHIFT)) & MAS7_RPN;
 }
 
-
+/* sesel is an index into the entire array, not just the set */
 static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-	u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, int tlbsel, int esel,
-	struct tlbe *stlbe)
+	u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, int tlbsel, int sesel,
+	struct tlbe *stlbe, struct tlbe_ref *ref)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long pfn, hva;
 	int pfnmap = 0;
 	int tsize = BOOK3E_PAGESZ_4K;
-	struct tlbe_priv *priv;
 
 	/*
 	 * Translate guest physical to true physical, acquiring
@@ -629,12 +691,11 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		}
 	}
 
-	/* Drop old priv and setup new one. */
-	priv = &vcpu_e500->gtlb_priv[tlbsel][esel];
-	kvmppc_e500_priv_release(priv);
-	kvmppc_e500_priv_setup(priv, gtlbe, pfn);
+	/* Drop old ref and setup new one. */
+	kvmppc_e500_ref_release(ref);
+	kvmppc_e500_ref_setup(ref, gtlbe, pfn);
 
-	kvmppc_e500_setup_stlbe(vcpu_e500, gtlbe, tsize, priv, gvaddr, stlbe);
+	kvmppc_e500_setup_stlbe(vcpu_e500, gtlbe, tsize, ref, gvaddr, stlbe);
 }
 
 /* XXX only map the one-one case, for now use TLB0 */
@@ -642,14 +703,22 @@ static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 				int esel, struct tlbe *stlbe)
 {
 	struct tlbe *gtlbe;
+	struct tlbe_ref *ref;
+	int sesel = esel & (host_tlb_params[0].ways - 1);
+	int sesel_base;
+	gva_t ea;
 
 	gtlbe = &vcpu_e500->gtlb_arch[0][esel];
+	ref = &vcpu_e500->gtlb_priv[0][esel].ref;
+
+	ea = get_tlb_eaddr(gtlbe);
+	sesel_base = htlb0_set_base(ea);
 
 	kvmppc_e500_shadow_map(vcpu_e500, get_tlb_eaddr(gtlbe),
 			get_tlb_raddr(gtlbe) >> PAGE_SHIFT,
-			gtlbe, 0, esel, stlbe);
+			gtlbe, 0, sesel_base + sesel, stlbe, ref);
 
-	return esel;
+	return sesel;
 }
 
 /* Caller must ensure that the specified guest TLB entry is safe to insert into
@@ -658,14 +727,17 @@ static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, struct tlbe *stlbe)
 {
+	struct tlbe_ref *ref;
 	unsigned int victim;
 
-	victim = vcpu_e500->gtlb_nv[1]++;
+	victim = vcpu_e500->host_tlb1_nv++;
 
-	if (unlikely(vcpu_e500->gtlb_nv[1] >= tlb1_max_shadow_size()))
-		vcpu_e500->gtlb_nv[1] = 0;
+	if (unlikely(vcpu_e500->host_tlb1_nv >= tlb1_max_shadow_size()))
+		vcpu_e500->host_tlb1_nv = 0;
 
-	kvmppc_e500_shadow_map(vcpu_e500, gvaddr, gfn, gtlbe, 1, victim, stlbe);
+	ref = &vcpu_e500->tlb_refs[1][victim];
+	kvmppc_e500_shadow_map(vcpu_e500, gvaddr, gfn, gtlbe, 1,
+			       victim, stlbe, ref);
 
 	return victim;
 }
@@ -792,7 +864,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 
 		/* since we only have two TLBs, only lower bit is used. */
 		tlbsel = vcpu_e500->mas4 >> 28 & 0x1;
-		victim = (tlbsel == 0) ? tlb0_get_next_victim(vcpu_e500) : 0;
+		victim = (tlbsel == 0) ? gtlb0_get_next_victim(vcpu_e500) : 0;
 
 		vcpu_e500->mas0 = MAS0_TLBSEL(tlbsel) | MAS0_ESEL(victim)
 			| MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
@@ -839,7 +911,7 @@ int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
 
 	if (get_tlb_v(gtlbe))
-		kvmppc_e500_stlbe_invalidate(vcpu_e500, tlbsel, esel);
+		inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
 
 	gtlbe->mas1 = vcpu_e500->mas1;
 	gtlbe->mas2 = vcpu_e500->mas2;
@@ -950,11 +1022,11 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
 	switch (tlbsel) {
 	case 0:
 		stlbsel = 0;
-		sesel = esel;
-		priv = &vcpu_e500->gtlb_priv[stlbsel][sesel];
+		sesel = esel & (host_tlb_params[0].ways - 1);
+		priv = &vcpu_e500->gtlb_priv[tlbsel][esel];
 
 		kvmppc_e500_setup_stlbe(vcpu_e500, gtlbe, BOOK3E_PAGESZ_4K,
-					priv, eaddr, &stlbe);
+					&priv->ref, eaddr, &stlbe);
 		break;
 
 	case 1: {
@@ -1020,32 +1092,76 @@ void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *vcpu_e500)
 
 int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	tlb1_entry_num = mfspr(SPRN_TLB1CFG) & 0xFFF;
+	host_tlb_params[0].entries = mfspr(SPRN_TLB0CFG) & TLBnCFG_N_ENTRY;
+	host_tlb_params[1].entries = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY;
+
+	/*
+	 * This should never happen on real e500 hardware, but is
+	 * architecturally possible -- e.g. in some weird nested
+	 * virtualization case.
+	 */
+	if (host_tlb_params[0].entries == 0 ||
+	    host_tlb_params[1].entries == 0) {
+		pr_err("%s: need to know host tlb size\n", __func__);
+		return -ENODEV;
+	}
+
+	host_tlb_params[0].ways = (mfspr(SPRN_TLB0CFG) & TLBnCFG_ASSOC) >>
+				  TLBnCFG_ASSOC_SHIFT;
+	host_tlb_params[1].ways = host_tlb_params[1].entries;
+
+	if (!is_power_of_2(host_tlb_params[0].entries) ||
+	    !is_power_of_2(host_tlb_params[0].ways) ||
+	    host_tlb_params[0].entries < host_tlb_params[0].ways ||
+	    host_tlb_params[0].ways == 0) {
+		pr_err("%s: bad tlb0 host config: %u entries %u ways\n",
+		       __func__, host_tlb_params[0].entries,
+		       host_tlb_params[0].ways);
+		return -ENODEV;
+	}
+
+	host_tlb_params[0].sets =
+		host_tlb_params[0].entries / host_tlb_params[0].ways;
+	host_tlb_params[1].sets = 1;
 
 	vcpu_e500->gtlb_size[0] = KVM_E500_TLB0_SIZE;
 	vcpu_e500->gtlb_arch[0] =
 		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB0_SIZE, GFP_KERNEL);
 	if (vcpu_e500->gtlb_arch[0] == NULL)
-		goto err_out;
+		goto err;
 
 	vcpu_e500->gtlb_size[1] = KVM_E500_TLB1_SIZE;
 	vcpu_e500->gtlb_arch[1] =
 		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB1_SIZE, GFP_KERNEL);
 	if (vcpu_e500->gtlb_arch[1] == NULL)
-		goto err_out_guest0;
-
-	vcpu_e500->gtlb_priv[0] = (struct tlbe_priv *)
-		kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB0_SIZE, GFP_KERNEL);
-	if (vcpu_e500->gtlb_priv[0] == NULL)
-		goto err_out_guest1;
-	vcpu_e500->gtlb_priv[1] = (struct tlbe_priv *)
-		kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB1_SIZE, GFP_KERNEL);
-
-	if (vcpu_e500->gtlb_priv[1] == NULL)
-		goto err_out_priv0;
+		goto err;
+
+	vcpu_e500->tlb_refs[0] =
+		kzalloc(sizeof(struct tlbe_ref) * host_tlb_params[0].entries,
+			GFP_KERNEL);
+	if (!vcpu_e500->tlb_refs[0])
+		goto err;
+
+	vcpu_e500->tlb_refs[1] =
+		kzalloc(sizeof(struct tlbe_ref) * host_tlb_params[1].entries,
+			GFP_KERNEL);
+	if (!vcpu_e500->tlb_refs[1])
+		goto err;
+
+	vcpu_e500->gtlb_priv[0] =
+		kzalloc(sizeof(struct tlbe_ref) * vcpu_e500->gtlb_size[0],
+			GFP_KERNEL);
+	if (!vcpu_e500->gtlb_priv[0])
+		goto err;
+
+	vcpu_e500->gtlb_priv[1] =
+		kzalloc(sizeof(struct tlbe_ref) * vcpu_e500->gtlb_size[1],
+			GFP_KERNEL);
+	if (!vcpu_e500->gtlb_priv[1])
+		goto err;
 
 	if (kvmppc_e500_id_table_alloc(vcpu_e500) == NULL)
-		goto err_out_priv1;
+		goto err;
 
 	/* Init TLB configuration register */
 	vcpu_e500->tlb0cfg = mfspr(SPRN_TLB0CFG) & ~0xfffUL;
@@ -1055,31 +1171,26 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 
 	return 0;
 
-err_out_priv1:
-	kfree(vcpu_e500->gtlb_priv[1]);
-err_out_priv0:
+err:
+	kfree(vcpu_e500->tlb_refs[0]);
+	kfree(vcpu_e500->tlb_refs[1]);
 	kfree(vcpu_e500->gtlb_priv[0]);
-err_out_guest1:
-	kfree(vcpu_e500->gtlb_arch[1]);
-err_out_guest0:
+	kfree(vcpu_e500->gtlb_priv[1]);
 	kfree(vcpu_e500->gtlb_arch[0]);
-err_out:
+	kfree(vcpu_e500->gtlb_arch[1]);
 	return -1;
 }
 
 void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	int stlbsel, i;
-
-	/* release all privs */
-	for (stlbsel = 0; stlbsel < 2; stlbsel++)
-		for (i = 0; i < vcpu_e500->gtlb_size[stlbsel]; i++) {
-			struct tlbe_priv *priv =
-				&vcpu_e500->gtlb_priv[stlbsel][i];
-			kvmppc_e500_priv_release(priv);
-		}
+	clear_tlb_refs(vcpu_e500);
 
 	kvmppc_e500_id_table_free(vcpu_e500);
+
+	kfree(vcpu_e500->tlb_refs[0]);
+	kfree(vcpu_e500->tlb_refs[1]);
+	kfree(vcpu_e500->gtlb_priv[0]);
+	kfree(vcpu_e500->gtlb_priv[1]);
 	kfree(vcpu_e500->gtlb_arch[1]);
 	kfree(vcpu_e500->gtlb_arch[0]);
 }
diff --git a/arch/powerpc/kvm/e500_tlb.h b/arch/powerpc/kvm/e500_tlb.h
index 59b88e9..b587f69 100644
--- a/arch/powerpc/kvm/e500_tlb.h
+++ b/arch/powerpc/kvm/e500_tlb.h
@@ -155,23 +155,6 @@ static inline unsigned int get_tlb_esel_bit(
 	return (vcpu_e500->mas0 >> 16) & 0xfff;
 }
 
-static inline unsigned int get_tlb_esel(
-		const struct kvmppc_vcpu_e500 *vcpu_e500,
-		int tlbsel)
-{
-	unsigned int esel = get_tlb_esel_bit(vcpu_e500);
-
-	if (tlbsel == 0) {
-		esel &= KVM_E500_TLB0_WAY_NUM_MASK;
-		esel |= ((vcpu_e500->mas2 >> 12) & KVM_E500_TLB0_WAY_SIZE_MASK)
-				<< KVM_E500_TLB0_WAY_NUM_BIT;
-	} else {
-		esel &= KVM_E500_TLB1_SIZE - 1;
-	}
-
-	return esel;
-}
-
 static inline int tlbe_is_host_safe(const struct kvm_vcpu *vcpu,
 			const struct tlbe *tlbe)
 {
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 03/14] KVM: PPC: e500: clear up confusion between host and guest entries
@ 2011-10-31  7:53   ` Alexander Graf
  0 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Scott Wood

From: Scott Wood <scottwood@freescale.com>

Split out the portions of tlbe_priv that should be associated with host
entries into tlbe_ref.  Base victim selection on the number of hardware
entries, not guest entries.

For TLB1, where one guest entry can be mapped by multiple host entries,
we use the host tlbe_ref for tracking page references.  For the guest
TLB0 entries, we still track it with gtlb_priv, to avoid having to
retranslate if the entry is evicted from the host TLB but not the
guest TLB.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/include/asm/kvm_e500.h   |   24 +++-
 arch/powerpc/include/asm/mmu-book3e.h |    1 +
 arch/powerpc/kvm/e500_tlb.c           |  267 +++++++++++++++++++++++----------
 arch/powerpc/kvm/e500_tlb.h           |   17 --
 4 files changed, 213 insertions(+), 96 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_e500.h b/arch/powerpc/include/asm/kvm_e500.h
index adbfca9..a5197d8 100644
--- a/arch/powerpc/include/asm/kvm_e500.h
+++ b/arch/powerpc/include/asm/kvm_e500.h
@@ -32,13 +32,21 @@ struct tlbe{
 #define E500_TLB_VALID 1
 #define E500_TLB_DIRTY 2
 
-struct tlbe_priv {
+struct tlbe_ref {
 	pfn_t pfn;
 	unsigned int flags; /* E500_TLB_* */
 };
 
+struct tlbe_priv {
+	struct tlbe_ref ref; /* TLB0 only -- TLB1 uses tlb_refs */
+};
+
 struct vcpu_id_table;
 
+struct kvmppc_e500_tlb_params {
+	int entries, ways, sets;
+};
+
 struct kvmppc_vcpu_e500 {
 	/* Unmodified copy of the guest's TLB. */
 	struct tlbe *gtlb_arch[E500_TLB_NUM];
@@ -49,6 +57,20 @@ struct kvmppc_vcpu_e500 {
 	unsigned int gtlb_size[E500_TLB_NUM];
 	unsigned int gtlb_nv[E500_TLB_NUM];
 
+	/*
+	 * information associated with each host TLB entry --
+	 * TLB1 only for now.  If/when guest TLB1 entries can be
+	 * mapped with host TLB0, this will be used for that too.
+	 *
+	 * We don't want to use this for guest TLB0 because then we'd
+	 * have the overhead of doing the translation again even if
+	 * the entry is still in the guest TLB (e.g. we swapped out
+	 * and back, and our host TLB entries got evicted).
+	 */
+	struct tlbe_ref *tlb_refs[E500_TLB_NUM];
+
+	unsigned int host_tlb1_nv;
+
 	u32 host_pid[E500_PID_NUM];
 	u32 pid[E500_PID_NUM];
 	u32 svr;
diff --git a/arch/powerpc/include/asm/mmu-book3e.h b/arch/powerpc/include/asm/mmu-book3e.h
index 3ea0f9a..4c30de3 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -165,6 +165,7 @@
 #define TLBnCFG_MAXSIZE		0x000f0000	/* Maximum Page Size (v1.0) */
 #define TLBnCFG_MAXSIZE_SHIFT	16
 #define TLBnCFG_ASSOC		0xff000000	/* Associativity */
+#define TLBnCFG_ASSOC_SHIFT	24
 
 /* TLBnPS encoding */
 #define TLBnPS_4K		0x00000004
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index b976d80..59221bb 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -12,6 +12,7 @@
  * published by the Free Software Foundation.
  */
 
+#include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/string.h>
@@ -26,7 +27,7 @@
 #include "trace.h"
 #include "timing.h"
 
-#define to_htlb1_esel(esel) (tlb1_entry_num - (esel) - 1)
+#define to_htlb1_esel(esel) (host_tlb_params[1].entries - (esel) - 1)
 
 struct id {
 	unsigned long val;
@@ -63,7 +64,7 @@ static DEFINE_PER_CPU(struct pcpu_id_table, pcpu_sids);
  * The valid range of shadow ID is [1..255] */
 static DEFINE_PER_CPU(unsigned long, pcpu_last_used_sid);
 
-static unsigned int tlb1_entry_num;
+static struct kvmppc_e500_tlb_params host_tlb_params[E500_TLB_NUM];
 
 /*
  * Allocate a free shadow id and setup a valid sid mapping in given entry.
@@ -237,7 +238,7 @@ void kvmppc_dump_tlbs(struct kvm_vcpu *vcpu)
 	}
 }
 
-static inline unsigned int tlb0_get_next_victim(
+static inline unsigned int gtlb0_get_next_victim(
 		struct kvmppc_vcpu_e500 *vcpu_e500)
 {
 	unsigned int victim;
@@ -252,7 +253,7 @@ static inline unsigned int tlb0_get_next_victim(
 static inline unsigned int tlb1_max_shadow_size(void)
 {
 	/* reserve one entry for magic page */
-	return tlb1_entry_num - tlbcam_index - 1;
+	return host_tlb_params[1].entries - tlbcam_index - 1;
 }
 
 static inline int tlbe_is_writable(struct tlbe *tlbe)
@@ -302,13 +303,12 @@ static inline void __write_host_tlbe(struct tlbe *stlbe, uint32_t mas0)
 	local_irq_restore(flags);
 }
 
+/* esel is index into set, not whole array */
 static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 		int tlbsel, int esel, struct tlbe *stlbe)
 {
 	if (tlbsel == 0) {
-		__write_host_tlbe(stlbe,
-				  MAS0_TLBSEL(0) |
-				  MAS0_ESEL(esel & (KVM_E500_TLB0_WAY_NUM - 1)));
+		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(esel));
 	} else {
 		__write_host_tlbe(stlbe,
 				  MAS0_TLBSEL(1) |
@@ -355,8 +355,8 @@ void kvmppc_e500_tlb_put(struct kvm_vcpu *vcpu)
 {
 }
 
-static void kvmppc_e500_stlbe_invalidate(struct kvmppc_vcpu_e500 *vcpu_e500,
-					 int tlbsel, int esel)
+static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500,
+				int tlbsel, int esel)
 {
 	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
 	struct vcpu_id_table *idt = vcpu_e500->idt;
@@ -412,18 +412,53 @@ static void kvmppc_e500_stlbe_invalidate(struct kvmppc_vcpu_e500 *vcpu_e500,
 	preempt_enable();
 }
 
+static int tlb0_set_base(gva_t addr, int sets, int ways)
+{
+	int set_base;
+
+	set_base = (addr >> PAGE_SHIFT) & (sets - 1);
+	set_base *= ways;
+
+	return set_base;
+}
+
+static int gtlb0_set_base(struct kvmppc_vcpu_e500 *vcpu_e500, gva_t addr)
+{
+	int sets = KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
+
+	return tlb0_set_base(addr, sets, KVM_E500_TLB0_WAY_NUM);
+}
+
+static int htlb0_set_base(gva_t addr)
+{
+	return tlb0_set_base(addr, host_tlb_params[0].sets,
+			     host_tlb_params[0].ways);
+}
+
+static unsigned int get_tlb_esel(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel)
+{
+	unsigned int esel = get_tlb_esel_bit(vcpu_e500);
+
+	if (tlbsel == 0) {
+		esel &= KVM_E500_TLB0_WAY_NUM_MASK;
+		esel += gtlb0_set_base(vcpu_e500, vcpu_e500->mas2);
+	} else {
+		esel &= vcpu_e500->gtlb_size[tlbsel] - 1;
+	}
+
+	return esel;
+}
+
 /* Search the guest TLB for a matching entry. */
 static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 		gva_t eaddr, int tlbsel, unsigned int pid, int as)
 {
 	int size = vcpu_e500->gtlb_size[tlbsel];
-	int set_base;
+	unsigned int set_base;
 	int i;
 
 	if (tlbsel == 0) {
-		int mask = size / KVM_E500_TLB0_WAY_NUM - 1;
-		set_base = (eaddr >> PAGE_SHIFT) & mask;
-		set_base *= KVM_E500_TLB0_WAY_NUM;
+		set_base = gtlb0_set_base(vcpu_e500, eaddr);
 		size = KVM_E500_TLB0_WAY_NUM;
 	} else {
 		set_base = 0;
@@ -455,29 +490,55 @@ static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 	return -1;
 }
 
-static inline void kvmppc_e500_priv_setup(struct tlbe_priv *priv,
-					  struct tlbe *gtlbe,
-					  pfn_t pfn)
+static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
+					 struct tlbe *gtlbe,
+					 pfn_t pfn)
 {
-	priv->pfn = pfn;
-	priv->flags = E500_TLB_VALID;
+	ref->pfn = pfn;
+	ref->flags = E500_TLB_VALID;
 
 	if (tlbe_is_writable(gtlbe))
-		priv->flags |= E500_TLB_DIRTY;
+		ref->flags |= E500_TLB_DIRTY;
 }
 
-static inline void kvmppc_e500_priv_release(struct tlbe_priv *priv)
+static inline void kvmppc_e500_ref_release(struct tlbe_ref *ref)
 {
-	if (priv->flags & E500_TLB_VALID) {
-		if (priv->flags & E500_TLB_DIRTY)
-			kvm_release_pfn_dirty(priv->pfn);
+	if (ref->flags & E500_TLB_VALID) {
+		if (ref->flags & E500_TLB_DIRTY)
+			kvm_release_pfn_dirty(ref->pfn);
 		else
-			kvm_release_pfn_clean(priv->pfn);
+			kvm_release_pfn_clean(ref->pfn);
+
+		ref->flags = 0;
+	}
+}
+
+static void clear_tlb_privs(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+	int tlbsel = 0;
+	int i;
 
-		priv->flags = 0;
+	for (i = 0; i < vcpu_e500->gtlb_size[tlbsel]; i++) {
+		struct tlbe_ref *ref =
+			&vcpu_e500->gtlb_priv[tlbsel][i].ref;
+		kvmppc_e500_ref_release(ref);
 	}
 }
 
+static void clear_tlb_refs(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+	int stlbsel = 1;
+	int i;
+
+	for (i = 0; i < host_tlb_params[stlbsel].entries; i++) {
+		struct tlbe_ref *ref =
+			&vcpu_e500->tlb_refs[stlbsel][i];
+		kvmppc_e500_ref_release(ref);
+	}
+
+	clear_tlb_privs(vcpu_e500);
+}
+
 static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 		unsigned int eaddr, int as)
 {
@@ -487,7 +548,7 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 
 	/* since we only have two TLBs, only lower bit is used. */
 	tlbsel = (vcpu_e500->mas4 >> 28) & 0x1;
-	victim = (tlbsel == 0) ? tlb0_get_next_victim(vcpu_e500) : 0;
+	victim = (tlbsel == 0) ? gtlb0_get_next_victim(vcpu_e500) : 0;
 	pidsel = (vcpu_e500->mas4 >> 16) & 0xf;
 	tsized = (vcpu_e500->mas4 >> 7) & 0x1f;
 
@@ -508,10 +569,12 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 /* TID must be supplied by the caller */
 static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 					   struct tlbe *gtlbe, int tsize,
-					   struct tlbe_priv *priv,
+					   struct tlbe_ref *ref,
 					   u64 gvaddr, struct tlbe *stlbe)
 {
-	pfn_t pfn = priv->pfn;
+	pfn_t pfn = ref->pfn;
+
+	BUG_ON(!(ref->flags & E500_TLB_VALID));
 
 	/* Force TS=1 IPROT=0 for all guest mappings. */
 	stlbe->mas1 = MAS1_TSIZE(tsize) | MAS1_TS | MAS1_VALID;
@@ -524,16 +587,15 @@ static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 	stlbe->mas7 = (pfn >> (32 - PAGE_SHIFT)) & MAS7_RPN;
 }
 
-
+/* sesel is an index into the entire array, not just the set */
 static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-	u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, int tlbsel, int esel,
-	struct tlbe *stlbe)
+	u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, int tlbsel, int sesel,
+	struct tlbe *stlbe, struct tlbe_ref *ref)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long pfn, hva;
 	int pfnmap = 0;
 	int tsize = BOOK3E_PAGESZ_4K;
-	struct tlbe_priv *priv;
 
 	/*
 	 * Translate guest physical to true physical, acquiring
@@ -629,12 +691,11 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		}
 	}
 
-	/* Drop old priv and setup new one. */
-	priv = &vcpu_e500->gtlb_priv[tlbsel][esel];
-	kvmppc_e500_priv_release(priv);
-	kvmppc_e500_priv_setup(priv, gtlbe, pfn);
+	/* Drop old ref and setup new one. */
+	kvmppc_e500_ref_release(ref);
+	kvmppc_e500_ref_setup(ref, gtlbe, pfn);
 
-	kvmppc_e500_setup_stlbe(vcpu_e500, gtlbe, tsize, priv, gvaddr, stlbe);
+	kvmppc_e500_setup_stlbe(vcpu_e500, gtlbe, tsize, ref, gvaddr, stlbe);
 }
 
 /* XXX only map the one-one case, for now use TLB0 */
@@ -642,14 +703,22 @@ static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 				int esel, struct tlbe *stlbe)
 {
 	struct tlbe *gtlbe;
+	struct tlbe_ref *ref;
+	int sesel = esel & (host_tlb_params[0].ways - 1);
+	int sesel_base;
+	gva_t ea;
 
 	gtlbe = &vcpu_e500->gtlb_arch[0][esel];
+	ref = &vcpu_e500->gtlb_priv[0][esel].ref;
+
+	ea = get_tlb_eaddr(gtlbe);
+	sesel_base = htlb0_set_base(ea);
 
 	kvmppc_e500_shadow_map(vcpu_e500, get_tlb_eaddr(gtlbe),
 			get_tlb_raddr(gtlbe) >> PAGE_SHIFT,
-			gtlbe, 0, esel, stlbe);
+			gtlbe, 0, sesel_base + sesel, stlbe, ref);
 
-	return esel;
+	return sesel;
 }
 
 /* Caller must ensure that the specified guest TLB entry is safe to insert into
@@ -658,14 +727,17 @@ static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, struct tlbe *stlbe)
 {
+	struct tlbe_ref *ref;
 	unsigned int victim;
 
-	victim = vcpu_e500->gtlb_nv[1]++;
+	victim = vcpu_e500->host_tlb1_nv++;
 
-	if (unlikely(vcpu_e500->gtlb_nv[1] >= tlb1_max_shadow_size()))
-		vcpu_e500->gtlb_nv[1] = 0;
+	if (unlikely(vcpu_e500->host_tlb1_nv >= tlb1_max_shadow_size()))
+		vcpu_e500->host_tlb1_nv = 0;
 
-	kvmppc_e500_shadow_map(vcpu_e500, gvaddr, gfn, gtlbe, 1, victim, stlbe);
+	ref = &vcpu_e500->tlb_refs[1][victim];
+	kvmppc_e500_shadow_map(vcpu_e500, gvaddr, gfn, gtlbe, 1,
+			       victim, stlbe, ref);
 
 	return victim;
 }
@@ -792,7 +864,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 
 		/* since we only have two TLBs, only lower bit is used. */
 		tlbsel = vcpu_e500->mas4 >> 28 & 0x1;
-		victim = (tlbsel == 0) ? tlb0_get_next_victim(vcpu_e500) : 0;
+		victim = (tlbsel == 0) ? gtlb0_get_next_victim(vcpu_e500) : 0;
 
 		vcpu_e500->mas0 = MAS0_TLBSEL(tlbsel) | MAS0_ESEL(victim)
 			| MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
@@ -839,7 +911,7 @@ int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
 
 	if (get_tlb_v(gtlbe))
-		kvmppc_e500_stlbe_invalidate(vcpu_e500, tlbsel, esel);
+		inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
 
 	gtlbe->mas1 = vcpu_e500->mas1;
 	gtlbe->mas2 = vcpu_e500->mas2;
@@ -950,11 +1022,11 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
 	switch (tlbsel) {
 	case 0:
 		stlbsel = 0;
-		sesel = esel;
-		priv = &vcpu_e500->gtlb_priv[stlbsel][sesel];
+		sesel = esel & (host_tlb_params[0].ways - 1);
+		priv = &vcpu_e500->gtlb_priv[tlbsel][esel];
 
 		kvmppc_e500_setup_stlbe(vcpu_e500, gtlbe, BOOK3E_PAGESZ_4K,
-					priv, eaddr, &stlbe);
+					&priv->ref, eaddr, &stlbe);
 		break;
 
 	case 1: {
@@ -1020,32 +1092,76 @@ void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *vcpu_e500)
 
 int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	tlb1_entry_num = mfspr(SPRN_TLB1CFG) & 0xFFF;
+	host_tlb_params[0].entries = mfspr(SPRN_TLB0CFG) & TLBnCFG_N_ENTRY;
+	host_tlb_params[1].entries = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY;
+
+	/*
+	 * This should never happen on real e500 hardware, but is
+	 * architecturally possible -- e.g. in some weird nested
+	 * virtualization case.
+	 */
+	if (host_tlb_params[0].entries == 0 ||
+	    host_tlb_params[1].entries == 0) {
+		pr_err("%s: need to know host tlb size\n", __func__);
+		return -ENODEV;
+	}
+
+	host_tlb_params[0].ways = (mfspr(SPRN_TLB0CFG) & TLBnCFG_ASSOC) >>
+				  TLBnCFG_ASSOC_SHIFT;
+	host_tlb_params[1].ways = host_tlb_params[1].entries;
+
+	if (!is_power_of_2(host_tlb_params[0].entries) ||
+	    !is_power_of_2(host_tlb_params[0].ways) ||
+	    host_tlb_params[0].entries < host_tlb_params[0].ways ||
+	    host_tlb_params[0].ways == 0) {
+		pr_err("%s: bad tlb0 host config: %u entries %u ways\n",
+		       __func__, host_tlb_params[0].entries,
+		       host_tlb_params[0].ways);
+		return -ENODEV;
+	}
+
+	host_tlb_params[0].sets =
+		host_tlb_params[0].entries / host_tlb_params[0].ways;
+	host_tlb_params[1].sets = 1;
 
 	vcpu_e500->gtlb_size[0] = KVM_E500_TLB0_SIZE;
 	vcpu_e500->gtlb_arch[0] =
 		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB0_SIZE, GFP_KERNEL);
 	if (vcpu_e500->gtlb_arch[0] == NULL)
-		goto err_out;
+		goto err;
 
 	vcpu_e500->gtlb_size[1] = KVM_E500_TLB1_SIZE;
 	vcpu_e500->gtlb_arch[1] =
 		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB1_SIZE, GFP_KERNEL);
 	if (vcpu_e500->gtlb_arch[1] == NULL)
-		goto err_out_guest0;
-
-	vcpu_e500->gtlb_priv[0] = (struct tlbe_priv *)
-		kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB0_SIZE, GFP_KERNEL);
-	if (vcpu_e500->gtlb_priv[0] == NULL)
-		goto err_out_guest1;
-	vcpu_e500->gtlb_priv[1] = (struct tlbe_priv *)
-		kzalloc(sizeof(struct tlbe_priv) * KVM_E500_TLB1_SIZE, GFP_KERNEL);
-
-	if (vcpu_e500->gtlb_priv[1] == NULL)
-		goto err_out_priv0;
+		goto err;
+
+	vcpu_e500->tlb_refs[0] =
+		kzalloc(sizeof(struct tlbe_ref) * host_tlb_params[0].entries,
+			GFP_KERNEL);
+	if (!vcpu_e500->tlb_refs[0])
+		goto err;
+
+	vcpu_e500->tlb_refs[1] =
+		kzalloc(sizeof(struct tlbe_ref) * host_tlb_params[1].entries,
+			GFP_KERNEL);
+	if (!vcpu_e500->tlb_refs[1])
+		goto err;
+
+	vcpu_e500->gtlb_priv[0] =
+		kzalloc(sizeof(struct tlbe_ref) * vcpu_e500->gtlb_size[0],
+			GFP_KERNEL);
+	if (!vcpu_e500->gtlb_priv[0])
+		goto err;
+
+	vcpu_e500->gtlb_priv[1] =
+		kzalloc(sizeof(struct tlbe_ref) * vcpu_e500->gtlb_size[1],
+			GFP_KERNEL);
+	if (!vcpu_e500->gtlb_priv[1])
+		goto err;
 
 	if (kvmppc_e500_id_table_alloc(vcpu_e500) == NULL)
-		goto err_out_priv1;
+		goto err;
 
 	/* Init TLB configuration register */
 	vcpu_e500->tlb0cfg = mfspr(SPRN_TLB0CFG) & ~0xfffUL;
@@ -1055,31 +1171,26 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 
 	return 0;
 
-err_out_priv1:
-	kfree(vcpu_e500->gtlb_priv[1]);
-err_out_priv0:
+err:
+	kfree(vcpu_e500->tlb_refs[0]);
+	kfree(vcpu_e500->tlb_refs[1]);
 	kfree(vcpu_e500->gtlb_priv[0]);
-err_out_guest1:
-	kfree(vcpu_e500->gtlb_arch[1]);
-err_out_guest0:
+	kfree(vcpu_e500->gtlb_priv[1]);
 	kfree(vcpu_e500->gtlb_arch[0]);
-err_out:
+	kfree(vcpu_e500->gtlb_arch[1]);
 	return -1;
 }
 
 void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	int stlbsel, i;
-
-	/* release all privs */
-	for (stlbsel = 0; stlbsel < 2; stlbsel++)
-		for (i = 0; i < vcpu_e500->gtlb_size[stlbsel]; i++) {
-			struct tlbe_priv *priv =
-				&vcpu_e500->gtlb_priv[stlbsel][i];
-			kvmppc_e500_priv_release(priv);
-		}
+	clear_tlb_refs(vcpu_e500);
 
 	kvmppc_e500_id_table_free(vcpu_e500);
+
+	kfree(vcpu_e500->tlb_refs[0]);
+	kfree(vcpu_e500->tlb_refs[1]);
+	kfree(vcpu_e500->gtlb_priv[0]);
+	kfree(vcpu_e500->gtlb_priv[1]);
 	kfree(vcpu_e500->gtlb_arch[1]);
 	kfree(vcpu_e500->gtlb_arch[0]);
 }
diff --git a/arch/powerpc/kvm/e500_tlb.h b/arch/powerpc/kvm/e500_tlb.h
index 59b88e9..b587f69 100644
--- a/arch/powerpc/kvm/e500_tlb.h
+++ b/arch/powerpc/kvm/e500_tlb.h
@@ -155,23 +155,6 @@ static inline unsigned int get_tlb_esel_bit(
 	return (vcpu_e500->mas0 >> 16) & 0xfff;
 }
 
-static inline unsigned int get_tlb_esel(
-		const struct kvmppc_vcpu_e500 *vcpu_e500,
-		int tlbsel)
-{
-	unsigned int esel = get_tlb_esel_bit(vcpu_e500);
-
-	if (tlbsel == 0) {
-		esel &= KVM_E500_TLB0_WAY_NUM_MASK;
-		esel |= ((vcpu_e500->mas2 >> 12) & KVM_E500_TLB0_WAY_SIZE_MASK)
-				<< KVM_E500_TLB0_WAY_NUM_BIT;
-	} else {
-		esel &= KVM_E500_TLB1_SIZE - 1;
-	}
-
-	return esel;
-}
-
 static inline int tlbe_is_host_safe(const struct kvm_vcpu *vcpu,
 			const struct tlbe *tlbe)
 {
-- 
1.6.0.2



* [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Scott Wood

From: Scott Wood <scottwood@freescale.com>

This implements a shared-memory API for giving host userspace access to
the guest's TLB.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 Documentation/virtual/kvm/api.txt   |   74 +++++++
 arch/powerpc/include/asm/kvm.h      |   35 +++
 arch/powerpc/include/asm/kvm_e500.h |   24 +--
 arch/powerpc/include/asm/kvm_ppc.h  |    5 +
 arch/powerpc/kvm/e500.c             |    5 +-
 arch/powerpc/kvm/e500_emulate.c     |   12 +-
 arch/powerpc/kvm/e500_tlb.c         |  393 ++++++++++++++++++++++++-----------
 arch/powerpc/kvm/e500_tlb.h         |   38 ++--
 arch/powerpc/kvm/powerpc.c          |   28 +++
 include/linux/kvm.h                 |   18 ++
 10 files changed, 469 insertions(+), 163 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 7945b0b..ab1136f 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1383,6 +1383,38 @@ The following flags are defined:
 If datamatch flag is set, the event will be signaled only if the written value
 to the registered address is equal to datamatch in struct kvm_ioeventfd.
 
+4.59 KVM_DIRTY_TLB
+
+Capability: KVM_CAP_SW_TLB
+Architectures: ppc
+Type: vcpu ioctl
+Parameters: struct kvm_dirty_tlb (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_dirty_tlb {
+	__u64 bitmap;
+	__u32 num_dirty;
+};
+
+This must be called whenever userspace has changed an entry in the shared
+TLB, prior to calling KVM_RUN on the associated vcpu.
+
+The "bitmap" field is the userspace address of an array.  This array
+consists of a number of bits, equal to the total number of TLB entries as
+determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
+nearest multiple of 64.
+
+Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
+array.
+
+The array is little-endian: the bit 0 is the least significant bit of the
+first byte, bit 8 is the least significant bit of the second byte, etc.
+This avoids any complications with differing word sizes.
+
+The "num_dirty" field is a performance hint for KVM to determine whether it
+should skip processing the bitmap and just invalidate everything.  It must
+be set to the number of set bits in the bitmap.
+
 4.62 KVM_CREATE_SPAPR_TCE
 
 Capability: KVM_CAP_SPAPR_TCE
@@ -1700,3 +1732,45 @@ HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
 HTAB invisible to the guest.
 
 When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
+
+6.3 KVM_CAP_SW_TLB
+
+Architectures: ppc
+Parameters: args[0] is the address of a struct kvm_config_tlb
+Returns: 0 on success; -1 on error
+
+struct kvm_config_tlb {
+	__u64 params;
+	__u64 array;
+	__u32 mmu_type;
+	__u32 array_len;
+};
+
+Configures the virtual CPU's TLB array, establishing a shared memory area
+between userspace and KVM.  The "params" and "array" fields are userspace
+addresses of mmu-type-specific data structures.  The "array_len" field is an
+safety mechanism, and should be set to the size in bytes of the memory that
+userspace has reserved for the array.  It must be at least the size dictated
+by "mmu_type" and "params".
+
+While KVM_RUN is active, the shared region is under control of KVM.  Its
+contents are undefined, and any modification by userspace results in
+boundedly undefined behavior.
+
+On return from KVM_RUN, the shared region will reflect the current state of
+the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
+to tell KVM which entries have been changed, prior to calling KVM_RUN again
+on this vcpu.
+
+For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
+ - The "params" field is of type "struct kvm_book3e_206_tlb_params".
+ - The "array" field points to an array of type "struct
+   kvm_book3e_206_tlb_entry".
+ - The array consists of all entries in the first TLB, followed by all
+   entries in the second TLB.
+ - Within a TLB, entries are ordered first by increasing set number.  Within a
+   set, entries are ordered by way (increasing ESEL).
+ - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
+   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
+ - The tsize field of mas1 shall be set to 4K on TLB0, even though the
+   hardware ignores this value for TLB0.
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index 08fe69e..71684b9 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -300,4 +300,39 @@ struct kvm_allocate_rma {
 	__u64 rma_size;
 };
 
+struct kvm_book3e_206_tlb_entry {
+	__u32 mas8;
+	__u32 mas1;
+	__u64 mas2;
+	__u64 mas7_3;
+};
+
+struct kvm_book3e_206_tlb_params {
+	/*
+	 * For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
+	 *
+	 * - The number of ways of TLB0 must be a power of two between 2 and
+	 *   16.
+	 * - TLB1 must be fully associative.
+	 * - The size of TLB0 must be a multiple of the number of ways, and
+	 *   the number of sets must be a power of two.
+	 * - The size of TLB1 may not exceed 64 entries.
+	 * - TLB0 supports 4 KiB pages.
+	 * - The page sizes supported by TLB1 are as indicated by
+	 *   TLB1CFG (if MMUCFG[MAVN] = 0) or TLB1PS (if MMUCFG[MAVN] = 1)
+	 *   as returned by KVM_GET_SREGS.
+	 * - TLB2 and TLB3 are reserved, and their entries in tlb_sizes[]
+	 *   and tlb_ways[] must be zero.
+	 *
+	 * tlb_ways[n] = tlb_sizes[n] means the array is fully associative.
+	 *
+	 * KVM will adjust TLBnCFG based on the sizes configured here,
+	 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
+	 * set to zero.
+	 */
+	__u32 tlb_sizes[4];
+	__u32 tlb_ways[4];
+	__u32 reserved[8];
+};
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_e500.h b/arch/powerpc/include/asm/kvm_e500.h
index a5197d8..bc17441 100644
--- a/arch/powerpc/include/asm/kvm_e500.h
+++ b/arch/powerpc/include/asm/kvm_e500.h
@@ -22,13 +22,6 @@
 #define E500_PID_NUM   3
 #define E500_TLB_NUM   2
 
-struct tlbe{
-	u32 mas1;
-	u32 mas2;
-	u32 mas3;
-	u32 mas7;
-};
-
 #define E500_TLB_VALID 1
 #define E500_TLB_DIRTY 2
 
@@ -48,13 +41,17 @@ struct kvmppc_e500_tlb_params {
 };
 
 struct kvmppc_vcpu_e500 {
-	/* Unmodified copy of the guest's TLB. */
-	struct tlbe *gtlb_arch[E500_TLB_NUM];
+	/* Unmodified copy of the guest's TLB -- shared with host userspace. */
+	struct kvm_book3e_206_tlb_entry *gtlb_arch;
+
+	/* Starting entry number in gtlb_arch[] */
+	int gtlb_offset[E500_TLB_NUM];
 
 	/* KVM internal information associated with each guest TLB entry */
 	struct tlbe_priv *gtlb_priv[E500_TLB_NUM];
 
-	unsigned int gtlb_size[E500_TLB_NUM];
+	struct kvmppc_e500_tlb_params gtlb_params[E500_TLB_NUM];
+
 	unsigned int gtlb_nv[E500_TLB_NUM];
 
 	/*
@@ -68,7 +65,6 @@ struct kvmppc_vcpu_e500 {
 	 * and back, and our host TLB entries got evicted).
 	 */
 	struct tlbe_ref *tlb_refs[E500_TLB_NUM];
-
 	unsigned int host_tlb1_nv;
 
 	u32 host_pid[E500_PID_NUM];
@@ -78,11 +74,10 @@ struct kvmppc_vcpu_e500 {
 	u32 mas0;
 	u32 mas1;
 	u32 mas2;
-	u32 mas3;
+	u64 mas7_3;
 	u32 mas4;
 	u32 mas5;
 	u32 mas6;
-	u32 mas7;
 
 	/* vcpu id table */
 	struct vcpu_id_table *idt;
@@ -95,6 +90,9 @@ struct kvmppc_vcpu_e500 {
 	u32 tlb1cfg;
 	u64 mcar;
 
+	struct page **shared_tlb_pages;
+	int num_shared_tlb_pages;
+
 	struct kvm_vcpu vcpu;
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 46efd1a..a284f20 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -193,4 +193,9 @@ static inline void kvm_rma_init(void)
 {}
 #endif
 
+int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
+			      struct kvm_config_tlb *cfg);
+int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
+			     struct kvm_dirty_tlb *cfg);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 26d2090..14d6e6e 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -120,7 +120,7 @@ void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	sregs->u.e.mas0 = vcpu_e500->mas0;
 	sregs->u.e.mas1 = vcpu_e500->mas1;
 	sregs->u.e.mas2 = vcpu_e500->mas2;
-	sregs->u.e.mas7_3 = ((u64)vcpu_e500->mas7 << 32) | vcpu_e500->mas3;
+	sregs->u.e.mas7_3 = vcpu_e500->mas7_3;
 	sregs->u.e.mas4 = vcpu_e500->mas4;
 	sregs->u.e.mas6 = vcpu_e500->mas6;
 
@@ -153,8 +153,7 @@ int kvmppc_core_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		vcpu_e500->mas0 = sregs->u.e.mas0;
 		vcpu_e500->mas1 = sregs->u.e.mas1;
 		vcpu_e500->mas2 = sregs->u.e.mas2;
-		vcpu_e500->mas7 = sregs->u.e.mas7_3 >> 32;
-		vcpu_e500->mas3 = (u32)sregs->u.e.mas7_3;
+		vcpu_e500->mas7_3 = sregs->u.e.mas7_3;
 		vcpu_e500->mas4 = sregs->u.e.mas4;
 		vcpu_e500->mas6 = sregs->u.e.mas6;
 	}
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index d48ae39..e0d3609 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -95,13 +95,17 @@ int kvmppc_core_emulate_mtspr(struct kvm_vcpu *vcpu, int sprn, int rs)
 	case SPRN_MAS2:
 		vcpu_e500->mas2 = spr_val; break;
 	case SPRN_MAS3:
-		vcpu_e500->mas3 = spr_val; break;
+		vcpu_e500->mas7_3 &= ~(u64)0xffffffff;
+		vcpu_e500->mas7_3 |= spr_val;
+		break;
 	case SPRN_MAS4:
 		vcpu_e500->mas4 = spr_val; break;
 	case SPRN_MAS6:
 		vcpu_e500->mas6 = spr_val; break;
 	case SPRN_MAS7:
-		vcpu_e500->mas7 = spr_val; break;
+		vcpu_e500->mas7_3 &= (u64)0xffffffff;
+		vcpu_e500->mas7_3 |= (u64)spr_val << 32;
+		break;
 	case SPRN_L1CSR0:
 		vcpu_e500->l1csr0 = spr_val;
 		vcpu_e500->l1csr0 &= ~(L1CSR0_DCFI | L1CSR0_CLFC);
@@ -158,13 +162,13 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, int rt)
 	case SPRN_MAS2:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas2); break;
 	case SPRN_MAS3:
-		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas3); break;
+		kvmppc_set_gpr(vcpu, rt, (u32)vcpu_e500->mas7_3); break;
 	case SPRN_MAS4:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas4); break;
 	case SPRN_MAS6:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas6); break;
 	case SPRN_MAS7:
-		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas7); break;
+		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas7_3 >> 32); break;
 
 	case SPRN_TLB0CFG:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->tlb0cfg); break;
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 59221bb..f19ae2f 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -19,6 +19,11 @@
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
 #include <linux/highmem.h>
+#include <linux/log2.h>
+#include <linux/uaccess.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <linux/vmalloc.h>
 #include <asm/kvm_ppc.h>
 #include <asm/kvm_e500.h>
 
@@ -66,6 +71,13 @@ static DEFINE_PER_CPU(unsigned long, pcpu_last_used_sid);
 
 static struct kvmppc_e500_tlb_params host_tlb_params[E500_TLB_NUM];
 
+static struct kvm_book3e_206_tlb_entry *get_entry(
+	struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel, int entry)
+{
+	int offset = vcpu_e500->gtlb_offset[tlbsel];
+	return &vcpu_e500->gtlb_arch[offset + entry];
+}
+
 /*
  * Allocate a free shadow id and setup a valid sid mapping in given entry.
  * A mapping is only valid when vcpu_id_table and pcpu_id_table are match.
@@ -217,34 +229,13 @@ void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *vcpu_e500)
 	preempt_enable();
 }
 
-void kvmppc_dump_tlbs(struct kvm_vcpu *vcpu)
-{
-	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *tlbe;
-	int i, tlbsel;
-
-	printk("| %8s | %8s | %8s | %8s | %8s |\n",
-			"nr", "mas1", "mas2", "mas3", "mas7");
-
-	for (tlbsel = 0; tlbsel < 2; tlbsel++) {
-		printk("Guest TLB%d:\n", tlbsel);
-		for (i = 0; i < vcpu_e500->gtlb_size[tlbsel]; i++) {
-			tlbe = &vcpu_e500->gtlb_arch[tlbsel][i];
-			if (tlbe->mas1 & MAS1_VALID)
-				printk(" G[%d][%3d] |  %08X | %08X | %08X | %08X |\n",
-					tlbsel, i, tlbe->mas1, tlbe->mas2,
-					tlbe->mas3, tlbe->mas7);
-		}
-	}
-}
-
 static inline unsigned int gtlb0_get_next_victim(
 		struct kvmppc_vcpu_e500 *vcpu_e500)
 {
 	unsigned int victim;
 
 	victim = vcpu_e500->gtlb_nv[0]++;
-	if (unlikely(vcpu_e500->gtlb_nv[0] >= KVM_E500_TLB0_WAY_NUM))
+	if (unlikely(vcpu_e500->gtlb_nv[0] >= vcpu_e500->gtlb_params[0].ways))
 		vcpu_e500->gtlb_nv[0] = 0;
 
 	return victim;
@@ -256,9 +247,9 @@ static inline unsigned int tlb1_max_shadow_size(void)
 	return host_tlb_params[1].entries - tlbcam_index - 1;
 }
 
-static inline int tlbe_is_writable(struct tlbe *tlbe)
+static inline int tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
 {
-	return tlbe->mas3 & (MAS3_SW|MAS3_UW);
+	return tlbe->mas7_3 & (MAS3_SW|MAS3_UW);
 }
 
 static inline u32 e500_shadow_mas3_attrib(u32 mas3, int usermode)
@@ -289,39 +280,41 @@ static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
 /*
  * writing shadow tlb entry to host TLB
  */
-static inline void __write_host_tlbe(struct tlbe *stlbe, uint32_t mas0)
+static inline void __write_host_tlbe(struct kvm_book3e_206_tlb_entry *stlbe,
+				     uint32_t mas0)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
 	mtspr(SPRN_MAS0, mas0);
 	mtspr(SPRN_MAS1, stlbe->mas1);
-	mtspr(SPRN_MAS2, stlbe->mas2);
-	mtspr(SPRN_MAS3, stlbe->mas3);
-	mtspr(SPRN_MAS7, stlbe->mas7);
+	mtspr(SPRN_MAS2, (unsigned long)stlbe->mas2);
+	mtspr(SPRN_MAS3, (u32)stlbe->mas7_3);
+	mtspr(SPRN_MAS7, (u32)(stlbe->mas7_3 >> 32));
 	asm volatile("isync; tlbwe" : : : "memory");
 	local_irq_restore(flags);
 }
 
 /* esel is index into set, not whole array */
 static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-		int tlbsel, int esel, struct tlbe *stlbe)
+		int tlbsel, int esel, struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	if (tlbsel == 0) {
-		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(esel));
+		int way = esel & (vcpu_e500->gtlb_params[0].ways - 1);
+		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(way));
 	} else {
 		__write_host_tlbe(stlbe,
 				  MAS0_TLBSEL(1) |
 				  MAS0_ESEL(to_htlb1_esel(esel)));
 	}
 	trace_kvm_stlb_write(index_of(tlbsel, esel), stlbe->mas1, stlbe->mas2,
-			     stlbe->mas3, stlbe->mas7);
+			     (u32)stlbe->mas7_3, (u32)(stlbe->mas7_3 >> 32));
 }
 
 void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe magic;
+	struct kvm_book3e_206_tlb_entry magic;
 	ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
 	unsigned int stid;
 	pfn_t pfn;
@@ -335,9 +328,8 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 	magic.mas1 = MAS1_VALID | MAS1_TS | MAS1_TID(stid) |
 		     MAS1_TSIZE(BOOK3E_PAGESZ_4K);
 	magic.mas2 = vcpu->arch.magic_page_ea | MAS2_M;
-	magic.mas3 = (pfn << PAGE_SHIFT) |
-		     MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
-	magic.mas7 = pfn >> (32 - PAGE_SHIFT);
+	magic.mas7_3 = ((u64)pfn << PAGE_SHIFT) |
+		       MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
 
 	__write_host_tlbe(&magic, MAS0_TLBSEL(1) | MAS0_ESEL(tlbcam_index));
 	preempt_enable();
@@ -358,7 +350,8 @@ void kvmppc_e500_tlb_put(struct kvm_vcpu *vcpu)
 static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500,
 				int tlbsel, int esel)
 {
-	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	struct kvm_book3e_206_tlb_entry *gtlbe =
+		get_entry(vcpu_e500, tlbsel, esel);
 	struct vcpu_id_table *idt = vcpu_e500->idt;
 	unsigned int pr, tid, ts, pid;
 	u32 val, eaddr;
@@ -424,9 +417,8 @@ static int tlb0_set_base(gva_t addr, int sets, int ways)
 
 static int gtlb0_set_base(struct kvmppc_vcpu_e500 *vcpu_e500, gva_t addr)
 {
-	int sets = KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
-
-	return tlb0_set_base(addr, sets, KVM_E500_TLB0_WAY_NUM);
+	return tlb0_set_base(addr, vcpu_e500->gtlb_params[0].sets,
+			     vcpu_e500->gtlb_params[0].ways);
 }
 
 static int htlb0_set_base(gva_t addr)
@@ -440,10 +432,10 @@ static unsigned int get_tlb_esel(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel)
 	unsigned int esel = get_tlb_esel_bit(vcpu_e500);
 
 	if (tlbsel == 0) {
-		esel &= KVM_E500_TLB0_WAY_NUM_MASK;
+		esel &= vcpu_e500->gtlb_params[0].ways - 1;
 		esel += gtlb0_set_base(vcpu_e500, vcpu_e500->mas2);
 	} else {
-		esel &= vcpu_e500->gtlb_size[tlbsel] - 1;
+		esel &= vcpu_e500->gtlb_params[tlbsel].entries - 1;
 	}
 
 	return esel;
@@ -453,19 +445,22 @@ static unsigned int get_tlb_esel(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel)
 static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 		gva_t eaddr, int tlbsel, unsigned int pid, int as)
 {
-	int size = vcpu_e500->gtlb_size[tlbsel];
-	unsigned int set_base;
+	int size = vcpu_e500->gtlb_params[tlbsel].entries;
+	unsigned int set_base, offset;
 	int i;
 
 	if (tlbsel == 0) {
 		set_base = gtlb0_set_base(vcpu_e500, eaddr);
-		size = KVM_E500_TLB0_WAY_NUM;
+		size = vcpu_e500->gtlb_params[0].ways;
 	} else {
 		set_base = 0;
 	}
 
+	offset = vcpu_e500->gtlb_offset[tlbsel];
+
 	for (i = 0; i < size; i++) {
-		struct tlbe *tlbe = &vcpu_e500->gtlb_arch[tlbsel][set_base + i];
+		struct kvm_book3e_206_tlb_entry *tlbe =
+			&vcpu_e500->gtlb_arch[offset + set_base + i];
 		unsigned int tid;
 
 		if (eaddr < get_tlb_eaddr(tlbe))
@@ -491,7 +486,7 @@ static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 }
 
 static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
-					 struct tlbe *gtlbe,
+					 struct kvm_book3e_206_tlb_entry *gtlbe,
 					 pfn_t pfn)
 {
 	ref->pfn = pfn;
@@ -518,7 +513,7 @@ static void clear_tlb_privs(struct kvmppc_vcpu_e500 *vcpu_e500)
 	int tlbsel = 0;
 	int i;
 
-	for (i = 0; i < vcpu_e500->gtlb_size[tlbsel]; i++) {
+	for (i = 0; i < vcpu_e500->gtlb_params[tlbsel].entries; i++) {
 		struct tlbe_ref *ref =
 			&vcpu_e500->gtlb_priv[tlbsel][i].ref;
 		kvmppc_e500_ref_release(ref);
@@ -530,6 +525,8 @@ static void clear_tlb_refs(struct kvmppc_vcpu_e500 *vcpu_e500)
 	int stlbsel = 1;
 	int i;
 
+	kvmppc_e500_id_table_reset_all(vcpu_e500);
+
 	for (i = 0; i < host_tlb_params[stlbsel].entries; i++) {
 		struct tlbe_ref *ref =
 			&vcpu_e500->tlb_refs[stlbsel][i];
@@ -559,18 +556,18 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 		| MAS1_TSIZE(tsized);
 	vcpu_e500->mas2 = (eaddr & MAS2_EPN)
 		| (vcpu_e500->mas4 & MAS2_ATTRIB_MASK);
-	vcpu_e500->mas3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
+	vcpu_e500->mas7_3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
 	vcpu_e500->mas6 = (vcpu_e500->mas6 & MAS6_SPID1)
 		| (get_cur_pid(vcpu) << 16)
 		| (as ? MAS6_SAS : 0);
-	vcpu_e500->mas7 = 0;
 }
 
 /* TID must be supplied by the caller */
-static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-					   struct tlbe *gtlbe, int tsize,
-					   struct tlbe_ref *ref,
-					   u64 gvaddr, struct tlbe *stlbe)
+static inline void kvmppc_e500_setup_stlbe(
+	struct kvmppc_vcpu_e500 *vcpu_e500,
+	struct kvm_book3e_206_tlb_entry *gtlbe,
+	int tsize, struct tlbe_ref *ref, u64 gvaddr,
+	struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	pfn_t pfn = ref->pfn;
 
@@ -581,16 +578,16 @@ static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 	stlbe->mas2 = (gvaddr & MAS2_EPN)
 		| e500_shadow_mas2_attrib(gtlbe->mas2,
 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
-	stlbe->mas3 = ((pfn << PAGE_SHIFT) & MAS3_RPN)
-		| e500_shadow_mas3_attrib(gtlbe->mas3,
+	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT)
+		| e500_shadow_mas3_attrib(gtlbe->mas7_3,
 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
-	stlbe->mas7 = (pfn >> (32 - PAGE_SHIFT)) & MAS7_RPN;
 }
 
 /* sesel is an index into the entire array, not just the set */
 static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-	u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, int tlbsel, int sesel,
-	struct tlbe *stlbe, struct tlbe_ref *ref)
+	u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
+	int tlbsel, int sesel, struct kvm_book3e_206_tlb_entry *stlbe,
+	struct tlbe_ref *ref)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long pfn, hva;
@@ -700,15 +697,16 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 
 /* XXX only map the one-one case, for now use TLB0 */
 static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-				int esel, struct tlbe *stlbe)
+				int esel,
+				struct kvm_book3e_206_tlb_entry *stlbe)
 {
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 	struct tlbe_ref *ref;
 	int sesel = esel & (host_tlb_params[0].ways - 1);
 	int sesel_base;
 	gva_t ea;
 
-	gtlbe = &vcpu_e500->gtlb_arch[0][esel];
+	gtlbe = get_entry(vcpu_e500, 0, esel);
 	ref = &vcpu_e500->gtlb_priv[0][esel].ref;
 
 	ea = get_tlb_eaddr(gtlbe);
@@ -725,7 +723,8 @@ static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
  * the shadow TLB. */
 /* XXX for both one-one and one-to-many, for now use TLB1 */
 static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-		u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, struct tlbe *stlbe)
+		u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
+		struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	struct tlbe_ref *ref;
 	unsigned int victim;
@@ -754,7 +753,8 @@ static inline int kvmppc_e500_gtlbe_invalidate(
 				struct kvmppc_vcpu_e500 *vcpu_e500,
 				int tlbsel, int esel)
 {
-	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	struct kvm_book3e_206_tlb_entry *gtlbe =
+		get_entry(vcpu_e500, tlbsel, esel);
 
 	if (unlikely(get_tlb_iprot(gtlbe)))
 		return -1;
@@ -769,10 +769,10 @@ int kvmppc_e500_emul_mt_mmucsr0(struct kvmppc_vcpu_e500 *vcpu_e500, ulong value)
 	int esel;
 
 	if (value & MMUCSR0_TLB0FI)
-		for (esel = 0; esel < vcpu_e500->gtlb_size[0]; esel++)
+		for (esel = 0; esel < vcpu_e500->gtlb_params[0].entries; esel++)
 			kvmppc_e500_gtlbe_invalidate(vcpu_e500, 0, esel);
 	if (value & MMUCSR0_TLB1FI)
-		for (esel = 0; esel < vcpu_e500->gtlb_size[1]; esel++)
+		for (esel = 0; esel < vcpu_e500->gtlb_params[1].entries; esel++)
 			kvmppc_e500_gtlbe_invalidate(vcpu_e500, 1, esel);
 
 	/* Invalidate all vcpu id mappings */
@@ -797,7 +797,8 @@ int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *vcpu, int ra, int rb)
 
 	if (ia) {
 		/* invalidate all entries */
-		for (esel = 0; esel < vcpu_e500->gtlb_size[tlbsel]; esel++)
+		for (esel = 0; esel < vcpu_e500->gtlb_params[tlbsel].entries;
+		     esel++)
 			kvmppc_e500_gtlbe_invalidate(vcpu_e500, tlbsel, esel);
 	} else {
 		ea &= 0xfffff000;
@@ -817,18 +818,17 @@ int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
 	int tlbsel, esel;
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 
 	tlbsel = get_tlb_tlbsel(vcpu_e500);
 	esel = get_tlb_esel(vcpu_e500, tlbsel);
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 	vcpu_e500->mas0 &= ~MAS0_NV(~0);
 	vcpu_e500->mas0 |= MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
 	vcpu_e500->mas1 = gtlbe->mas1;
 	vcpu_e500->mas2 = gtlbe->mas2;
-	vcpu_e500->mas3 = gtlbe->mas3;
-	vcpu_e500->mas7 = gtlbe->mas7;
+	vcpu_e500->mas7_3 = gtlbe->mas7_3;
 
 	return EMULATE_DONE;
 }
@@ -839,7 +839,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 	int as = !!get_cur_sas(vcpu_e500);
 	unsigned int pid = get_cur_spid(vcpu_e500);
 	int esel, tlbsel;
-	struct tlbe *gtlbe = NULL;
+	struct kvm_book3e_206_tlb_entry *gtlbe = NULL;
 	gva_t ea;
 
 	ea = kvmppc_get_gpr(vcpu, rb);
@@ -847,7 +847,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 	for (tlbsel = 0; tlbsel < 2; tlbsel++) {
 		esel = kvmppc_e500_tlb_index(vcpu_e500, ea, tlbsel, pid, as);
 		if (esel >= 0) {
-			gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+			gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 			break;
 		}
 	}
@@ -857,8 +857,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 			| MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
 		vcpu_e500->mas1 = gtlbe->mas1;
 		vcpu_e500->mas2 = gtlbe->mas2;
-		vcpu_e500->mas3 = gtlbe->mas3;
-		vcpu_e500->mas7 = gtlbe->mas7;
+		vcpu_e500->mas7_3 = gtlbe->mas7_3;
 	} else {
 		int victim;
 
@@ -873,8 +872,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 			| (vcpu_e500->mas4 & MAS4_TSIZED(~0));
 		vcpu_e500->mas2 &= MAS2_EPN;
 		vcpu_e500->mas2 |= vcpu_e500->mas4 & MAS2_ATTRIB_MASK;
-		vcpu_e500->mas3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
-		vcpu_e500->mas7 = 0;
+		vcpu_e500->mas7_3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
 	}
 
 	kvmppc_set_exit_type(vcpu, EMULATED_TLBSX_EXITS);
@@ -883,8 +881,8 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 
 /* sesel is index into the set, not the whole array */
 static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-			struct tlbe *gtlbe,
-			struct tlbe *stlbe,
+			struct kvm_book3e_206_tlb_entry *gtlbe,
+			struct kvm_book3e_206_tlb_entry *stlbe,
 			int stlbsel, int sesel)
 {
 	int stid;
@@ -902,28 +900,27 @@ static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 	int tlbsel, esel;
 
 	tlbsel = get_tlb_tlbsel(vcpu_e500);
 	esel = get_tlb_esel(vcpu_e500, tlbsel);
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 
 	if (get_tlb_v(gtlbe))
 		inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
 
 	gtlbe->mas1 = vcpu_e500->mas1;
 	gtlbe->mas2 = vcpu_e500->mas2;
-	gtlbe->mas3 = vcpu_e500->mas3;
-	gtlbe->mas7 = vcpu_e500->mas7;
+	gtlbe->mas7_3 = vcpu_e500->mas7_3;
 
 	trace_kvm_gtlb_write(vcpu_e500->mas0, gtlbe->mas1, gtlbe->mas2,
-			     gtlbe->mas3, gtlbe->mas7);
+			     (u32)gtlbe->mas7_3, (u32)(gtlbe->mas7_3 >> 32));
 
 	/* Invalidate shadow mappings for the about-to-be-clobbered TLBE. */
 	if (tlbe_is_host_safe(vcpu, gtlbe)) {
-		struct tlbe stlbe;
+		struct kvm_book3e_206_tlb_entry stlbe;
 		int stlbsel, sesel;
 		u64 eaddr;
 		u64 raddr;
@@ -996,9 +993,11 @@ gpa_t kvmppc_mmu_xlate(struct kvm_vcpu *vcpu, unsigned int index,
 			gva_t eaddr)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *gtlbe =
-		&vcpu_e500->gtlb_arch[tlbsel_of(index)][esel_of(index)];
-	u64 pgmask = get_tlb_bytes(gtlbe) - 1;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
+	u64 pgmask;
+
+	gtlbe = get_entry(vcpu_e500, tlbsel_of(index), esel_of(index));
+	pgmask = get_tlb_bytes(gtlbe) - 1;
 
 	return get_tlb_raddr(gtlbe) | (eaddr & pgmask);
 }
@@ -1012,12 +1011,12 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
 	struct tlbe_priv *priv;
-	struct tlbe *gtlbe, stlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe, stlbe;
 	int tlbsel = tlbsel_of(index);
 	int esel = esel_of(index);
 	int stlbsel, sesel;
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 
 	switch (tlbsel) {
 	case 0:
@@ -1073,25 +1072,174 @@ void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid)
 
 void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	struct tlbe *tlbe;
+	struct kvm_book3e_206_tlb_entry *tlbe;
 
 	/* Insert large initial mapping for guest. */
-	tlbe = &vcpu_e500->gtlb_arch[1][0];
+	tlbe = get_entry(vcpu_e500, 1, 0);
 	tlbe->mas1 = MAS1_VALID | MAS1_TSIZE(BOOK3E_PAGESZ_256M);
 	tlbe->mas2 = 0;
-	tlbe->mas3 = E500_TLB_SUPER_PERM_MASK;
-	tlbe->mas7 = 0;
+	tlbe->mas7_3 = E500_TLB_SUPER_PERM_MASK;
 
 	/* 4K map for serial output. Used by kernel wrapper. */
-	tlbe = &vcpu_e500->gtlb_arch[1][1];
+	tlbe = get_entry(vcpu_e500, 1, 1);
 	tlbe->mas1 = MAS1_VALID | MAS1_TSIZE(BOOK3E_PAGESZ_4K);
 	tlbe->mas2 = (0xe0004500 & 0xFFFFF000) | MAS2_I | MAS2_G;
-	tlbe->mas3 = (0xe0004500 & 0xFFFFF000) | E500_TLB_SUPER_PERM_MASK;
-	tlbe->mas7 = 0;
+	tlbe->mas7_3 = (0xe0004500 & 0xFFFFF000) | E500_TLB_SUPER_PERM_MASK;
+}
+
+static void free_gtlb(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+	int i;
+
+	clear_tlb_refs(vcpu_e500);
+	kfree(vcpu_e500->gtlb_priv[0]);
+	kfree(vcpu_e500->gtlb_priv[1]);
+
+	if (vcpu_e500->shared_tlb_pages) {
+		vfree((void *)(round_down((uintptr_t)vcpu_e500->gtlb_arch,
+					  PAGE_SIZE)));
+
+		for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++) {
+			set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
+			put_page(vcpu_e500->shared_tlb_pages[i]);
+		}
+
+		vcpu_e500->num_shared_tlb_pages = 0;
+		vcpu_e500->shared_tlb_pages = NULL;
+	} else {
+		kfree(vcpu_e500->gtlb_arch);
+	}
+
+	vcpu_e500->gtlb_arch = NULL;
+}
+
+int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
+			      struct kvm_config_tlb *cfg)
+{
+	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+	struct kvm_book3e_206_tlb_params params;
+	char *virt;
+	struct page **pages;
+	struct tlbe_priv *privs[2] = {};
+	size_t array_len;
+	u32 sets;
+	int num_pages, ret, i;
+
+	if (cfg->mmu_type != KVM_MMU_FSL_BOOKE_NOHV)
+		return -EINVAL;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)cfg->params,
+			   sizeof(params)))
+		return -EFAULT;
+
+	if (params.tlb_sizes[1] > 64)
+		return -EINVAL;
+	if (params.tlb_ways[1] != params.tlb_sizes[1])
+		return -EINVAL;
+	if (params.tlb_sizes[2] != 0 || params.tlb_sizes[3] != 0)
+		return -EINVAL;
+	if (params.tlb_ways[2] != 0 || params.tlb_ways[3] != 0)
+		return -EINVAL;
+
+	if (!is_power_of_2(params.tlb_ways[0]))
+		return -EINVAL;
+
+	sets = params.tlb_sizes[0] >> ilog2(params.tlb_ways[0]);
+	if (!is_power_of_2(sets))
+		return -EINVAL;
+
+	array_len = params.tlb_sizes[0] + params.tlb_sizes[1];
+	array_len *= sizeof(struct kvm_book3e_206_tlb_entry);
+
+	if (cfg->array_len < array_len)
+		return -EINVAL;
+
+	num_pages = DIV_ROUND_UP(cfg->array + array_len - 1, PAGE_SIZE) -
+		    cfg->array / PAGE_SIZE;
+	pages = kmalloc(sizeof(struct page *) * num_pages, GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	ret = get_user_pages_fast(cfg->array, num_pages, 1, pages);
+	if (ret < 0)
+		goto err_pages;
+
+	if (ret != num_pages) {
+		num_pages = ret;
+		ret = -EFAULT;
+		goto err_put_page;
+	}
+
+	virt = vmap(pages, num_pages, VM_MAP, PAGE_KERNEL);
+	if (!virt) {
+		ret = -ENOMEM;
+		goto err_put_page;
+	}
+
+	privs[0] = kzalloc(sizeof(struct tlbe_priv) * params.tlb_sizes[0],
+			   GFP_KERNEL);
+	privs[1] = kzalloc(sizeof(struct tlbe_priv) * params.tlb_sizes[1],
+			   GFP_KERNEL);
+
+	if (!privs[0] || !privs[1]) {
+		ret = -ENOMEM;
+		goto err_put_page;
+	}
+
+	free_gtlb(vcpu_e500);
+
+	vcpu_e500->gtlb_priv[0] = privs[0];
+	vcpu_e500->gtlb_priv[1] = privs[1];
+
+	vcpu_e500->gtlb_arch = (struct kvm_book3e_206_tlb_entry *)
+		(virt + (cfg->array & (PAGE_SIZE - 1)));
+
+	vcpu_e500->gtlb_params[0].entries = params.tlb_sizes[0];
+	vcpu_e500->gtlb_params[1].entries = params.tlb_sizes[1];
+
+	vcpu_e500->gtlb_offset[0] = 0;
+	vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
+
+	vcpu_e500->tlb0cfg = mfspr(SPRN_TLB0CFG) & ~0xfffUL;
+	if (params.tlb_sizes[0] <= 2048)
+		vcpu_e500->tlb0cfg |= params.tlb_sizes[0];
+
+	vcpu_e500->tlb1cfg = mfspr(SPRN_TLB1CFG) & ~0xfffUL;
+	vcpu_e500->tlb1cfg |= params.tlb_sizes[1];
+
+	vcpu_e500->shared_tlb_pages = pages;
+	vcpu_e500->num_shared_tlb_pages = num_pages;
+
+	vcpu_e500->gtlb_params[0].ways = params.tlb_ways[0];
+	vcpu_e500->gtlb_params[0].sets = sets;
+
+	vcpu_e500->gtlb_params[1].ways = params.tlb_sizes[1];
+	vcpu_e500->gtlb_params[1].sets = 1;
+
+	return 0;
+
+err_put_page:
+	kfree(privs[0]);
+	kfree(privs[1]);
+
+	for (i = 0; i < num_pages; i++)
+		put_page(pages[i]);
+
+err_pages:
+	kfree(pages);
+	return ret;
+}
+
+int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
+			     struct kvm_dirty_tlb *dirty)
+{
+	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+
+	clear_tlb_refs(vcpu_e500);
+	return 0;
 }
 
 int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
+	int entry_size = sizeof(struct kvm_book3e_206_tlb_entry);
+	int entries = KVM_E500_TLB0_SIZE + KVM_E500_TLB1_SIZE;
+
 	host_tlb_params[0].entries = mfspr(SPRN_TLB0CFG) & TLBnCFG_N_ENTRY;
 	host_tlb_params[1].entries = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY;
 
@@ -1124,17 +1272,22 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 		host_tlb_params[0].entries / host_tlb_params[0].ways;
 	host_tlb_params[1].sets = 1;
 
-	vcpu_e500->gtlb_size[0] = KVM_E500_TLB0_SIZE;
-	vcpu_e500->gtlb_arch[0] =
-		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB0_SIZE, GFP_KERNEL);
-	if (vcpu_e500->gtlb_arch[0] == NULL)
-		goto err;
+	vcpu_e500->gtlb_params[0].entries = KVM_E500_TLB0_SIZE;
+	vcpu_e500->gtlb_params[1].entries = KVM_E500_TLB1_SIZE;
 
-	vcpu_e500->gtlb_size[1] = KVM_E500_TLB1_SIZE;
-	vcpu_e500->gtlb_arch[1] =
-		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB1_SIZE, GFP_KERNEL);
-	if (vcpu_e500->gtlb_arch[1] == NULL)
-		goto err;
+	vcpu_e500->gtlb_params[0].ways = KVM_E500_TLB0_WAY_NUM;
+	vcpu_e500->gtlb_params[0].sets =
+		KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
+
+	vcpu_e500->gtlb_params[1].ways = KVM_E500_TLB1_SIZE;
+	vcpu_e500->gtlb_params[1].sets = 1;
+
+	vcpu_e500->gtlb_arch = kmalloc(entries * entry_size, GFP_KERNEL);
+	if (!vcpu_e500->gtlb_arch)
+		return -ENOMEM;
+
+	vcpu_e500->gtlb_offset[0] = 0;
+	vcpu_e500->gtlb_offset[1] = KVM_E500_TLB0_SIZE;
 
 	vcpu_e500->tlb_refs[0] =
 		kzalloc(sizeof(struct tlbe_ref) * host_tlb_params[0].entries,
@@ -1148,15 +1301,15 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 	if (!vcpu_e500->tlb_refs[1])
 		goto err;
 
-	vcpu_e500->gtlb_priv[0] =
-		kzalloc(sizeof(struct tlbe_ref) * vcpu_e500->gtlb_size[0],
-			GFP_KERNEL);
+	vcpu_e500->gtlb_priv[0] = kzalloc(sizeof(struct tlbe_ref) *
+					  vcpu_e500->gtlb_params[0].entries,
+					  GFP_KERNEL);
 	if (!vcpu_e500->gtlb_priv[0])
 		goto err;
 
-	vcpu_e500->gtlb_priv[1] =
-		kzalloc(sizeof(struct tlbe_ref) * vcpu_e500->gtlb_size[1],
-			GFP_KERNEL);
+	vcpu_e500->gtlb_priv[1] = kzalloc(sizeof(struct tlbe_ref) *
+					  vcpu_e500->gtlb_params[1].entries,
+					  GFP_KERNEL);
 	if (!vcpu_e500->gtlb_priv[1])
 		goto err;
 
@@ -1165,32 +1318,24 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 
 	/* Init TLB configuration register */
 	vcpu_e500->tlb0cfg = mfspr(SPRN_TLB0CFG) & ~0xfffUL;
-	vcpu_e500->tlb0cfg |= vcpu_e500->gtlb_size[0];
+	vcpu_e500->tlb0cfg |= vcpu_e500->gtlb_params[0].entries;
 	vcpu_e500->tlb1cfg = mfspr(SPRN_TLB1CFG) & ~0xfffUL;
-	vcpu_e500->tlb1cfg |= vcpu_e500->gtlb_size[1];
+	vcpu_e500->tlb1cfg |= vcpu_e500->gtlb_params[1].entries;
 
 	return 0;
 
 err:
+	free_gtlb(vcpu_e500);
 	kfree(vcpu_e500->tlb_refs[0]);
 	kfree(vcpu_e500->tlb_refs[1]);
-	kfree(vcpu_e500->gtlb_priv[0]);
-	kfree(vcpu_e500->gtlb_priv[1]);
-	kfree(vcpu_e500->gtlb_arch[0]);
-	kfree(vcpu_e500->gtlb_arch[1]);
 	return -1;
 }
 
 void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	clear_tlb_refs(vcpu_e500);
-
+	free_gtlb(vcpu_e500);
 	kvmppc_e500_id_table_free(vcpu_e500);
 
 	kfree(vcpu_e500->tlb_refs[0]);
 	kfree(vcpu_e500->tlb_refs[1]);
-	kfree(vcpu_e500->gtlb_priv[0]);
-	kfree(vcpu_e500->gtlb_priv[1]);
-	kfree(vcpu_e500->gtlb_arch[1]);
-	kfree(vcpu_e500->gtlb_arch[0]);
 }
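A standalone sketch (not kernel code; names are illustrative) of the MAS7/MAS3 merge this patch makes throughout e500_tlb.c: the two 32-bit registers now live in a single 64-bit `mas7_3` field, with MAS7 in bits 63:32 and MAS3 in bits 31:0, which is exactly the split `__write_host_tlbe()` performs when loading SPRN_MAS3 and SPRN_MAS7:

```c
#include <assert.h>
#include <stdint.h>

/* MAS3 occupies the low 32 bits of mas7_3. */
static inline uint32_t mas3_of(uint64_t mas7_3)
{
	return (uint32_t)mas7_3;
}

/* MAS7 (the upper RPN bits) occupies the high 32 bits. */
static inline uint32_t mas7_of(uint64_t mas7_3)
{
	return (uint32_t)(mas7_3 >> 32);
}

/* Recombine the two registers into one 64-bit field. */
static inline uint64_t mas7_3_pack(uint32_t mas7, uint32_t mas3)
{
	return ((uint64_t)mas7 << 32) | mas3;
}
```

Round-tripping through pack/split loses nothing, which is why the patch can drop the separate `mas3`/`mas7` members without changing guest-visible state.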
diff --git a/arch/powerpc/kvm/e500_tlb.h b/arch/powerpc/kvm/e500_tlb.h
index b587f69..2c29640 100644
--- a/arch/powerpc/kvm/e500_tlb.h
+++ b/arch/powerpc/kvm/e500_tlb.h
@@ -20,13 +20,9 @@
 #include <asm/tlb.h>
 #include <asm/kvm_e500.h>
 
-#define KVM_E500_TLB0_WAY_SIZE_BIT	7	/* Fixed */
-#define KVM_E500_TLB0_WAY_SIZE		(1UL << KVM_E500_TLB0_WAY_SIZE_BIT)
-#define KVM_E500_TLB0_WAY_SIZE_MASK	(KVM_E500_TLB0_WAY_SIZE - 1)
-
-#define KVM_E500_TLB0_WAY_NUM_BIT	1	/* No greater than 7 */
-#define KVM_E500_TLB0_WAY_NUM		(1UL << KVM_E500_TLB0_WAY_NUM_BIT)
-#define KVM_E500_TLB0_WAY_NUM_MASK	(KVM_E500_TLB0_WAY_NUM - 1)
+/* This geometry is the legacy default -- can be overridden by userspace */
+#define KVM_E500_TLB0_WAY_SIZE		128
+#define KVM_E500_TLB0_WAY_NUM		2
 
 #define KVM_E500_TLB0_SIZE  (KVM_E500_TLB0_WAY_SIZE * KVM_E500_TLB0_WAY_NUM)
 #define KVM_E500_TLB1_SIZE  16
@@ -58,50 +54,54 @@ extern void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *);
 extern void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *);
 
 /* TLB helper functions */
-static inline unsigned int get_tlb_size(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_size(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 7) & 0x1f;
 }
 
-static inline gva_t get_tlb_eaddr(const struct tlbe *tlbe)
+static inline gva_t get_tlb_eaddr(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return tlbe->mas2 & 0xfffff000;
 }
 
-static inline u64 get_tlb_bytes(const struct tlbe *tlbe)
+static inline u64 get_tlb_bytes(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	unsigned int pgsize = get_tlb_size(tlbe);
 	return 1ULL << 10 << pgsize;
 }
 
-static inline gva_t get_tlb_end(const struct tlbe *tlbe)
+static inline gva_t get_tlb_end(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	u64 bytes = get_tlb_bytes(tlbe);
 	return get_tlb_eaddr(tlbe) + bytes - 1;
 }
 
-static inline u64 get_tlb_raddr(const struct tlbe *tlbe)
+static inline u64 get_tlb_raddr(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
-	u64 rpn = tlbe->mas7;
-	return (rpn << 32) | (tlbe->mas3 & 0xfffff000);
+	return tlbe->mas7_3 & ~0xfffULL;
 }
 
-static inline unsigned int get_tlb_tid(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_tid(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 16) & 0xff;
 }
 
-static inline unsigned int get_tlb_ts(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_ts(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 12) & 0x1;
 }
 
-static inline unsigned int get_tlb_v(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_v(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 31) & 0x1;
 }
 
-static inline unsigned int get_tlb_iprot(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_iprot(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 30) & 0x1;
 }
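A hedged sketch of the page-size arithmetic the rewritten helpers rely on: TSIZE lives in MAS1 bits 11:7 (the `(mas1 >> 7) & 0x1f` extraction above), and `get_tlb_bytes()` computes `1ULL << 10 << tsize`, i.e. 2^(10 + tsize) bytes, so tsize == 2 is the 4 KiB size the API documentation mandates for TLB0 entries:

```c
#include <assert.h>
#include <stdint.h>

/* Extract TSIZE from MAS1 (bits 11:7), as get_tlb_size() does. */
static inline unsigned int tsize_of(uint32_t mas1)
{
	return (mas1 >> 7) & 0x1f;
}

/* Bytes mapped by an entry: 2^(10 + tsize), as get_tlb_bytes() does. */
static inline uint64_t bytes_of(uint32_t mas1)
{
	return 1ULL << 10 << tsize_of(mas1);
}
```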
@@ -156,7 +156,7 @@ static inline unsigned int get_tlb_esel_bit(
 }
 
 static inline int tlbe_is_host_safe(const struct kvm_vcpu *vcpu,
-			const struct tlbe *tlbe)
+			const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	gpa_t gpa;
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 0d843c6..55b4233 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -221,6 +221,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PPC_PAIRED_SINGLES:
 	case KVM_CAP_PPC_OSI:
 	case KVM_CAP_PPC_GET_PVINFO:
+#ifdef CONFIG_KVM_E500
+	case KVM_CAP_SW_TLB:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -601,6 +604,19 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		r = 0;
 		vcpu->arch.papr_enabled = true;
 		break;
+#ifdef CONFIG_KVM_E500
+	case KVM_CAP_SW_TLB: {
+		struct kvm_config_tlb cfg;
+		void __user *user_ptr = (void __user *)(uintptr_t)cap->args[0];
+
+		r = -EFAULT;
+		if (copy_from_user(&cfg, user_ptr, sizeof(cfg)))
+			break;
+
+		r = kvm_vcpu_ioctl_config_tlb(vcpu, &cfg);
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 		break;
@@ -650,6 +666,18 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
 		break;
 	}
+
+#ifdef CONFIG_KVM_E500
+	case KVM_DIRTY_TLB: {
+		struct kvm_dirty_tlb dirty;
+		r = -EFAULT;
+		if (copy_from_user(&dirty, argp, sizeof(dirty)))
+			goto out;
+		r = kvm_vcpu_ioctl_dirty_tlb(vcpu, &dirty);
+		break;
+	}
+#endif
+
 	default:
 		r = -EINVAL;
 	}
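The geometry checks `kvm_vcpu_ioctl_config_tlb()` performs before accepting a shared array can be summarized as a pure function. This is a sketch with hypothetical names, mirroring the `-EINVAL` paths in the patch (TLB1 fully associative and at most 64 entries, TLB2/TLB3 unused, TLB0 ways and sets both powers of two); the kernel uses `ilog2()` and a shift where this uses division, which is equivalent once the ways are known to be a power of two:

```c
#include <assert.h>
#include <stdint.h>

static int is_pow2(uint32_t x)
{
	return x && !(x & (x - 1));
}

/* Returns 1 iff the proposed geometry would pass the ioctl's checks. */
static int tlb_params_valid(const uint32_t sizes[4], const uint32_t ways[4])
{
	/* TLB1: at most 64 entries, fully associative. */
	if (sizes[1] > 64 || ways[1] != sizes[1])
		return 0;
	/* TLB2 and TLB3 are reserved. */
	if (sizes[2] || sizes[3] || ways[2] || ways[3])
		return 0;
	/* TLB0: power-of-two ways, power-of-two sets. */
	if (!is_pow2(ways[0]))
		return 0;
	return is_pow2(sizes[0] / ways[0]);
}
```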
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index f47fcd3..76ef719 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -557,6 +557,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
 #define KVM_CAP_PPC_HIOR 67
 #define KVM_CAP_PPC_PAPR 68
+#define KVM_CAP_SW_TLB 69
 #define KVM_CAP_S390_GMAP 71
 
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -637,6 +638,21 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+#define KVM_MMU_FSL_BOOKE_NOHV		0
+#define KVM_MMU_FSL_BOOKE_HV		1
+
+struct kvm_config_tlb {
+	__u64 params;
+	__u64 array;
+	__u32 mmu_type;
+	__u32 array_len;
+};
+
+struct kvm_dirty_tlb {
+	__u64 bitmap;
+	__u32 num_dirty;
+};
+
 /*
  * ioctls for VM fds
  */
@@ -763,6 +779,8 @@ struct kvm_clock_data {
 #define KVM_CREATE_SPAPR_TCE	  _IOW(KVMIO,  0xa8, struct kvm_create_spapr_tce)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
+/* Available with KVM_CAP_SW_TLB */
+#define KVM_DIRTY_TLB		  _IOW(KVMIO,  0xaa, struct kvm_dirty_tlb)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
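A sketch of how userspace might build the little-endian bitmap `struct kvm_dirty_tlb` expects: bit N is bit (N % 8) of byte (N / 8), and `num_dirty` must equal the number of set bits. The helper names are illustrative, not part of the API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mark guest TLB entry 'entry' dirty in the shared-array ordering. */
static void mark_dirty(uint8_t *bitmap, unsigned int entry)
{
	bitmap[entry / 8] |= (uint8_t)(1u << (entry % 8));
}

/* Recompute num_dirty: the count of set bits in the bitmap. */
static uint32_t count_dirty(const uint8_t *bitmap, unsigned int nbits)
{
	uint32_t n = 0;
	unsigned int i;

	for (i = 0; i < nbits; i++)
		n += (bitmap[i / 8] >> (i % 8)) & 1u;
	return n;
}
```

Because the format is byte-based little-endian, the same bitmap bytes work regardless of whether userspace and the kernel differ in word size.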
-- 
1.6.0.2



* [PATCH 04/14] KVM: PPC: e500: MMU API
@ 2011-10-31  7:53   ` Alexander Graf
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Scott Wood

From: Scott Wood <scottwood@freescale.com>

This implements a shared-memory API for giving host userspace access to
the guest's TLB.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 Documentation/virtual/kvm/api.txt   |   74 +++++++
 arch/powerpc/include/asm/kvm.h      |   35 +++
 arch/powerpc/include/asm/kvm_e500.h |   24 +--
 arch/powerpc/include/asm/kvm_ppc.h  |    5 +
 arch/powerpc/kvm/e500.c             |    5 +-
 arch/powerpc/kvm/e500_emulate.c     |   12 +-
 arch/powerpc/kvm/e500_tlb.c         |  393 ++++++++++++++++++++++++-----------
 arch/powerpc/kvm/e500_tlb.h         |   38 ++--
 arch/powerpc/kvm/powerpc.c          |   28 +++
 include/linux/kvm.h                 |   18 ++
 10 files changed, 469 insertions(+), 163 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 7945b0b..ab1136f 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1383,6 +1383,38 @@ The following flags are defined:
 If datamatch flag is set, the event will be signaled only if the written value
 to the registered address is equal to datamatch in struct kvm_ioeventfd.
 
+4.59 KVM_DIRTY_TLB
+
+Capability: KVM_CAP_SW_TLB
+Architectures: ppc
+Type: vcpu ioctl
+Parameters: struct kvm_dirty_tlb (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_dirty_tlb {
+	__u64 bitmap;
+	__u32 num_dirty;
+};
+
+This must be called whenever userspace has changed an entry in the shared
+TLB, prior to calling KVM_RUN on the associated vcpu.
+
+The "bitmap" field is the userspace address of an array.  This array
+consists of a number of bits, equal to the total number of TLB entries as
+determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
+nearest multiple of 64.
+
+Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
+array.
+
+The array is little-endian: bit 0 is the least significant bit of the
+first byte, bit 8 is the least significant bit of the second byte, etc.
+This avoids any complications with differing word sizes.
+
+The "num_dirty" field is a performance hint for KVM to determine whether it
+should skip processing the bitmap and just invalidate everything.  It must
+be set to the number of set bits in the bitmap.
+
 4.62 KVM_CREATE_SPAPR_TCE
 
 Capability: KVM_CAP_SPAPR_TCE
@@ -1700,3 +1732,45 @@ HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
 HTAB invisible to the guest.
 
 When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
+
+6.3 KVM_CAP_SW_TLB
+
+Architectures: ppc
+Parameters: args[0] is the address of a struct kvm_config_tlb
+Returns: 0 on success; -1 on error
+
+struct kvm_config_tlb {
+	__u64 params;
+	__u64 array;
+	__u32 mmu_type;
+	__u32 array_len;
+};
+
+Configures the virtual CPU's TLB array, establishing a shared memory area
+between userspace and KVM.  The "params" and "array" fields are userspace
+addresses of mmu-type-specific data structures.  The "array_len" field is a
+safety mechanism, and should be set to the size in bytes of the memory that
+userspace has reserved for the array.  It must be at least the size dictated
+by "mmu_type" and "params".
+
+While KVM_RUN is active, the shared region is under control of KVM.  Its
+contents are undefined, and any modification by userspace results in
+boundedly undefined behavior.
+
+On return from KVM_RUN, the shared region will reflect the current state of
+the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
+to tell KVM which entries have been changed, prior to calling KVM_RUN again
+on this vcpu.
+
+For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
+ - The "params" field is of type "struct kvm_book3e_206_tlb_params".
+ - The "array" field points to an array of type "struct
+   kvm_book3e_206_tlb_entry".
+ - The array consists of all entries in the first TLB, followed by all
+   entries in the second TLB.
+ - Within a TLB, entries are ordered first by increasing set number.  Within a
+   set, entries are ordered by way (increasing ESEL).
+ - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
+   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
+ - The tsize field of mas1 shall be set to 4K on TLB0, even though the
+   hardware ignores this value for TLB0.
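For illustration (not part of the patch; the function name is invented), the position of a TLB0 entry in the shared array follows directly from the ordering rules above:

```c
#include <stdint.h>

/* Index of a TLB0 entry in the shared array: entries are ordered by set
 * (from the documented hash on MAS2) and then by way (ESEL) within the
 * set.  Illustrative only -- not from the patch. */
static unsigned int tlb0_array_index(uint64_t mas2, unsigned int sets,
				     unsigned int ways, unsigned int way)
{
	unsigned int set = (mas2 >> 12) & (sets - 1);

	return set * ways + way;
}
```

TLB1 entries then follow in the array at offset tlb_sizes[0].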
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index 08fe69e..71684b9 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -300,4 +300,39 @@ struct kvm_allocate_rma {
 	__u64 rma_size;
 };
 
+struct kvm_book3e_206_tlb_entry {
+	__u32 mas8;
+	__u32 mas1;
+	__u64 mas2;
+	__u64 mas7_3;
+};
+
+struct kvm_book3e_206_tlb_params {
+	/*
+	 * For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
+	 *
+	 * - The number of ways of TLB0 must be a power of two between 2 and
+	 *   16.
+	 * - TLB1 must be fully associative.
+	 * - The size of TLB0 must be a multiple of the number of ways, and
+	 *   the number of sets must be a power of two.
+	 * - The size of TLB1 may not exceed 64 entries.
+	 * - TLB0 supports 4 KiB pages.
+	 * - The page sizes supported by TLB1 are as indicated by
+	 *   TLB1CFG (if MMUCFG[MAVN] = 0) or TLB1PS (if MMUCFG[MAVN] = 1)
+	 *   as returned by KVM_GET_SREGS.
+	 * - TLB2 and TLB3 are reserved, and their entries in tlb_sizes[]
+	 *   and tlb_ways[] must be zero.
+	 *
+	 * tlb_ways[n] == tlb_sizes[n] means the array is fully associative.
+	 *
+	 * KVM will adjust TLBnCFG based on the sizes configured here,
+	 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
+	 * set to zero.
+	 */
+	__u32 tlb_sizes[4];
+	__u32 tlb_ways[4];
+	__u32 reserved[8];
+};
+
 #endif /* __LINUX_KVM_POWERPC_H */
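As a rough userspace-side sketch (not from this patch; check_tlb_params is an invented name, and KVM performs its own validation), the geometry constraints in the comment above can be checked and the minimum array_len derived like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as the uapi struct above (24 bytes on common ABIs). */
struct kvm_book3e_206_tlb_entry {
	uint32_t mas8;
	uint32_t mas1;
	uint64_t mas2;
	uint64_t mas7_3;
};

static int is_pow2(uint32_t x)
{
	return x && !(x & (x - 1));
}

/* Returns 0 and the minimum array_len if the geometry satisfies the
 * constraints documented above; -1 otherwise.  Illustrative only. */
static int check_tlb_params(const uint32_t sizes[4], const uint32_t ways[4],
			    size_t *array_len)
{
	if (sizes[1] > 64 || ways[1] != sizes[1])
		return -1;
	if (sizes[2] || sizes[3] || ways[2] || ways[3])
		return -1;
	if (ways[0] < 2 || ways[0] > 16 || !is_pow2(ways[0]))
		return -1;
	if (sizes[0] % ways[0] || !is_pow2(sizes[0] / ways[0]))
		return -1;

	*array_len = (size_t)(sizes[0] + sizes[1]) *
		     sizeof(struct kvm_book3e_206_tlb_entry);
	return 0;
}
```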
diff --git a/arch/powerpc/include/asm/kvm_e500.h b/arch/powerpc/include/asm/kvm_e500.h
index a5197d8..bc17441 100644
--- a/arch/powerpc/include/asm/kvm_e500.h
+++ b/arch/powerpc/include/asm/kvm_e500.h
@@ -22,13 +22,6 @@
 #define E500_PID_NUM   3
 #define E500_TLB_NUM   2
 
-struct tlbe{
-	u32 mas1;
-	u32 mas2;
-	u32 mas3;
-	u32 mas7;
-};
-
 #define E500_TLB_VALID 1
 #define E500_TLB_DIRTY 2
 
@@ -48,13 +41,17 @@ struct kvmppc_e500_tlb_params {
 };
 
 struct kvmppc_vcpu_e500 {
-	/* Unmodified copy of the guest's TLB. */
-	struct tlbe *gtlb_arch[E500_TLB_NUM];
+	/* Unmodified copy of the guest's TLB -- shared with host userspace. */
+	struct kvm_book3e_206_tlb_entry *gtlb_arch;
+
+	/* Starting entry number in gtlb_arch[] */
+	int gtlb_offset[E500_TLB_NUM];
 
 	/* KVM internal information associated with each guest TLB entry */
 	struct tlbe_priv *gtlb_priv[E500_TLB_NUM];
 
-	unsigned int gtlb_size[E500_TLB_NUM];
+	struct kvmppc_e500_tlb_params gtlb_params[E500_TLB_NUM];
+
 	unsigned int gtlb_nv[E500_TLB_NUM];
 
 	/*
@@ -68,7 +65,6 @@ struct kvmppc_vcpu_e500 {
 	 * and back, and our host TLB entries got evicted).
 	 */
 	struct tlbe_ref *tlb_refs[E500_TLB_NUM];
-
 	unsigned int host_tlb1_nv;
 
 	u32 host_pid[E500_PID_NUM];
@@ -78,11 +74,10 @@ struct kvmppc_vcpu_e500 {
 	u32 mas0;
 	u32 mas1;
 	u32 mas2;
-	u32 mas3;
+	u64 mas7_3;
 	u32 mas4;
 	u32 mas5;
 	u32 mas6;
-	u32 mas7;
 
 	/* vcpu id table */
 	struct vcpu_id_table *idt;
@@ -95,6 +90,9 @@ struct kvmppc_vcpu_e500 {
 	u32 tlb1cfg;
 	u64 mcar;
 
+	struct page **shared_tlb_pages;
+	int num_shared_tlb_pages;
+
 	struct kvm_vcpu vcpu;
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 46efd1a..a284f20 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -193,4 +193,9 @@ static inline void kvm_rma_init(void)
 {}
 #endif
 
+int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
+			      struct kvm_config_tlb *cfg);
+int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
+			     struct kvm_dirty_tlb *cfg);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 26d2090..14d6e6e 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -120,7 +120,7 @@ void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	sregs->u.e.mas0 = vcpu_e500->mas0;
 	sregs->u.e.mas1 = vcpu_e500->mas1;
 	sregs->u.e.mas2 = vcpu_e500->mas2;
-	sregs->u.e.mas7_3 = ((u64)vcpu_e500->mas7 << 32) | vcpu_e500->mas3;
+	sregs->u.e.mas7_3 = vcpu_e500->mas7_3;
 	sregs->u.e.mas4 = vcpu_e500->mas4;
 	sregs->u.e.mas6 = vcpu_e500->mas6;
 
@@ -153,8 +153,7 @@ int kvmppc_core_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		vcpu_e500->mas0 = sregs->u.e.mas0;
 		vcpu_e500->mas1 = sregs->u.e.mas1;
 		vcpu_e500->mas2 = sregs->u.e.mas2;
-		vcpu_e500->mas7 = sregs->u.e.mas7_3 >> 32;
-		vcpu_e500->mas3 = (u32)sregs->u.e.mas7_3;
+		vcpu_e500->mas7_3 = sregs->u.e.mas7_3;
 		vcpu_e500->mas4 = sregs->u.e.mas4;
 		vcpu_e500->mas6 = sregs->u.e.mas6;
 	}
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index d48ae39..e0d3609 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -95,13 +95,17 @@ int kvmppc_core_emulate_mtspr(struct kvm_vcpu *vcpu, int sprn, int rs)
 	case SPRN_MAS2:
 		vcpu_e500->mas2 = spr_val; break;
 	case SPRN_MAS3:
-		vcpu_e500->mas3 = spr_val; break;
+		vcpu_e500->mas7_3 &= ~(u64)0xffffffff;
+		vcpu_e500->mas7_3 |= spr_val;
+		break;
 	case SPRN_MAS4:
 		vcpu_e500->mas4 = spr_val; break;
 	case SPRN_MAS6:
 		vcpu_e500->mas6 = spr_val; break;
 	case SPRN_MAS7:
-		vcpu_e500->mas7 = spr_val; break;
+		vcpu_e500->mas7_3 &= (u64)0xffffffff;
+		vcpu_e500->mas7_3 |= (u64)spr_val << 32;
+		break;
 	case SPRN_L1CSR0:
 		vcpu_e500->l1csr0 = spr_val;
 		vcpu_e500->l1csr0 &= ~(L1CSR0_DCFI | L1CSR0_CLFC);
@@ -158,13 +162,13 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, int rt)
 	case SPRN_MAS2:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas2); break;
 	case SPRN_MAS3:
-		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas3); break;
+		kvmppc_set_gpr(vcpu, rt, (u32)vcpu_e500->mas7_3); break;
 	case SPRN_MAS4:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas4); break;
 	case SPRN_MAS6:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas6); break;
 	case SPRN_MAS7:
-		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas7); break;
+		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas7_3 >> 32); break;
 
 	case SPRN_TLB0CFG:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->tlb0cfg); break;
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 59221bb..f19ae2f 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -19,6 +19,11 @@
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
 #include <linux/highmem.h>
+#include <linux/log2.h>
+#include <linux/uaccess.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <linux/vmalloc.h>
 #include <asm/kvm_ppc.h>
 #include <asm/kvm_e500.h>
 
@@ -66,6 +71,13 @@ static DEFINE_PER_CPU(unsigned long, pcpu_last_used_sid);
 
 static struct kvmppc_e500_tlb_params host_tlb_params[E500_TLB_NUM];
 
+static struct kvm_book3e_206_tlb_entry *get_entry(
+	struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel, int entry)
+{
+	int offset = vcpu_e500->gtlb_offset[tlbsel];
+	return &vcpu_e500->gtlb_arch[offset + entry];
+}
+
 /*
  * Allocate a free shadow id and setup a valid sid mapping in given entry.
  * A mapping is only valid when vcpu_id_table and pcpu_id_table are match.
@@ -217,34 +229,13 @@ void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *vcpu_e500)
 	preempt_enable();
 }
 
-void kvmppc_dump_tlbs(struct kvm_vcpu *vcpu)
-{
-	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *tlbe;
-	int i, tlbsel;
-
-	printk("| %8s | %8s | %8s | %8s | %8s |\n",
-			"nr", "mas1", "mas2", "mas3", "mas7");
-
-	for (tlbsel = 0; tlbsel < 2; tlbsel++) {
-		printk("Guest TLB%d:\n", tlbsel);
-		for (i = 0; i < vcpu_e500->gtlb_size[tlbsel]; i++) {
-			tlbe = &vcpu_e500->gtlb_arch[tlbsel][i];
-			if (tlbe->mas1 & MAS1_VALID)
-				printk(" G[%d][%3d] |  %08X | %08X | %08X | %08X |\n",
-					tlbsel, i, tlbe->mas1, tlbe->mas2,
-					tlbe->mas3, tlbe->mas7);
-		}
-	}
-}
-
 static inline unsigned int gtlb0_get_next_victim(
 		struct kvmppc_vcpu_e500 *vcpu_e500)
 {
 	unsigned int victim;
 
 	victim = vcpu_e500->gtlb_nv[0]++;
-	if (unlikely(vcpu_e500->gtlb_nv[0] >= KVM_E500_TLB0_WAY_NUM))
+	if (unlikely(vcpu_e500->gtlb_nv[0] >= vcpu_e500->gtlb_params[0].ways))
 		vcpu_e500->gtlb_nv[0] = 0;
 
 	return victim;
@@ -256,9 +247,9 @@ static inline unsigned int tlb1_max_shadow_size(void)
 	return host_tlb_params[1].entries - tlbcam_index - 1;
 }
 
-static inline int tlbe_is_writable(struct tlbe *tlbe)
+static inline int tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
 {
-	return tlbe->mas3 & (MAS3_SW|MAS3_UW);
+	return tlbe->mas7_3 & (MAS3_SW|MAS3_UW);
 }
 
 static inline u32 e500_shadow_mas3_attrib(u32 mas3, int usermode)
@@ -289,39 +280,41 @@ static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
 /*
  * writing shadow tlb entry to host TLB
  */
-static inline void __write_host_tlbe(struct tlbe *stlbe, uint32_t mas0)
+static inline void __write_host_tlbe(struct kvm_book3e_206_tlb_entry *stlbe,
+				     uint32_t mas0)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
 	mtspr(SPRN_MAS0, mas0);
 	mtspr(SPRN_MAS1, stlbe->mas1);
-	mtspr(SPRN_MAS2, stlbe->mas2);
-	mtspr(SPRN_MAS3, stlbe->mas3);
-	mtspr(SPRN_MAS7, stlbe->mas7);
+	mtspr(SPRN_MAS2, (unsigned long)stlbe->mas2);
+	mtspr(SPRN_MAS3, (u32)stlbe->mas7_3);
+	mtspr(SPRN_MAS7, (u32)(stlbe->mas7_3 >> 32));
 	asm volatile("isync; tlbwe" : : : "memory");
 	local_irq_restore(flags);
 }
 
 /* esel is index into set, not whole array */
 static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-		int tlbsel, int esel, struct tlbe *stlbe)
+		int tlbsel, int esel, struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	if (tlbsel == 0) {
-		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(esel));
+		int way = esel & (vcpu_e500->gtlb_params[0].ways - 1);
+		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(way));
 	} else {
 		__write_host_tlbe(stlbe,
 				  MAS0_TLBSEL(1) |
 				  MAS0_ESEL(to_htlb1_esel(esel)));
 	}
 	trace_kvm_stlb_write(index_of(tlbsel, esel), stlbe->mas1, stlbe->mas2,
-			     stlbe->mas3, stlbe->mas7);
+			     (u32)stlbe->mas7_3, (u32)(stlbe->mas7_3 >> 32));
 }
 
 void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe magic;
+	struct kvm_book3e_206_tlb_entry magic;
 	ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
 	unsigned int stid;
 	pfn_t pfn;
@@ -335,9 +328,8 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 	magic.mas1 = MAS1_VALID | MAS1_TS | MAS1_TID(stid) |
 		     MAS1_TSIZE(BOOK3E_PAGESZ_4K);
 	magic.mas2 = vcpu->arch.magic_page_ea | MAS2_M;
-	magic.mas3 = (pfn << PAGE_SHIFT) |
-		     MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
-	magic.mas7 = pfn >> (32 - PAGE_SHIFT);
+	magic.mas7_3 = ((u64)pfn << PAGE_SHIFT) |
+		       MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
 
 	__write_host_tlbe(&magic, MAS0_TLBSEL(1) | MAS0_ESEL(tlbcam_index));
 	preempt_enable();
@@ -358,7 +350,8 @@ void kvmppc_e500_tlb_put(struct kvm_vcpu *vcpu)
 static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500,
 				int tlbsel, int esel)
 {
-	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	struct kvm_book3e_206_tlb_entry *gtlbe =
+		get_entry(vcpu_e500, tlbsel, esel);
 	struct vcpu_id_table *idt = vcpu_e500->idt;
 	unsigned int pr, tid, ts, pid;
 	u32 val, eaddr;
@@ -424,9 +417,8 @@ static int tlb0_set_base(gva_t addr, int sets, int ways)
 
 static int gtlb0_set_base(struct kvmppc_vcpu_e500 *vcpu_e500, gva_t addr)
 {
-	int sets = KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
-
-	return tlb0_set_base(addr, sets, KVM_E500_TLB0_WAY_NUM);
+	return tlb0_set_base(addr, vcpu_e500->gtlb_params[0].sets,
+			     vcpu_e500->gtlb_params[0].ways);
 }
 
 static int htlb0_set_base(gva_t addr)
@@ -440,10 +432,10 @@ static unsigned int get_tlb_esel(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel)
 	unsigned int esel = get_tlb_esel_bit(vcpu_e500);
 
 	if (tlbsel == 0) {
-		esel &= KVM_E500_TLB0_WAY_NUM_MASK;
+		esel &= vcpu_e500->gtlb_params[0].ways - 1;
 		esel += gtlb0_set_base(vcpu_e500, vcpu_e500->mas2);
 	} else {
-		esel &= vcpu_e500->gtlb_size[tlbsel] - 1;
+		esel &= vcpu_e500->gtlb_params[tlbsel].entries - 1;
 	}
 
 	return esel;
@@ -453,19 +445,22 @@ static unsigned int get_tlb_esel(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel)
 static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 		gva_t eaddr, int tlbsel, unsigned int pid, int as)
 {
-	int size = vcpu_e500->gtlb_size[tlbsel];
-	unsigned int set_base;
+	int size = vcpu_e500->gtlb_params[tlbsel].entries;
+	unsigned int set_base, offset;
 	int i;
 
 	if (tlbsel == 0) {
 		set_base = gtlb0_set_base(vcpu_e500, eaddr);
-		size = KVM_E500_TLB0_WAY_NUM;
+		size = vcpu_e500->gtlb_params[0].ways;
 	} else {
 		set_base = 0;
 	}
 
+	offset = vcpu_e500->gtlb_offset[tlbsel];
+
 	for (i = 0; i < size; i++) {
-		struct tlbe *tlbe = &vcpu_e500->gtlb_arch[tlbsel][set_base + i];
+		struct kvm_book3e_206_tlb_entry *tlbe =
+			&vcpu_e500->gtlb_arch[offset + set_base + i];
 		unsigned int tid;
 
 		if (eaddr < get_tlb_eaddr(tlbe))
@@ -491,7 +486,7 @@ static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 }
 
 static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
-					 struct tlbe *gtlbe,
+					 struct kvm_book3e_206_tlb_entry *gtlbe,
 					 pfn_t pfn)
 {
 	ref->pfn = pfn;
@@ -518,7 +513,7 @@ static void clear_tlb_privs(struct kvmppc_vcpu_e500 *vcpu_e500)
 	int tlbsel = 0;
 	int i;
 
-	for (i = 0; i < vcpu_e500->gtlb_size[tlbsel]; i++) {
+	for (i = 0; i < vcpu_e500->gtlb_params[tlbsel].entries; i++) {
 		struct tlbe_ref *ref =
 			&vcpu_e500->gtlb_priv[tlbsel][i].ref;
 		kvmppc_e500_ref_release(ref);
@@ -530,6 +525,8 @@ static void clear_tlb_refs(struct kvmppc_vcpu_e500 *vcpu_e500)
 	int stlbsel = 1;
 	int i;
 
+	kvmppc_e500_id_table_reset_all(vcpu_e500);
+
 	for (i = 0; i < host_tlb_params[stlbsel].entries; i++) {
 		struct tlbe_ref *ref =
 			&vcpu_e500->tlb_refs[stlbsel][i];
@@ -559,18 +556,18 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 		| MAS1_TSIZE(tsized);
 	vcpu_e500->mas2 = (eaddr & MAS2_EPN)
 		| (vcpu_e500->mas4 & MAS2_ATTRIB_MASK);
-	vcpu_e500->mas3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
+	vcpu_e500->mas7_3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
 	vcpu_e500->mas6 = (vcpu_e500->mas6 & MAS6_SPID1)
 		| (get_cur_pid(vcpu) << 16)
 		| (as ? MAS6_SAS : 0);
-	vcpu_e500->mas7 = 0;
 }
 
 /* TID must be supplied by the caller */
-static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-					   struct tlbe *gtlbe, int tsize,
-					   struct tlbe_ref *ref,
-					   u64 gvaddr, struct tlbe *stlbe)
+static inline void kvmppc_e500_setup_stlbe(
+	struct kvmppc_vcpu_e500 *vcpu_e500,
+	struct kvm_book3e_206_tlb_entry *gtlbe,
+	int tsize, struct tlbe_ref *ref, u64 gvaddr,
+	struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	pfn_t pfn = ref->pfn;
 
@@ -581,16 +578,16 @@ static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 	stlbe->mas2 = (gvaddr & MAS2_EPN)
 		| e500_shadow_mas2_attrib(gtlbe->mas2,
 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
-	stlbe->mas3 = ((pfn << PAGE_SHIFT) & MAS3_RPN)
-		| e500_shadow_mas3_attrib(gtlbe->mas3,
+	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT)
+		| e500_shadow_mas3_attrib(gtlbe->mas7_3,
 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
-	stlbe->mas7 = (pfn >> (32 - PAGE_SHIFT)) & MAS7_RPN;
 }
 
 /* sesel is an index into the entire array, not just the set */
 static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-	u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, int tlbsel, int sesel,
-	struct tlbe *stlbe, struct tlbe_ref *ref)
+	u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
+	int tlbsel, int sesel, struct kvm_book3e_206_tlb_entry *stlbe,
+	struct tlbe_ref *ref)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long pfn, hva;
@@ -700,15 +697,16 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 
 /* XXX only map the one-one case, for now use TLB0 */
 static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-				int esel, struct tlbe *stlbe)
+				int esel,
+				struct kvm_book3e_206_tlb_entry *stlbe)
 {
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 	struct tlbe_ref *ref;
 	int sesel = esel & (host_tlb_params[0].ways - 1);
 	int sesel_base;
 	gva_t ea;
 
-	gtlbe = &vcpu_e500->gtlb_arch[0][esel];
+	gtlbe = get_entry(vcpu_e500, 0, esel);
 	ref = &vcpu_e500->gtlb_priv[0][esel].ref;
 
 	ea = get_tlb_eaddr(gtlbe);
@@ -725,7 +723,8 @@ static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
  * the shadow TLB. */
 /* XXX for both one-one and one-to-many , for now use TLB1 */
 static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-		u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, struct tlbe *stlbe)
+		u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
+		struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	struct tlbe_ref *ref;
 	unsigned int victim;
@@ -754,7 +753,8 @@ static inline int kvmppc_e500_gtlbe_invalidate(
 				struct kvmppc_vcpu_e500 *vcpu_e500,
 				int tlbsel, int esel)
 {
-	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	struct kvm_book3e_206_tlb_entry *gtlbe =
+		get_entry(vcpu_e500, tlbsel, esel);
 
 	if (unlikely(get_tlb_iprot(gtlbe)))
 		return -1;
@@ -769,10 +769,10 @@ int kvmppc_e500_emul_mt_mmucsr0(struct kvmppc_vcpu_e500 *vcpu_e500, ulong value)
 	int esel;
 
 	if (value & MMUCSR0_TLB0FI)
-		for (esel = 0; esel < vcpu_e500->gtlb_size[0]; esel++)
+		for (esel = 0; esel < vcpu_e500->gtlb_params[0].entries; esel++)
 			kvmppc_e500_gtlbe_invalidate(vcpu_e500, 0, esel);
 	if (value & MMUCSR0_TLB1FI)
-		for (esel = 0; esel < vcpu_e500->gtlb_size[1]; esel++)
+		for (esel = 0; esel < vcpu_e500->gtlb_params[1].entries; esel++)
 			kvmppc_e500_gtlbe_invalidate(vcpu_e500, 1, esel);
 
 	/* Invalidate all vcpu id mappings */
@@ -797,7 +797,8 @@ int kvmppc_e500_emul_tlbivax(struct kvm_vcpu *vcpu, int ra, int rb)
 
 	if (ia) {
 		/* invalidate all entries */
-		for (esel = 0; esel < vcpu_e500->gtlb_size[tlbsel]; esel++)
+		for (esel = 0; esel < vcpu_e500->gtlb_params[tlbsel].entries;
+		     esel++)
 			kvmppc_e500_gtlbe_invalidate(vcpu_e500, tlbsel, esel);
 	} else {
 		ea &= 0xfffff000;
@@ -817,18 +818,17 @@ int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
 	int tlbsel, esel;
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 
 	tlbsel = get_tlb_tlbsel(vcpu_e500);
 	esel = get_tlb_esel(vcpu_e500, tlbsel);
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 	vcpu_e500->mas0 &= ~MAS0_NV(~0);
 	vcpu_e500->mas0 |= MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
 	vcpu_e500->mas1 = gtlbe->mas1;
 	vcpu_e500->mas2 = gtlbe->mas2;
-	vcpu_e500->mas3 = gtlbe->mas3;
-	vcpu_e500->mas7 = gtlbe->mas7;
+	vcpu_e500->mas7_3 = gtlbe->mas7_3;
 
 	return EMULATE_DONE;
 }
@@ -839,7 +839,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 	int as = !!get_cur_sas(vcpu_e500);
 	unsigned int pid = get_cur_spid(vcpu_e500);
 	int esel, tlbsel;
-	struct tlbe *gtlbe = NULL;
+	struct kvm_book3e_206_tlb_entry *gtlbe = NULL;
 	gva_t ea;
 
 	ea = kvmppc_get_gpr(vcpu, rb);
@@ -847,7 +847,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 	for (tlbsel = 0; tlbsel < 2; tlbsel++) {
 		esel = kvmppc_e500_tlb_index(vcpu_e500, ea, tlbsel, pid, as);
 		if (esel >= 0) {
-			gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+			gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 			break;
 		}
 	}
@@ -857,8 +857,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 			| MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
 		vcpu_e500->mas1 = gtlbe->mas1;
 		vcpu_e500->mas2 = gtlbe->mas2;
-		vcpu_e500->mas3 = gtlbe->mas3;
-		vcpu_e500->mas7 = gtlbe->mas7;
+		vcpu_e500->mas7_3 = gtlbe->mas7_3;
 	} else {
 		int victim;
 
@@ -873,8 +872,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 			| (vcpu_e500->mas4 & MAS4_TSIZED(~0));
 		vcpu_e500->mas2 &= MAS2_EPN;
 		vcpu_e500->mas2 |= vcpu_e500->mas4 & MAS2_ATTRIB_MASK;
-		vcpu_e500->mas3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
-		vcpu_e500->mas7 = 0;
+		vcpu_e500->mas7_3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
 	}
 
 	kvmppc_set_exit_type(vcpu, EMULATED_TLBSX_EXITS);
@@ -883,8 +881,8 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 
 /* sesel is index into the set, not the whole array */
 static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-			struct tlbe *gtlbe,
-			struct tlbe *stlbe,
+			struct kvm_book3e_206_tlb_entry *gtlbe,
+			struct kvm_book3e_206_tlb_entry *stlbe,
 			int stlbsel, int sesel)
 {
 	int stid;
@@ -902,28 +900,27 @@ static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 	int tlbsel, esel;
 
 	tlbsel = get_tlb_tlbsel(vcpu_e500);
 	esel = get_tlb_esel(vcpu_e500, tlbsel);
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 
 	if (get_tlb_v(gtlbe))
 		inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
 
 	gtlbe->mas1 = vcpu_e500->mas1;
 	gtlbe->mas2 = vcpu_e500->mas2;
-	gtlbe->mas3 = vcpu_e500->mas3;
-	gtlbe->mas7 = vcpu_e500->mas7;
+	gtlbe->mas7_3 = vcpu_e500->mas7_3;
 
 	trace_kvm_gtlb_write(vcpu_e500->mas0, gtlbe->mas1, gtlbe->mas2,
-			     gtlbe->mas3, gtlbe->mas7);
+			     (u32)gtlbe->mas7_3, (u32)(gtlbe->mas7_3 >> 32));
 
 	/* Invalidate shadow mappings for the about-to-be-clobbered TLBE. */
 	if (tlbe_is_host_safe(vcpu, gtlbe)) {
-		struct tlbe stlbe;
+		struct kvm_book3e_206_tlb_entry stlbe;
 		int stlbsel, sesel;
 		u64 eaddr;
 		u64 raddr;
@@ -996,9 +993,11 @@ gpa_t kvmppc_mmu_xlate(struct kvm_vcpu *vcpu, unsigned int index,
 			gva_t eaddr)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *gtlbe =
-		&vcpu_e500->gtlb_arch[tlbsel_of(index)][esel_of(index)];
-	u64 pgmask = get_tlb_bytes(gtlbe) - 1;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
+	u64 pgmask;
+
+	gtlbe = get_entry(vcpu_e500, tlbsel_of(index), esel_of(index));
+	pgmask = get_tlb_bytes(gtlbe) - 1;
 
 	return get_tlb_raddr(gtlbe) | (eaddr & pgmask);
 }
@@ -1012,12 +1011,12 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
 	struct tlbe_priv *priv;
-	struct tlbe *gtlbe, stlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe, stlbe;
 	int tlbsel = tlbsel_of(index);
 	int esel = esel_of(index);
 	int stlbsel, sesel;
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 
 	switch (tlbsel) {
 	case 0:
@@ -1073,25 +1072,174 @@ void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid)
 
 void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	struct tlbe *tlbe;
+	struct kvm_book3e_206_tlb_entry *tlbe;
 
 	/* Insert large initial mapping for guest. */
-	tlbe = &vcpu_e500->gtlb_arch[1][0];
+	tlbe = get_entry(vcpu_e500, 1, 0);
 	tlbe->mas1 = MAS1_VALID | MAS1_TSIZE(BOOK3E_PAGESZ_256M);
 	tlbe->mas2 = 0;
-	tlbe->mas3 = E500_TLB_SUPER_PERM_MASK;
-	tlbe->mas7 = 0;
+	tlbe->mas7_3 = E500_TLB_SUPER_PERM_MASK;
 
 	/* 4K map for serial output. Used by kernel wrapper. */
-	tlbe = &vcpu_e500->gtlb_arch[1][1];
+	tlbe = get_entry(vcpu_e500, 1, 1);
 	tlbe->mas1 = MAS1_VALID | MAS1_TSIZE(BOOK3E_PAGESZ_4K);
 	tlbe->mas2 = (0xe0004500 & 0xFFFFF000) | MAS2_I | MAS2_G;
-	tlbe->mas3 = (0xe0004500 & 0xFFFFF000) | E500_TLB_SUPER_PERM_MASK;
-	tlbe->mas7 = 0;
+	tlbe->mas7_3 = (0xe0004500 & 0xFFFFF000) | E500_TLB_SUPER_PERM_MASK;
+}
+
+static void free_gtlb(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+	int i;
+
+	clear_tlb_refs(vcpu_e500);
+	kfree(vcpu_e500->gtlb_priv[0]);
+	kfree(vcpu_e500->gtlb_priv[1]);
+
+	if (vcpu_e500->shared_tlb_pages) {
+		vfree((void *)(round_down((uintptr_t)vcpu_e500->gtlb_arch,
+					  PAGE_SIZE)));
+
+		for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++) {
+			set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
+			put_page(vcpu_e500->shared_tlb_pages[i]);
+		}
+
+		vcpu_e500->num_shared_tlb_pages = 0;
+		vcpu_e500->shared_tlb_pages = NULL;
+	} else {
+		kfree(vcpu_e500->gtlb_arch);
+	}
+
+	vcpu_e500->gtlb_arch = NULL;
+}
+
+int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
+			      struct kvm_config_tlb *cfg)
+{
+	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+	struct kvm_book3e_206_tlb_params params;
+	char *virt;
+	struct page **pages;
+	struct tlbe_priv *privs[2] = {};
+	size_t array_len;
+	u32 sets;
+	int num_pages, ret, i;
+
+	if (cfg->mmu_type != KVM_MMU_FSL_BOOKE_NOHV)
+		return -EINVAL;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)cfg->params,
+			   sizeof(params)))
+		return -EFAULT;
+
+	if (params.tlb_sizes[1] > 64)
+		return -EINVAL;
+	if (params.tlb_ways[1] != params.tlb_sizes[1])
+		return -EINVAL;
+	if (params.tlb_sizes[2] != 0 || params.tlb_sizes[3] != 0)
+		return -EINVAL;
+	if (params.tlb_ways[2] != 0 || params.tlb_ways[3] != 0)
+		return -EINVAL;
+
+	if (!is_power_of_2(params.tlb_ways[0]))
+		return -EINVAL;
+
+	sets = params.tlb_sizes[0] >> ilog2(params.tlb_ways[0]);
+	if (!is_power_of_2(sets))
+		return -EINVAL;
+
+	array_len = params.tlb_sizes[0] + params.tlb_sizes[1];
+	array_len *= sizeof(struct kvm_book3e_206_tlb_entry);
+
+	if (cfg->array_len < array_len)
+		return -EINVAL;
+
+	num_pages = DIV_ROUND_UP(cfg->array + array_len - 1, PAGE_SIZE) -
+		    cfg->array / PAGE_SIZE;
+	pages = kmalloc(sizeof(struct page *) * num_pages, GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	ret = get_user_pages_fast(cfg->array, num_pages, 1, pages);
+	if (ret < 0)
+		goto err_pages;
+
+	if (ret != num_pages) {
+		num_pages = ret;
+		ret = -EFAULT;
+		goto err_put_page;
+	}
+
+	virt = vmap(pages, num_pages, VM_MAP, PAGE_KERNEL);
+	if (!virt)
+		goto err_put_page;
+
+	privs[0] = kzalloc(sizeof(struct tlbe_priv) * params.tlb_sizes[0],
+			   GFP_KERNEL);
+	privs[1] = kzalloc(sizeof(struct tlbe_priv) * params.tlb_sizes[1],
+			   GFP_KERNEL);
+
+	if (!privs[0] || !privs[1])
+		goto err_put_page;
+
+	free_gtlb(vcpu_e500);
+
+	vcpu_e500->gtlb_priv[0] = privs[0];
+	vcpu_e500->gtlb_priv[1] = privs[1];
+
+	vcpu_e500->gtlb_arch = (struct kvm_book3e_206_tlb_entry *)
+		(virt + (cfg->array & (PAGE_SIZE - 1)));
+
+	vcpu_e500->gtlb_params[0].entries = params.tlb_sizes[0];
+	vcpu_e500->gtlb_params[1].entries = params.tlb_sizes[1];
+
+	vcpu_e500->gtlb_offset[0] = 0;
+	vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
+
+	vcpu_e500->tlb0cfg = mfspr(SPRN_TLB0CFG) & ~0xfffUL;
+	if (params.tlb_sizes[0] <= 2048)
+		vcpu_e500->tlb0cfg |= params.tlb_sizes[0];
+
+	vcpu_e500->tlb1cfg = mfspr(SPRN_TLB1CFG) & ~0xfffUL;
+	vcpu_e500->tlb1cfg |= params.tlb_sizes[1];
+
+	vcpu_e500->shared_tlb_pages = pages;
+	vcpu_e500->num_shared_tlb_pages = num_pages;
+
+	vcpu_e500->gtlb_params[0].ways = params.tlb_ways[0];
+	vcpu_e500->gtlb_params[0].sets = sets;
+
+	vcpu_e500->gtlb_params[1].ways = params.tlb_sizes[1];
+	vcpu_e500->gtlb_params[1].sets = 1;
+
+	return 0;
+
+err_put_page:
+	kfree(privs[0]);
+	kfree(privs[1]);
+
+	for (i = 0; i < num_pages; i++)
+		put_page(pages[i]);
+
+err_pages:
+	kfree(pages);
+	return ret;
+}
+
+int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
+			     struct kvm_dirty_tlb *dirty)
+{
+	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+
+	clear_tlb_refs(vcpu_e500);
+	return 0;
 }
 
 int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
+	int entry_size = sizeof(struct kvm_book3e_206_tlb_entry);
+	int entries = KVM_E500_TLB0_SIZE + KVM_E500_TLB1_SIZE;
+
 	host_tlb_params[0].entries = mfspr(SPRN_TLB0CFG) & TLBnCFG_N_ENTRY;
 	host_tlb_params[1].entries = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY;
 
@@ -1124,17 +1272,22 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 		host_tlb_params[0].entries / host_tlb_params[0].ways;
 	host_tlb_params[1].sets = 1;
 
-	vcpu_e500->gtlb_size[0] = KVM_E500_TLB0_SIZE;
-	vcpu_e500->gtlb_arch[0] =
-		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB0_SIZE, GFP_KERNEL);
-	if (vcpu_e500->gtlb_arch[0] == NULL)
-		goto err;
+	vcpu_e500->gtlb_params[0].entries = KVM_E500_TLB0_SIZE;
+	vcpu_e500->gtlb_params[1].entries = KVM_E500_TLB1_SIZE;
 
-	vcpu_e500->gtlb_size[1] = KVM_E500_TLB1_SIZE;
-	vcpu_e500->gtlb_arch[1] =
-		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB1_SIZE, GFP_KERNEL);
-	if (vcpu_e500->gtlb_arch[1] == NULL)
-		goto err;
+	vcpu_e500->gtlb_params[0].ways = KVM_E500_TLB0_WAY_NUM;
+	vcpu_e500->gtlb_params[0].sets =
+		KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
+
+	vcpu_e500->gtlb_params[1].ways = KVM_E500_TLB1_SIZE;
+	vcpu_e500->gtlb_params[1].sets = 1;
+
+	vcpu_e500->gtlb_arch = kmalloc(entries * entry_size, GFP_KERNEL);
+	if (!vcpu_e500->gtlb_arch)
+		return -ENOMEM;
+
+	vcpu_e500->gtlb_offset[0] = 0;
+	vcpu_e500->gtlb_offset[1] = KVM_E500_TLB0_SIZE;
 
 	vcpu_e500->tlb_refs[0] =
 		kzalloc(sizeof(struct tlbe_ref) * host_tlb_params[0].entries,
@@ -1148,15 +1301,15 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 	if (!vcpu_e500->tlb_refs[1])
 		goto err;
 
-	vcpu_e500->gtlb_priv[0] =
-		kzalloc(sizeof(struct tlbe_ref) * vcpu_e500->gtlb_size[0],
-			GFP_KERNEL);
+	vcpu_e500->gtlb_priv[0] = kzalloc(sizeof(struct tlbe_ref) *
+					  vcpu_e500->gtlb_params[0].entries,
+					  GFP_KERNEL);
 	if (!vcpu_e500->gtlb_priv[0])
 		goto err;
 
-	vcpu_e500->gtlb_priv[1] =
-		kzalloc(sizeof(struct tlbe_ref) * vcpu_e500->gtlb_size[1],
-			GFP_KERNEL);
+	vcpu_e500->gtlb_priv[1] = kzalloc(sizeof(struct tlbe_ref) *
+					  vcpu_e500->gtlb_params[1].entries,
+					  GFP_KERNEL);
 	if (!vcpu_e500->gtlb_priv[1])
 		goto err;
 
@@ -1165,32 +1318,24 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 
 	/* Init TLB configuration register */
 	vcpu_e500->tlb0cfg = mfspr(SPRN_TLB0CFG) & ~0xfffUL;
-	vcpu_e500->tlb0cfg |= vcpu_e500->gtlb_size[0];
+	vcpu_e500->tlb0cfg |= vcpu_e500->gtlb_params[0].entries;
 	vcpu_e500->tlb1cfg = mfspr(SPRN_TLB1CFG) & ~0xfffUL;
-	vcpu_e500->tlb1cfg |= vcpu_e500->gtlb_size[1];
+	vcpu_e500->tlb1cfg |= vcpu_e500->gtlb_params[1].entries;
 
 	return 0;
 
 err:
+	free_gtlb(vcpu_e500);
 	kfree(vcpu_e500->tlb_refs[0]);
 	kfree(vcpu_e500->tlb_refs[1]);
-	kfree(vcpu_e500->gtlb_priv[0]);
-	kfree(vcpu_e500->gtlb_priv[1]);
-	kfree(vcpu_e500->gtlb_arch[0]);
-	kfree(vcpu_e500->gtlb_arch[1]);
 	return -1;
 }
 
 void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	clear_tlb_refs(vcpu_e500);
-
+	free_gtlb(vcpu_e500);
 	kvmppc_e500_id_table_free(vcpu_e500);
 
 	kfree(vcpu_e500->tlb_refs[0]);
 	kfree(vcpu_e500->tlb_refs[1]);
-	kfree(vcpu_e500->gtlb_priv[0]);
-	kfree(vcpu_e500->gtlb_priv[1]);
-	kfree(vcpu_e500->gtlb_arch[1]);
-	kfree(vcpu_e500->gtlb_arch[0]);
 }
diff --git a/arch/powerpc/kvm/e500_tlb.h b/arch/powerpc/kvm/e500_tlb.h
index b587f69..2c29640 100644
--- a/arch/powerpc/kvm/e500_tlb.h
+++ b/arch/powerpc/kvm/e500_tlb.h
@@ -20,13 +20,9 @@
 #include <asm/tlb.h>
 #include <asm/kvm_e500.h>
 
-#define KVM_E500_TLB0_WAY_SIZE_BIT	7	/* Fixed */
-#define KVM_E500_TLB0_WAY_SIZE		(1UL << KVM_E500_TLB0_WAY_SIZE_BIT)
-#define KVM_E500_TLB0_WAY_SIZE_MASK	(KVM_E500_TLB0_WAY_SIZE - 1)
-
-#define KVM_E500_TLB0_WAY_NUM_BIT	1	/* No greater than 7 */
-#define KVM_E500_TLB0_WAY_NUM		(1UL << KVM_E500_TLB0_WAY_NUM_BIT)
-#define KVM_E500_TLB0_WAY_NUM_MASK	(KVM_E500_TLB0_WAY_NUM - 1)
+/* This geometry is the legacy default -- can be overridden by userspace */
+#define KVM_E500_TLB0_WAY_SIZE		128
+#define KVM_E500_TLB0_WAY_NUM		2
 
 #define KVM_E500_TLB0_SIZE  (KVM_E500_TLB0_WAY_SIZE * KVM_E500_TLB0_WAY_NUM)
 #define KVM_E500_TLB1_SIZE  16
@@ -58,50 +54,54 @@ extern void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *);
 extern void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *);
 
 /* TLB helper functions */
-static inline unsigned int get_tlb_size(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_size(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 7) & 0x1f;
 }
 
-static inline gva_t get_tlb_eaddr(const struct tlbe *tlbe)
+static inline gva_t get_tlb_eaddr(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return tlbe->mas2 & 0xfffff000;
 }
 
-static inline u64 get_tlb_bytes(const struct tlbe *tlbe)
+static inline u64 get_tlb_bytes(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	unsigned int pgsize = get_tlb_size(tlbe);
 	return 1ULL << 10 << pgsize;
 }
 
-static inline gva_t get_tlb_end(const struct tlbe *tlbe)
+static inline gva_t get_tlb_end(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	u64 bytes = get_tlb_bytes(tlbe);
 	return get_tlb_eaddr(tlbe) + bytes - 1;
 }
 
-static inline u64 get_tlb_raddr(const struct tlbe *tlbe)
+static inline u64 get_tlb_raddr(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
-	u64 rpn = tlbe->mas7;
-	return (rpn << 32) | (tlbe->mas3 & 0xfffff000);
+	return tlbe->mas7_3 & ~0xfffULL;
 }
 
-static inline unsigned int get_tlb_tid(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_tid(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 16) & 0xff;
 }
 
-static inline unsigned int get_tlb_ts(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_ts(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 12) & 0x1;
 }
 
-static inline unsigned int get_tlb_v(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_v(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 31) & 0x1;
 }
 
-static inline unsigned int get_tlb_iprot(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_iprot(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 30) & 0x1;
 }
@@ -156,7 +156,7 @@ static inline unsigned int get_tlb_esel_bit(
 }
 
 static inline int tlbe_is_host_safe(const struct kvm_vcpu *vcpu,
-			const struct tlbe *tlbe)
+			const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	gpa_t gpa;
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 0d843c6..55b4233 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -221,6 +221,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PPC_PAIRED_SINGLES:
 	case KVM_CAP_PPC_OSI:
 	case KVM_CAP_PPC_GET_PVINFO:
+#ifdef CONFIG_KVM_E500
+	case KVM_CAP_SW_TLB:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -601,6 +604,19 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		r = 0;
 		vcpu->arch.papr_enabled = true;
 		break;
+#ifdef CONFIG_KVM_E500
+	case KVM_CAP_SW_TLB: {
+		struct kvm_config_tlb cfg;
+		void __user *user_ptr = (void __user *)(uintptr_t)cap->args[0];
+
+		r = -EFAULT;
+		if (copy_from_user(&cfg, user_ptr, sizeof(cfg)))
+			break;
+
+		r = kvm_vcpu_ioctl_config_tlb(vcpu, &cfg);
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 		break;
@@ -650,6 +666,18 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
 		break;
 	}
+
+#ifdef CONFIG_KVM_E500
+	case KVM_DIRTY_TLB: {
+		struct kvm_dirty_tlb dirty;
+		r = -EFAULT;
+		if (copy_from_user(&dirty, argp, sizeof(dirty)))
+			goto out;
+		r = kvm_vcpu_ioctl_dirty_tlb(vcpu, &dirty);
+		break;
+	}
+#endif
+
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index f47fcd3..76ef719 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -557,6 +557,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
 #define KVM_CAP_PPC_HIOR 67
 #define KVM_CAP_PPC_PAPR 68
+#define KVM_CAP_SW_TLB 69
 #define KVM_CAP_S390_GMAP 71
 
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -637,6 +638,21 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+#define KVM_MMU_FSL_BOOKE_NOHV		0
+#define KVM_MMU_FSL_BOOKE_HV		1
+
+struct kvm_config_tlb {
+	__u64 params;
+	__u64 array;
+	__u32 mmu_type;
+	__u32 array_len;
+};
+
+struct kvm_dirty_tlb {
+	__u64 bitmap;
+	__u32 num_dirty;
+};
+
 /*
  * ioctls for VM fds
  */
@@ -763,6 +779,8 @@ struct kvm_clock_data {
 #define KVM_CREATE_SPAPR_TCE	  _IOW(KVMIO,  0xa8, struct kvm_create_spapr_tce)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
+/* Available with KVM_CAP_SW_TLB */
+#define KVM_DIRTY_TLB		  _IOW(KVMIO,  0xaa, struct kvm_dirty_tlb)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 05/14] KVM: PPC: e500: tlbsx: fix tlb0 esel
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Scott Wood

From: Scott Wood <scottwood@freescale.com>

It should contain the way, not the absolute TLB0 index.
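
The one-line fix reduces the flat TLB0 entry index to a way number before it is written back into MAS0[ESEL]. A standalone sketch of that masking (the 4-way geometry in the test is a made-up example, not taken from the patch):

```c
#include <assert.h>

/*
 * tlbsx must report the way in MAS0[ESEL], not the flat gtlb0 index.
 * With a power-of-two number of ways, masking with (ways - 1) keeps
 * only the way bits, mirroring `esel &= gtlb_params[tlbsel].ways - 1`.
 */
static unsigned int esel_to_way(unsigned int esel, unsigned int ways)
{
	return esel & (ways - 1);
}
```

With 4 ways, flat index 5 (set 1, way 1) reduces to way 1, which is what the guest expects to read back from MAS0.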

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/kvm/e500_tlb.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index f19ae2f..ec17148 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -853,6 +853,8 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 	}
 
 	if (gtlbe) {
+		esel &= vcpu_e500->gtlb_params[tlbsel].ways - 1;
+
 		vcpu_e500->mas0 = MAS0_TLBSEL(tlbsel) | MAS0_ESEL(esel)
 			| MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
 		vcpu_e500->mas1 = gtlbe->mas1;
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 06/14] KVM: PPC: e500: Don't hardcode PIR=0
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Scott Wood

From: Scott Wood <scottwood@freescale.com>

The hardcoded behavior prevents proper SMP support.

QEMU shall specify the vcpu's PIR as the vcpu id.
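
The new check in set_sregs_arch206 can be sketched in isolation like this (the struct and vcpu ids below are hypothetical, for illustration only):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Minimal stand-in for the vcpu state touched by the patch. */
struct vcpu_sketch {
	uint32_t vcpu_id;	/* chosen by userspace at vcpu creation */
	uint32_t pir;		/* guest-visible processor ID register */
};

/* Mirror of the sregs check: PIR must match the vcpu id, not 0. */
static int set_pir(struct vcpu_sketch *vcpu, uint32_t pir)
{
	if (pir != vcpu->vcpu_id)
		return -EINVAL;
	vcpu->pir = pir;
	return 0;
}
```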

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/kvm/booke.c |    4 ++--
 arch/powerpc/kvm/e500.c  |    3 ---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index bb6c988..b642200 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -761,7 +761,7 @@ static void get_sregs_arch206(struct kvm_vcpu *vcpu,
 {
 	sregs->u.e.features |= KVM_SREGS_E_ARCH206;
 
-	sregs->u.e.pir = 0;
+	sregs->u.e.pir = vcpu->vcpu_id;
 	sregs->u.e.mcsrr0 = vcpu->arch.mcsrr0;
 	sregs->u.e.mcsrr1 = vcpu->arch.mcsrr1;
 	sregs->u.e.decar = vcpu->arch.decar;
@@ -774,7 +774,7 @@ static int set_sregs_arch206(struct kvm_vcpu *vcpu,
 	if (!(sregs->u.e.features & KVM_SREGS_E_ARCH206))
 		return 0;
 
-	if (sregs->u.e.pir != 0)
+	if (sregs->u.e.pir != vcpu->vcpu_id)
 		return -EINVAL;
 
 	vcpu->arch.mcsrr0 = sregs->u.e.mcsrr0;
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 14d6e6e..cbbcc9e 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -70,9 +70,6 @@ int kvmppc_core_vcpu_setup(struct kvm_vcpu *vcpu)
 	vcpu->arch.pvr = mfspr(SPRN_PVR);
 	vcpu_e500->svr = mfspr(SPRN_SVR);
 
-	/* Since booke kvm only support one core, update all vcpus' PIR to 0 */
-	vcpu->vcpu_id = 0;
-
 	vcpu->arch.cpu_type = KVM_CPU_E500V2;
 
 	return 0;
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 07/14] KVM: PPC: Fix build failure with HV KVM and CBE
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

When running with HV KVM and CBE config options enabled, I get
build failures like the following:

  arch/powerpc/kernel/head_64.o: In function `cbe_system_error_hv':
  (.text+0x1228): undefined reference to `do_kvm_0x1202'
  arch/powerpc/kernel/head_64.o: In function `cbe_maintenance_hv':
  (.text+0x1628): undefined reference to `do_kvm_0x1602'
  arch/powerpc/kernel/head_64.o: In function `cbe_thermal_hv':
  (.text+0x1828): undefined reference to `do_kvm_0x1802'

This is because we jump to a KVM handler whenever HV is enabled, but the
handler is only generated when PR KVM is configured.
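
The failure mode can be reproduced with a small preprocessor sketch (the macro names here are invented stand-ins; the real KVM_HANDLER* macros live in assembly): a PR-only handler macro expands to nothing when PR KVM is not configured, while the HV exception entry still references the symbol.

```c
#include <assert.h>

/* Hypothetical config: HV KVM on, PR KVM off. */
#define CONFIG_KVM_HV 1
/* #define CONFIG_KVM_PR 1 */

#ifdef CONFIG_KVM_PR
#define HANDLER_PR(n)	int handler_##n(void) { return n; }
#else
#define HANDLER_PR(n)	/* nothing emitted: any reference to handler_##n fails to link */
#endif

/* Unconditional variant, analogous to KVM_HANDLER_SKIP: */
#define HANDLER(n)	int handler_##n(void) { return n; }

/* The CBE exception stubs branch to this handler whenever HV is on,
 * so it must come from the unconditional macro: */
HANDLER(0x1202)
```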

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/kernel/exceptions-64s.S |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 29ddd8b..396d080 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -267,7 +267,7 @@ vsx_unavailable_pSeries_1:
 
 #ifdef CONFIG_CBE_RAS
 	STD_EXCEPTION_HV(0x1200, 0x1202, cbe_system_error)
-	KVM_HANDLER_PR_SKIP(PACA_EXGEN, EXC_HV, 0x1202)
+	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1202)
 #endif /* CONFIG_CBE_RAS */
 
 	STD_EXCEPTION_PSERIES(0x1300, 0x1300, instruction_breakpoint)
@@ -275,7 +275,7 @@ vsx_unavailable_pSeries_1:
 
 #ifdef CONFIG_CBE_RAS
 	STD_EXCEPTION_HV(0x1600, 0x1602, cbe_maintenance)
-	KVM_HANDLER_PR_SKIP(PACA_EXGEN, EXC_HV, 0x1602)
+	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1602)
 #endif /* CONFIG_CBE_RAS */
 
 	STD_EXCEPTION_PSERIES(0x1700, 0x1700, altivec_assist)
@@ -283,7 +283,7 @@ vsx_unavailable_pSeries_1:
 
 #ifdef CONFIG_CBE_RAS
 	STD_EXCEPTION_HV(0x1800, 0x1802, cbe_thermal)
-	KVM_HANDLER_PR_SKIP(PACA_EXGEN, EXC_HV, 0x1802)
+	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1802)
 #endif /* CONFIG_CBE_RAS */
 
 	. = 0x3000
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 08/14] Revert "KVM: PPC: Add support for explicit HIOR setting"
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

This reverts commit 11d7596e18a712dc3bc29d45662ec111fd65946b. It exceeded
the padding on the SREGS struct, rendering the ABI backwards-incompatible.
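
The ABI hazard behind the revert can be demonstrated with sizeof on a padded union (field names and the 16-byte pad are illustrative, much smaller than the real kvm_sregs):

```c
#include <assert.h>
#include <stdint.h>

/* Original layout: a union sized by a fixed pad pins down the ABI. */
struct sregs_v1 {
	union {
		struct { uint64_t a; } s;
		uint8_t pad[16];
	} u;
};

/* New fields that still fit inside the pad keep the size unchanged... */
struct sregs_v2 {
	union {
		struct { uint64_t a; uint64_t flags; } s;
		uint8_t pad[16];
	} u;
};

/* ...but exceeding the pad grows the struct, breaking old userspace. */
struct sregs_overflow {
	union {
		struct { uint64_t a; uint64_t flags; uint64_t hior; } s;
		uint8_t pad[16];
	} u;
};
```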

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/include/asm/kvm.h        |    8 --------
 arch/powerpc/include/asm/kvm_book3s.h |    2 --
 arch/powerpc/kvm/book3s_pr.c          |   14 ++------------
 arch/powerpc/kvm/powerpc.c            |    1 -
 include/linux/kvm.h                   |    1 -
 5 files changed, 2 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index 71684b9..a635e22 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -149,12 +149,6 @@ struct kvm_regs {
 #define KVM_SREGS_E_UPDATE_DBSR		(1 << 3)
 
 /*
- * Book3S special bits to indicate contents in the struct by maintaining
- * backwards compatibility with older structs. If adding a new field,
- * please make sure to add a flag for that new field */
-#define KVM_SREGS_S_HIOR		(1 << 0)
-
-/*
  * In KVM_SET_SREGS, reserved/pad fields must be left untouched from a
  * previous KVM_GET_REGS.
  *
@@ -179,8 +173,6 @@ struct kvm_sregs {
 				__u64 ibat[8]; 
 				__u64 dbat[8]; 
 			} ppc32;
-			__u64 flags; /* KVM_SREGS_S_ */
-			__u64 hior;
 		} s;
 		struct {
 			union {
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index a384ffd..d4df013 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -90,8 +90,6 @@ struct kvmppc_vcpu_book3s {
 #endif
 	int context_id[SID_CONTEXTS];
 
-	bool hior_sregs;		/* HIOR is set by SREGS, not PVR */
-
 	struct hlist_head hpte_hash_pte[HPTEG_HASH_NUM_PTE];
 	struct hlist_head hpte_hash_pte_long[HPTEG_HASH_NUM_PTE_LONG];
 	struct hlist_head hpte_hash_vpte[HPTEG_HASH_NUM_VPTE];
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index d417511..84505a2 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -150,16 +150,14 @@ void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
 #ifdef CONFIG_PPC_BOOK3S_64
 	if ((pvr >= 0x330000) && (pvr < 0x70330000)) {
 		kvmppc_mmu_book3s_64_init(vcpu);
-		if (!to_book3s(vcpu)->hior_sregs)
-			to_book3s(vcpu)->hior = 0xfff00000;
+		to_book3s(vcpu)->hior = 0xfff00000;
 		to_book3s(vcpu)->msr_mask = 0xffffffffffffffffULL;
 		vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	} else
 #endif
 	{
 		kvmppc_mmu_book3s_32_init(vcpu);
-		if (!to_book3s(vcpu)->hior_sregs)
-			to_book3s(vcpu)->hior = 0;
+		to_book3s(vcpu)->hior = 0;
 		to_book3s(vcpu)->msr_mask = 0xffffffffULL;
 		vcpu->arch.cpu_type = KVM_CPU_3S_32;
 	}
@@ -796,9 +794,6 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 		}
 	}
 
-	if (sregs->u.s.flags & KVM_SREGS_S_HIOR)
-		sregs->u.s.hior = to_book3s(vcpu)->hior;
-
 	return 0;
 }
 
@@ -835,11 +830,6 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 	/* Flush the MMU after messing with the segments */
 	kvmppc_mmu_pte_flush(vcpu, 0, 0);
 
-	if (sregs->u.s.flags & KVM_SREGS_S_HIOR) {
-		to_book3s(vcpu)->hior_sregs = true;
-		to_book3s(vcpu)->hior = sregs->u.s.hior;
-	}
-
 	return 0;
 }
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 55b4233..e75c5ac 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -209,7 +209,6 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PPC_BOOKE_SREGS:
 #else
 	case KVM_CAP_PPC_SEGSTATE:
-	case KVM_CAP_PPC_HIOR:
 	case KVM_CAP_PPC_PAPR:
 #endif
 	case KVM_CAP_PPC_UNSET_IRQ:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 76ef719..a6b1295 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -555,7 +555,6 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_SMT 64
 #define KVM_CAP_PPC_RMA	65
 #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
-#define KVM_CAP_PPC_HIOR 67
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_SW_TLB 69
 #define KVM_CAP_S390_GMAP 71
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

Right now we transfer a static struct every time we want to get or set
registers. Unfortunately, over time we have realized that there are more
registers than we originally anticipated, and transferring a full struct
every time limits extensibility and flexibility.

So this is a new approach to the problem. With these new ioctls, we can
get and set a single register that is identified by an ID. This allows for
very precise and limited transmittal of data. When we later realize that
it's a better idea to shove over multiple registers at once, we can reuse
most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
interface.

The only downside I see is that the payload needs to pad to 1024 bits
(hardware is already at 512-bit registers, so I wanted to leave some room),
which is slightly too much for transmitting only 64 bits. But if that's
the only tradeoff we have to make for an extensible interface, I'd say
it's worth it.
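
A userspace-side sketch of the new interface (the struct mirrors the one added in the diff; the 0x42 register index is made up, since no registers are defined yet):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_ONE_REG_PPC	0x1000000000000000ULL

/* Local copy of the uapi struct from the patch. */
struct one_reg {
	uint64_t id;
	union {
		uint8_t  reg8;
		uint16_t reg16;
		uint32_t reg32;
		uint64_t reg64;
		uint8_t  reg128[16];
		uint8_t  reg256[32];
		uint8_t  reg512[64];
		uint8_t  reg1024[128];
	} u;
};

/* A register ID ORs the arch identifier with an arch-local index. */
static uint64_t ppc_reg_id(uint64_t index)
{
	return KVM_ONE_REG_PPC | index;
}
```

The ioctl itself would then be issued as `ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg)` once a register ID is actually defined.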

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
 include/linux/kvm.h               |   32 +++++++++++++++++++++++
 3 files changed, 130 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ab1136f..a23fe62 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1482,6 +1482,53 @@ is supported; 2 if the processor requires all virtual machines to have
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.
 
+4.64 KVM_SET_ONE_REG
+
+Capability: KVM_CAP_ONE_REG
+Architectures: all
+Type: vcpu ioctl
+Parameters: struct kvm_one_reg (in)
+Returns: 0 on success, negative value on failure
+
+struct kvm_one_reg {
+       __u64 id;
+       union {
+               __u8 reg8;
+               __u16 reg16;
+               __u32 reg32;
+               __u64 reg64;
+               __u8 reg128[16];
+               __u8 reg256[32];
+               __u8 reg512[64];
+               __u8 reg1024[128];
+       } u;
+};
+
+Using this ioctl, a single vcpu register can be set to a specific value
+defined by user space with the passed in struct kvm_one_reg. There can
+be architecture agnostic and architecture specific registers. Each have
+their own range of operation and their own constants and width. To keep
+track of the implemented registers, find a list below:
+
+  Arch  |       Register        | Width (bits)
+        |                       |
+
+4.65 KVM_GET_ONE_REG
+
+Capability: KVM_CAP_ONE_REG
+Architectures: all
+Type: vcpu ioctl
+Parameters: struct kvm_one_reg (in and out)
+Returns: 0 on success, negative value on failure
+
+This ioctl allows to receive the value of a single register implemented
+in a vcpu. The register to read is indicated by the "id" field of the
+kvm_one_reg struct passed in. On success, the register value can be found
+in the respective width field of the struct after this call.
+
+The list of registers accessible using this interface is identical to the
+list in 4.64.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index e75c5ac..39cdb3f 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -214,6 +214,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PPC_UNSET_IRQ:
 	case KVM_CAP_PPC_IRQ_LEVEL:
 	case KVM_CAP_ENABLE_CAP:
+	case KVM_CAP_ONE_REG:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -627,6 +628,32 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 	return r;
 }
 
+static int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu,
+				      struct kvm_one_reg *reg)
+{
+	int r = -EINVAL;
+
+	switch (reg->id) {
+	default:
+		break;
+	}
+
+	return r;
+}
+
+static int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu,
+				      struct kvm_one_reg *reg)
+{
+	int r = -EINVAL;
+
+	switch (reg->id) {
+	default:
+		break;
+	}
+
+	return r;
+}
+
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
                                     struct kvm_mp_state *mp_state)
 {
@@ -666,6 +693,30 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		break;
 	}
 
+	case KVM_GET_ONE_REG:
+	{
+		struct kvm_one_reg reg;
+		r = -EFAULT;
+		if (copy_from_user(&reg, argp, sizeof(reg)))
+			goto out;
+		r = kvm_vcpu_ioctl_get_one_reg(vcpu, &reg);
+		if (copy_to_user(argp, &reg, sizeof(reg))) {
+			r = -EFAULT;
+			goto out;
+		}
+		break;
+	}
+
+	case KVM_SET_ONE_REG:
+	{
+		struct kvm_one_reg reg;
+		r = -EFAULT;
+		if (copy_from_user(&reg, argp, sizeof(reg)))
+			goto out;
+		r = kvm_vcpu_ioctl_set_one_reg(vcpu, &reg);
+		break;
+	}
+
 #ifdef CONFIG_KVM_E500
 	case KVM_DIRTY_TLB: {
 		struct kvm_dirty_tlb dirty;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index a6b1295..e652a7b 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -557,6 +557,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_SW_TLB 69
+#define KVM_CAP_ONE_REG 70
 #define KVM_CAP_S390_GMAP 71
 
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -652,6 +653,34 @@ struct kvm_dirty_tlb {
 	__u32 num_dirty;
 };
 
+/* Available with KVM_CAP_ONE_REG */
+
+#define KVM_ONE_REG_GENERIC		0x0000000000000000ULL
+
+/*
+ * Architecture specific registers are to be defined in arch headers and
+ * ORed with the arch identifier.
+ */
+#define KVM_ONE_REG_PPC			0x1000000000000000ULL
+#define KVM_ONE_REG_X86			0x2000000000000000ULL
+#define KVM_ONE_REG_IA64		0x3000000000000000ULL
+#define KVM_ONE_REG_ARM			0x4000000000000000ULL
+#define KVM_ONE_REG_S390		0x5000000000000000ULL
+
+struct kvm_one_reg {
+	__u64 id;
+	union {
+		__u8 reg8;
+		__u16 reg16;
+		__u32 reg32;
+		__u64 reg64;
+		__u8 reg128[16];
+		__u8 reg256[32];
+		__u8 reg512[64];
+		__u8 reg1024[128];
+	} u;
+};
+
 /*
  * ioctls for VM fds
  */
@@ -780,6 +809,9 @@ struct kvm_dirty_tlb {
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_SW_TLB */
 #define KVM_DIRTY_TLB		  _IOW(KVMIO,  0xaa, struct kvm_dirty_tlb)
+/* Available with KVM_CAP_ONE_REG */
+#define KVM_GET_ONE_REG		  _IOWR(KVMIO, 0xab, struct kvm_one_reg)
+#define KVM_SET_ONE_REG		  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 09/14] KVM: PPC: Add generic single register ioctls
@ 2011-10-31  7:53   ` Alexander Graf
  0 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

Right now we transfer a static struct every time we want to get or set
registers. Unfortunately, over time we realize that there are more of
these than we thought of before and the extensibility and flexibility of
transferring a full struct every time is limited.

So this is a new approach to the problem. With these new ioctls, we can
get and set a single register that is identified by an ID. This allows for
very precise and limited transmittal of data. When we later realize that
it's a better idea to shove over multiple registers at once, we can reuse
most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
interface.

The only downside I see to this approach is that it needs to pad to 1024
bits (hardware is already at 512-bit registers, so I wanted to leave some
room), which is slightly too much for transmitting only 64 bits. But if
that's the only tradeoff we have to make to get an extensible interface,
I'd say go for it nevertheless.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
 include/linux/kvm.h               |   32 +++++++++++++++++++++++
 3 files changed, 130 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ab1136f..a23fe62 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1482,6 +1482,53 @@ is supported; 2 if the processor requires all virtual machines to have
 an RMA, or 1 if the processor can use an RMA but doesn't require it,
 because it supports the Virtual RMA (VRMA) facility.
 
+4.64 KVM_SET_ONE_REG
+
+Capability: KVM_CAP_ONE_REG
+Architectures: all
+Type: vcpu ioctl
+Parameters: struct kvm_one_reg (in)
+Returns: 0 on success, negative value on failure
+
+struct kvm_one_reg {
+       __u64 id;
+       union {
+               __u8 reg8;
+               __u16 reg16;
+               __u32 reg32;
+               __u64 reg64;
+               __u8 reg128[16];
+               __u8 reg256[32];
+               __u8 reg512[64];
+               __u8 reg1024[128];
+       } u;
+};
+
+Using this ioctl, a single vcpu register can be set to a specific value
+supplied by user space in the passed-in struct kvm_one_reg. Registers may
+be architecture-agnostic or architecture-specific, each with its own ID
+range, constants and width. To keep track of the implemented registers,
+find a list below:
+
+  Arch  |       Register        | Width (bits)
+        |                       |
+
+4.65 KVM_GET_ONE_REG
+
+Capability: KVM_CAP_ONE_REG
+Architectures: all
+Type: vcpu ioctl
+Parameters: struct kvm_one_reg (in and out)
+Returns: 0 on success, negative value on failure
+
+This ioctl allows user space to read the value of a single register
+implemented in a vcpu. The register to read is indicated by the "id" field
+of the kvm_one_reg struct passed in. On success, the register value can be
+found in the field of the struct matching the register's width.
+
+The list of registers accessible using this interface is identical to the
+list in 4.64.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index e75c5ac..39cdb3f 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -214,6 +214,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PPC_UNSET_IRQ:
 	case KVM_CAP_PPC_IRQ_LEVEL:
 	case KVM_CAP_ENABLE_CAP:
+	case KVM_CAP_ONE_REG:
 		r = 1;
 		break;
 #ifndef CONFIG_KVM_BOOK3S_64_HV
@@ -627,6 +628,32 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 	return r;
 }
 
+static int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu,
+				      struct kvm_one_reg *reg)
+{
+	int r = -EINVAL;
+
+	switch (reg->id) {
+	default:
+		break;
+	}
+
+	return r;
+}
+
+static int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu,
+				      struct kvm_one_reg *reg)
+{
+	int r = -EINVAL;
+
+	switch (reg->id) {
+	default:
+		break;
+	}
+
+	return r;
+}
+
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
                                     struct kvm_mp_state *mp_state)
 {
@@ -666,6 +693,30 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		break;
 	}
 
+	case KVM_GET_ONE_REG:
+	{
+		struct kvm_one_reg reg;
+		r = -EFAULT;
+		if (copy_from_user(&reg, argp, sizeof(reg)))
+			goto out;
+		r = kvm_vcpu_ioctl_get_one_reg(vcpu, &reg);
+		if (copy_to_user(argp, &reg, sizeof(reg))) {
+			r = -EFAULT;
+			goto out;
+		}
+		break;
+	}
+
+	case KVM_SET_ONE_REG:
+	{
+		struct kvm_one_reg reg;
+		r = -EFAULT;
+		if (copy_from_user(&reg, argp, sizeof(reg)))
+			goto out;
+		r = kvm_vcpu_ioctl_set_one_reg(vcpu, &reg);
+		break;
+	}
+
 #ifdef CONFIG_KVM_E500
 	case KVM_DIRTY_TLB: {
 		struct kvm_dirty_tlb dirty;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index a6b1295..e652a7b 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -557,6 +557,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_SW_TLB 69
+#define KVM_CAP_ONE_REG 70
 #define KVM_CAP_S390_GMAP 71
 
 #ifdef KVM_CAP_IRQ_ROUTING
@@ -652,6 +653,34 @@ struct kvm_dirty_tlb {
 	__u32 num_dirty;
 };
 
+/* Available with KVM_CAP_ONE_REG */
+
+#define KVM_ONE_REG_GENERIC		0x0000000000000000ULL
+
+/*
+ * Architecture specific registers are to be defined in arch headers and
+ * ORed with the arch identifier.
+ */
+#define KVM_ONE_REG_PPC			0x1000000000000000ULL
+#define KVM_ONE_REG_X86			0x2000000000000000ULL
+#define KVM_ONE_REG_IA64		0x3000000000000000ULL
+#define KVM_ONE_REG_ARM			0x4000000000000000ULL
+#define KVM_ONE_REG_S390		0x5000000000000000ULL
+
+struct kvm_one_reg {
+	__u64 id;
+	union {
+		__u8 reg8;
+		__u16 reg16;
+		__u32 reg32;
+		__u64 reg64;
+		__u8 reg128[16];
+		__u8 reg256[32];
+		__u8 reg512[64];
+		__u8 reg1024[128];
+	} u;
+};
+
 /*
  * ioctls for VM fds
  */
@@ -780,6 +809,9 @@ struct kvm_dirty_tlb {
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
 /* Available with KVM_CAP_SW_TLB */
 #define KVM_DIRTY_TLB		  _IOW(KVMIO,  0xaa, struct kvm_dirty_tlb)
+/* Available with KVM_CAP_ONE_REG */
+#define KVM_GET_ONE_REG		  _IOWR(KVMIO, 0xab, struct kvm_one_reg)
+#define KVM_SET_ONE_REG		  _IOW(KVMIO,  0xac, struct kvm_one_reg)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 10/14] KVM: PPC: Add support for explicit HIOR setting
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

Until now, we always set HIOR based on the PVR, but this is just wrong.
Instead, we should be setting HIOR explicitly, so user space can decide
what the initial HIOR value is - just like on real hardware.

We keep the old PVR-based way around for backwards compatibility, but
once user space uses the SET_ONE_REG-based method, we drop the PVR logic.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 Documentation/virtual/kvm/api.txt     |    1 +
 arch/powerpc/include/asm/kvm.h        |    2 ++
 arch/powerpc/include/asm/kvm_book3s.h |    2 ++
 arch/powerpc/kvm/book3s_pr.c          |    6 ++++--
 arch/powerpc/kvm/powerpc.c            |   14 ++++++++++++++
 include/linux/kvm.h                   |    1 +
 6 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index a23fe62..e56a46d 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1512,6 +1512,7 @@ track of the implemented registers, find a list below:
 
   Arch  |       Register        | Width (bits)
         |                       |
+  PPC   | KVM_ONE_REG_PPC_HIOR  | 64
 
 4.65 KVM_GET_ONE_REG
 
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index a635e22..53b8759 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -327,4 +327,6 @@ struct kvm_book3e_206_tlb_params {
 	__u32 reserved[8];
 };
 
+#define KVM_ONE_REG_PPC_HIOR	(KVM_ONE_REG_PPC | 0x100)
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index d4df013..0ba8ba9 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -90,6 +90,8 @@ struct kvmppc_vcpu_book3s {
 #endif
 	int context_id[SID_CONTEXTS];
 
+	bool hior_explicit;		/* HIOR is set by ioctl, not PVR */
+
 	struct hlist_head hpte_hash_pte[HPTEG_HASH_NUM_PTE];
 	struct hlist_head hpte_hash_pte_long[HPTEG_HASH_NUM_PTE_LONG];
 	struct hlist_head hpte_hash_vpte[HPTEG_HASH_NUM_VPTE];
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 84505a2..565af5a 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -150,14 +150,16 @@ void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
 #ifdef CONFIG_PPC_BOOK3S_64
 	if ((pvr >= 0x330000) && (pvr < 0x70330000)) {
 		kvmppc_mmu_book3s_64_init(vcpu);
-		to_book3s(vcpu)->hior = 0xfff00000;
+		if (!to_book3s(vcpu)->hior_explicit)
+			to_book3s(vcpu)->hior = 0xfff00000;
 		to_book3s(vcpu)->msr_mask = 0xffffffffffffffffULL;
 		vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	} else
 #endif
 	{
 		kvmppc_mmu_book3s_32_init(vcpu);
-		to_book3s(vcpu)->hior = 0;
+		if (!to_book3s(vcpu)->hior_explicit)
+			to_book3s(vcpu)->hior = 0;
 		to_book3s(vcpu)->msr_mask = 0xffffffffULL;
 		vcpu->arch.cpu_type = KVM_CPU_3S_32;
 	}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 39cdb3f..c33f6a7 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -209,6 +209,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PPC_BOOKE_SREGS:
 #else
 	case KVM_CAP_PPC_SEGSTATE:
+	case KVM_CAP_PPC_HIOR:
 	case KVM_CAP_PPC_PAPR:
 #endif
 	case KVM_CAP_PPC_UNSET_IRQ:
@@ -634,6 +635,12 @@ static int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu,
 	int r = -EINVAL;
 
 	switch (reg->id) {
+#ifdef CONFIG_PPC_BOOK3S
+	case KVM_ONE_REG_PPC_HIOR:
+		reg->u.reg64 = to_book3s(vcpu)->hior;
+		r = 0;
+		break;
+#endif
 	default:
 		break;
 	}
@@ -647,6 +654,13 @@ static int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu,
 	int r = -EINVAL;
 
 	switch (reg->id) {
+#ifdef CONFIG_PPC_BOOK3S
+	case KVM_ONE_REG_PPC_HIOR:
+		to_book3s(vcpu)->hior = reg->u.reg64;
+		to_book3s(vcpu)->hior_explicit = true;
+		r = 0;
+		break;
+#endif
 	default:
 		break;
 	}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index e652a7b..c107fae 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -555,6 +555,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_SMT 64
 #define KVM_CAP_PPC_RMA	65
 #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
+#define KVM_CAP_PPC_HIOR 67
 #define KVM_CAP_PPC_PAPR 68
 #define KVM_CAP_SW_TLB 69
 #define KVM_CAP_ONE_REG 70
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 11/14] KVM: PPC: Whitespace fix for kvm.h
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

kvm.h had stray trailing whitespace at the ends of some lines. Clean it
up so that syncing with QEMU gets easier.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/include/asm/kvm.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index 53b8759..fb3fddc 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -170,8 +170,8 @@ struct kvm_sregs {
 			} ppc64;
 			struct {
 				__u32 sr[16];
-				__u64 ibat[8]; 
-				__u64 dbat[8]; 
+				__u64 ibat[8];
+				__u64 dbat[8];
 			} ppc32;
 		} s;
 		struct {
-- 
1.6.0.2

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 12/14] KVM: Fix whitespace in kvm_para.h
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

When syncing KVM headers with QEMU, I (or whoever applies the
diff) end up fixing whitespace issues automatically. One of them
is in kvm_para.h.

It's a lot more consistent for people who don't do the whitespace
fixups automatically if the headers in Linux are already clean. So
remove the stray empty line at the end of kvm_para.h and everyone's
happy.

Reported-by: Blue Swirl <blauwirbel@gmail.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 include/linux/kvm_para.h |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index 47a070b..ff476dd 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -35,4 +35,3 @@ static inline int kvm_para_has_feature(unsigned int feature)
 }
 #endif /* __KERNEL__ */
 #endif /* __LINUX_KVM_PARA_H */
-
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 13/14] KVM: PPC: E500: Support hugetlbfs
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti

With hugetlbfs support emerging on e500, we should also support KVM
backing its guest memory with it.

This patch adds support for hugetlbfs into the e500 shadow mmu code.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Scott Wood <scottwood@freescale.com>

---

v1 -> v2:

  - address scott's comments
---
 arch/powerpc/kvm/e500_tlb.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index ec17148..1dd96a9 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -24,6 +24,7 @@
 #include <linux/sched.h>
 #include <linux/rwsem.h>
 #include <linux/vmalloc.h>
+#include <linux/hugetlb.h>
 #include <asm/kvm_ppc.h>
 #include <asm/kvm_e500.h>
 
@@ -673,12 +674,31 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 				pfn &= ~(tsize_pages - 1);
 				break;
 			}
+		} else if (vma && hva >= vma->vm_start &&
+                           (vma->vm_flags & VM_HUGETLB)) {
+			unsigned long psize = vma_kernel_pagesize(vma);
+
+			tsize = (gtlbe->mas1 & MAS1_TSIZE_MASK) >>
+				MAS1_TSIZE_SHIFT;
+
+			/*
+			 * Take the largest page size that satisfies both host
+			 * and guest mapping
+			 */
+			tsize = min(__ilog2(psize) - 10, tsize);
+
+			/*
+			 * e500 doesn't implement the lowest tsize bit,
+			 * or 1K pages.
+			 */
+			tsize = max(BOOK3E_PAGESZ_4K, tsize & ~1);
 		}
 
 		up_read(&current->mm->mmap_sem);
 	}
 
 	if (likely(!pfnmap)) {
+		unsigned long tsize_pages = 1 << (tsize + 10 - PAGE_SHIFT);
 		pfn = gfn_to_pfn_memslot(vcpu_e500->vcpu.kvm, slot, gfn);
 		if (is_error_pfn(pfn)) {
 			printk(KERN_ERR "Couldn't get real page for gfn %lx!\n",
@@ -686,6 +706,10 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 			kvm_release_pfn_clean(pfn);
 			return;
 		}
+
+		/* Align guest and physical address to page map boundaries */
+		pfn &= ~(tsize_pages - 1);
+		gvaddr &= ~((tsize_pages << PAGE_SHIFT) - 1);
 	}
 
 	/* Drop old ref and setup new one. */
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 14/14] PPC: Fix race in mtmsr paravirt implementation
  2011-10-31  7:53 ` Alexander Graf
@ 2011-10-31  7:53   ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31  7:53 UTC (permalink / raw)
  To: kvm-ppc; +Cc: kvm list, Marcelo Tosatti, Bharat Bhushan, Bharat Bhushan

From: Bharat Bhushan <r65777@freescale.com>

The current implementations of mtmsr and mtmsrd are racy in that they do:

  * check (int_pending == 0)
  ---> host sets int_pending = 1 <---
  * write shared page
  * done

while instead we should check for int_pending after the shared page is written.

Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/kernel/kvm_emul.S |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/kvm_emul.S b/arch/powerpc/kernel/kvm_emul.S
index f2b1b25..3d64c57 100644
--- a/arch/powerpc/kernel/kvm_emul.S
+++ b/arch/powerpc/kernel/kvm_emul.S
@@ -167,6 +167,9 @@ maybe_stay_in_guest:
 kvm_emulate_mtmsr_reg2:
 	ori	r30, r0, 0
 
+	/* Put MSR into magic page because we don't call mtmsr */
+	STL64(r30, KVM_MAGIC_PAGE + KVM_MAGIC_MSR, 0)
+
 	/* Check if we have to fetch an interrupt */
 	lwz	r31, (KVM_MAGIC_PAGE + KVM_MAGIC_INT)(0)
 	cmpwi	r31, 0
@@ -174,15 +177,10 @@ kvm_emulate_mtmsr_reg2:
 
 	/* Check if we may trigger an interrupt */
 	andi.	r31, r30, MSR_EE
-	beq	no_mtmsr
-
-	b	do_mtmsr
+	bne	do_mtmsr
 
 no_mtmsr:
 
-	/* Put MSR into magic page because we don't call mtmsr */
-	STL64(r30, KVM_MAGIC_PAGE + KVM_MAGIC_MSR, 0)
-
 	SCRATCH_RESTORE
 
 	/* Go back to caller */
-- 
1.6.0.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 01/14] KVM: PPC: e500: don't translate gfn to pfn with preemption disabled
  2011-10-31  7:53   ` Alexander Graf
@ 2011-10-31 12:50     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-10-31 12:50 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list, Marcelo Tosatti, Scott Wood

On 10/31/2011 09:53 AM, Alexander Graf wrote:
> From: Scott Wood <scottwood@freescale.com>
>
> Delay allocation of the shadow pid until we're ready to disable
> preemption and write the entry.
>
> @@ -507,21 +507,16 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
>  	vcpu_e500->mas7 = 0;
>  }
>  
> +/* TID must be supplied by the caller */
>  static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
>  					   struct tlbe *gtlbe, int tsize,
>  					   struct tlbe_priv *priv,
>  					   u64 gvaddr, struct tlbe *stlbe)
>  {
>  	pfn_t pfn = priv->pfn;
> -	unsigned int stid;
> -
> -	stid = kvmppc_e500_get_sid(vcpu_e500, get_tlb_ts(gtlbe),
> -				   get_tlb_tid(gtlbe),
> -				   get_cur_pr(&vcpu_e500->vcpu), 0);
>  
>  	/* Force TS=1 IPROT=0 for all guest mappings. */
> -	stlbe->mas1 = MAS1_TSIZE(tsize)
> -		| MAS1_TID(stid) | MAS1_TS | MAS1_VALID;
> +	stlbe->mas1 = MAS1_TSIZE(tsize) | MAS1_TS | MAS1_VALID;
>  	stlbe->mas2 = (gvaddr & MAS2_EPN)
>  		| e500_shadow_mas2_attrib(gtlbe->mas2,
>  				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
> @@ -816,6 +811,24 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
>  	return EMULATE_DONE;
>  }
>  
> +/* sesel is index into the set, not the whole array */
> +static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
> +			struct tlbe *gtlbe,
> +			struct tlbe *stlbe,
> +			int stlbsel, int sesel)
> +{
> +	int stid;
> +
> +	preempt_disable();
> +	stid = kvmppc_e500_get_sid(vcpu_e500, get_tlb_ts(gtlbe),
> +				   get_tlb_tid(gtlbe),
> +				   get_cur_pr(&vcpu_e500->vcpu), 0);
> +
> +	stlbe->mas1 |= MAS1_TID(stid);
> +	write_host_tlbe(vcpu_e500, stlbsel, sesel, stlbe);
> +	preempt_enable();
> +}
> +
>

This naked preempt_disable() is fishy.  What happens if we're migrated
immediately afterwards?  Do we fault again and redo?

I realize that the patch doesn't introduce this.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-10-31  7:53   ` Alexander Graf
@ 2011-10-31 13:24     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-10-31 13:24 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list, Marcelo Tosatti, Scott Wood

On 10/31/2011 09:53 AM, Alexander Graf wrote:
> From: Scott Wood <scottwood@freescale.com>
>
> This implements a shared-memory API for giving host userspace access to
> the guest's TLB.
>
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 7945b0b..ab1136f 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1383,6 +1383,38 @@ The following flags are defined:
>  If datamatch flag is set, the event will be signaled only if the written value
>  to the registered address is equal to datamatch in struct kvm_ioeventfd.
>  
> +4.59 KVM_DIRTY_TLB
> +
> +Capability: KVM_CAP_SW_TLB
> +Architectures: ppc
> +Type: vcpu ioctl
> +Parameters: struct kvm_dirty_tlb (in)
> +Returns: 0 on success, -1 on error
> +
> +struct kvm_dirty_tlb {
> +	__u64 bitmap;
> +	__u32 num_dirty;
> +};

This is not 32/64-bit safe.  e500 is 32-bit only, yes?  But what if
someone wants to emulate an e500 on a ppc64?  Maybe it's better to add
padding here.

Another alternative is to drop the num_dirty field (and let the kernel
compute it instead; that shouldn't take long), and have the third argument
to ioctl() reference the bitmap directly.

> +
> +This must be called whenever userspace has changed an entry in the shared
> +TLB, prior to calling KVM_RUN on the associated vcpu.
> +
> +The "bitmap" field is the userspace address of an array.  This array
> +consists of a number of bits, equal to the total number of TLB entries as
> +determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
> +nearest multiple of 64.
> +
> +Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
> +array.
> +
> +The array is little-endian: bit 0 is the least significant bit of the
> +first byte, bit 8 is the least significant bit of the second byte, etc.
> +This avoids any complications with differing word sizes.

And people say little/big endian is just a matter of taste.

> +
> +The "num_dirty" field is a performance hint for KVM to determine whether it
> +should skip processing the bitmap and just invalidate everything.  It must
> +be set to the number of set bits in the bitmap.
> +
>  4.62 KVM_CREATE_SPAPR_TCE
>  
>  Capability: KVM_CAP_SPAPR_TCE
> @@ -1700,3 +1732,45 @@ HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
>  HTAB invisible to the guest.
>  
>  When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
> +
> +6.3 KVM_CAP_SW_TLB
> +
> +Architectures: ppc
> +Parameters: args[0] is the address of a struct kvm_config_tlb
> +Returns: 0 on success; -1 on error
> +
> +struct kvm_config_tlb {
> +	__u64 params;
> +	__u64 array;
> +	__u32 mmu_type;
> +	__u32 array_len;
> +};

Would it not be simpler to use args[0-3] for this, instead of yet
another indirection?

> +
> +Configures the virtual CPU's TLB array, establishing a shared memory area
> +between userspace and KVM.  The "params" and "array" fields are userspace
> +addresses of mmu-type-specific data structures.  The "array_len" field is a
> +safety mechanism, and should be set to the size in bytes of the memory that
> +userspace has reserved for the array.  It must be at least the size dictated
> +by "mmu_type" and "params".
> +
> +While KVM_RUN is active, the shared region is under control of KVM.  Its
> +contents are undefined, and any modification by userspace results in
> +boundedly undefined behavior.
> +
> +On return from KVM_RUN, the shared region will reflect the current state of
> +the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
> +to tell KVM which entries have been changed, prior to calling KVM_RUN again
> +on this vcpu.

We already have another mechanism for such shared memory,
mmap(vcpu_fd).  x86 uses it for the coalesced mmio region as well as the
traditional kvm_run area.  Please consider using it.


> +
> +For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
> + - The "params" field is of type "struct kvm_book3e_206_tlb_params".
> + - The "array" field points to an array of type "struct
> +   kvm_book3e_206_tlb_entry".
> + - The array consists of all entries in the first TLB, followed by all
> +   entries in the second TLB.
> + - Within a TLB, entries are ordered first by increasing set number.  Within a
> +   set, entries are ordered by way (increasing ESEL).
> + - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
> +   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
> + - The tsize field of mas1 shall be set to 4K on TLB0, even though the
> +   hardware ignores this value for TLB0.

Holy shit.

> @@ -95,6 +90,9 @@ struct kvmppc_vcpu_e500 {
>  	u32 tlb1cfg;
>  	u64 mcar;
>  
> +	struct page **shared_tlb_pages;
> +	int num_shared_tlb_pages;
> +

I missed the requirement that things be page aligned.

If you use mmap(vcpu_fd) this becomes simpler; you can use
get_free_pages() and have a single pointer.  You can also use vmap() on
this array (but get_free_pages() is faster).


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 06/14] KVM: PPC: e500: Don't hardcode PIR=0
  2011-10-31  7:53   ` Alexander Graf
@ 2011-10-31 13:27     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-10-31 13:27 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list, Marcelo Tosatti, Scott Wood

On 10/31/2011 09:53 AM, Alexander Graf wrote:
> From: Scott Wood <scottwood@freescale.com>
>
> The hardcoded behavior prevents proper SMP support.
>
> QEMU shall specify the vcpu's PIR as the vcpu id.
>
>

It could also be the kvm tool; we generally use the name 'userspace' to
refer to QEMU (but don't rewrite the patch on this account).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 08/14] Revert "KVM: PPC: Add support for explicit HIOR setting"
  2011-10-31  7:53   ` Alexander Graf
@ 2011-10-31 13:30     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-10-31 13:30 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list, Marcelo Tosatti

On 10/31/2011 09:53 AM, Alexander Graf wrote:
> This reverts commit 11d7596e18a712dc3bc29d45662ec111fd65946b. It exceeded
> the padding on the SREGS struct, rendering the ABI backwards-incompatible.

I can't find that commit hash.  Please use hashes from the Linus tree when
possible.

Does this need to be backported?  To which trees?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-10-31  7:53   ` Alexander Graf
@ 2011-10-31 13:36     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-10-31 13:36 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list, Marcelo Tosatti, Jan Kiszka

On 10/31/2011 09:53 AM, Alexander Graf wrote:
> Right now we transfer a static struct every time we want to get or set
> registers. Unfortunately, over time we realize that there are more of
> these than we thought of before and the extensibility and flexibility of
> transferring a full struct every time is limited.
>
> So this is a new approach to the problem. With these new ioctls, we can
> get and set a single register that is identified by an ID. This allows for
> very precise and limited transmittal of data. When we later realize that
> it's a better idea to shove over multiple registers at once, we can reuse
> most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
> interface.
>
> The only downside I see to this one is that it needs to pad to 1024 bits
> (hardware is already on 512 bit registers, so I wanted to leave some room)
> which is slightly too much for transmitting only 64 bits. But if that's all
> the tradeoff we have to do for getting an extensible interface, I'd say go
> for it nevertheless.

Do we want this for x86 too?  How often do we want just one register?

>  
> +4.64 KVM_SET_ONE_REG
> +
> +Capability: KVM_CAP_ONE_REG
> +Architectures: all
> +Type: vcpu ioctl
> +Parameters: struct kvm_one_reg (in)
> +Returns: 0 on success, negative value on failure
> +
> +struct kvm_one_reg {
> +       __u64 id;

It would be better to have a register set (in x86 terms,
gpr/x86/sse/cr/xcr/msr/special) and an ID within the set.  __u64 is
excessive, I hope.

> +       union {
> +               __u8 reg8;
> +               __u16 reg16;
> +               __u32 reg32;
> +               __u64 reg64;
> +               __u8 reg128[16];
> +               __u8 reg256[32];
> +               __u8 reg512[64];
> +               __u8 reg1024[128];
> +       } u;
> +};
> +
> +Using this ioctl, a single vcpu register can be set to a specific value
> +defined by user space with the passed in struct kvm_one_reg. There can
> +be architecture agnostic and architecture specific registers. Each have
> +their own range of operation and their own constants and width. To keep
> +track of the implemented registers, find a list below:
> +
> +  Arch  |       Register        | Width (bits)
> +        |                       |
> +
>

One possible issue is that certain registers have mutually exclusive
values, so you may need to issue multiple calls to get the right
sequence.  You probably don't have that on ppc.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 13/14] KVM: PPC: E500: Support hugetlbfs
  2011-10-31  7:53   ` Alexander Graf
@ 2011-10-31 13:38     ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-10-31 13:38 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list, Marcelo Tosatti

On 10/31/2011 09:53 AM, Alexander Graf wrote:
> With hugetlbfs support emerging on e500, we should also support KVM
> backing its guest memory by it.
>
> This patch adds support for hugetlbfs into the e500 shadow mmu code.
>
>
> @@ -673,12 +674,31 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>  				pfn &= ~(tsize_pages - 1);
>  				break;
>  			}
> +		} else if (vma && hva >= vma->vm_start &&
> +                           (vma->vm_flags & VM_HUGETLB)) {
> +			unsigned long psize = vma_kernel_pagesize(vma);
>

Leading spaces spotted.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-10-31 13:36     ` Avi Kivity
@ 2011-10-31 17:26       ` Jan Kiszka
  -1 siblings, 0 replies; 82+ messages in thread
From: Jan Kiszka @ 2011-10-31 17:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 2011-10-31 14:36, Avi Kivity wrote:
> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> Right now we transfer a static struct every time we want to get or set
>> registers. Unfortunately, over time we realize that there are more of
>> these than we thought of before and the extensibility and flexibility of
>> transferring a full struct every time is limited.
>>
>> So this is a new approach to the problem. With these new ioctls, we can
>> get and set a single register that is identified by an ID. This allows for
>> very precise and limited transmittal of data. When we later realize that
>> it's a better idea to shove over multiple registers at once, we can reuse
>> most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
>> interface.
>>
>> The only downside I see to this one is that it needs to pad to 1024 bits
>> (hardware is already on 512 bit registers, so I wanted to leave some room)
>> which is slightly too much for transmitting only 64 bits. But if that's all
>> the tradeoff we have to do for getting an extensible interface, I'd say go
>> for it nevertheless.
> 
> Do we want this for x86 too?  How often do we want just one register?

On x86, a single register is probably only interesting for debugging
purposes. Things that matter performance-wise are in kvm_run; anything
else is not that sensitive to speed, at least for now. I'm still
waiting for Kemari to propose some get/set optimizations, but there is
obviously either not much to gain or still bigger fish to fry.

Also, x86 is less regular than PPC. And where it is fairly regular, we
already have a GET/SET_MANY interface: MSRs.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 01/14] KVM: PPC: e500: don't translate gfn to pfn with preemption disabled
  2011-10-31 12:50     ` [PATCH 01/14] KVM: PPC: e500: don't translate gfn to pfn with Avi Kivity
@ 2011-10-31 18:52       ` Scott Wood
  -1 siblings, 0 replies; 82+ messages in thread
From: Scott Wood @ 2011-10-31 18:52 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 10/31/2011 07:50 AM, Avi Kivity wrote:
> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> +/* sesel is index into the set, not the whole array */
>> +static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
>> +			struct tlbe *gtlbe,
>> +			struct tlbe *stlbe,
>> +			int stlbsel, int sesel)
>> +{
>> +	int stid;
>> +
>> +	preempt_disable();
>> +	stid = kvmppc_e500_get_sid(vcpu_e500, get_tlb_ts(gtlbe),
>> +				   get_tlb_tid(gtlbe),
>> +				   get_cur_pr(&vcpu_e500->vcpu), 0);
>> +
>> +	stlbe->mas1 |= MAS1_TID(stid);
>> +	write_host_tlbe(vcpu_e500, stlbsel, sesel, stlbe);
>> +	preempt_enable();
>> +}
>> +
>>
> 
> This naked preempt_disable() is fishy.  What happens if we're migrated
> immediately afterwards? we fault again and redo?

Yes, we'll fault again.

We just want to make sure that the sid is still valid when we write the
TLB entry.  If we migrate, we'll get a new sid and the old TLB entry
will be irrelevant, even if we migrate back to the same CPU.  The entire
TLB will be flushed before a sid is reused.

If we don't do the preempt_disable(), we could get the sid on one CPU
and then get migrated and run it on another CPU where that sid is (or
will be) valid for a different context.  Or we could run out of sids
while preempted, making the sid allocated before this possibly valid for
a different context.

-Scott

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-10-31 13:24     ` Avi Kivity
@ 2011-10-31 20:12       ` Scott Wood
  -1 siblings, 0 replies; 82+ messages in thread
From: Scott Wood @ 2011-10-31 20:12 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 10/31/2011 08:24 AM, Avi Kivity wrote:
> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> From: Scott Wood <scottwood@freescale.com>
>>
>> This implements a shared-memory API for giving host userspace access to
>> the guest's TLB.
>>
>>
>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>> index 7945b0b..ab1136f 100644
>> --- a/Documentation/virtual/kvm/api.txt
>> +++ b/Documentation/virtual/kvm/api.txt
>> @@ -1383,6 +1383,38 @@ The following flags are defined:
>>  If datamatch flag is set, the event will be signaled only if the written value
>>  to the registered address is equal to datamatch in struct kvm_ioeventfd.
>>  
>> +4.59 KVM_DIRTY_TLB
>> +
>> +Capability: KVM_CAP_SW_TLB
>> +Architectures: ppc
>> +Type: vcpu ioctl
>> +Parameters: struct kvm_dirty_tlb (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +struct kvm_dirty_tlb {
>> +	__u64 bitmap;
>> +	__u32 num_dirty;
>> +};
> 
> This is not 32/64 bit safe.  e500 is 32-bit only, yes?

e5500 is 64-bit -- we don't support it with KVM yet, but it's planned.

> but what if someone wants to emulate an e500 on a ppc64?  maybe it's better to add
> padding here.

What is unsafe about it?  Are you picturing TLBs with more than 4
billion entries?

There shouldn't be any alignment issues.

> Another alternative is to drop the num_dirty field (and let the kernel
> compute it instead, shouldn't take long?), and have the third argument
> to ioctl() reference the bitmap directly.

The idea was to make it possible for the kernel to apply a threshold
above which it would be better to ignore the bitmap entirely and flush
everything:

http://www.spinics.net/lists/kvm/msg50079.html

Currently we always just flush everything, and QEMU always says
everything is dirty when it makes a change, but the API is there if needed.

>>  4.62 KVM_CREATE_SPAPR_TCE
>>  
>>  Capability: KVM_CAP_SPAPR_TCE
>> @@ -1700,3 +1732,45 @@ HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
>>  HTAB invisible to the guest.
>>  
>>  When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.
>> +
>> +6.3 KVM_CAP_SW_TLB
>> +
>> +Architectures: ppc
>> +Parameters: args[0] is the address of a struct kvm_config_tlb
>> +Returns: 0 on success; -1 on error
>> +
>> +struct kvm_config_tlb {
>> +	__u64 params;
>> +	__u64 array;
>> +	__u32 mmu_type;
>> +	__u32 array_len;
>> +};
> 
> Would it not be simpler to use args[0-3] for this, instead of yet
> another indirection?

I suppose so.  Its existence as a struct dates from when it was its own
ioctl rather than an argument to KVM_ENABLE_CAP.

>> +Configures the virtual CPU's TLB array, establishing a shared memory area
>> +between userspace and KVM.  The "params" and "array" fields are userspace
>> +addresses of mmu-type-specific data structures.  The "array_len" field is a
>> +safety mechanism, and should be set to the size in bytes of the memory that
>> +userspace has reserved for the array.  It must be at least the size dictated
>> +by "mmu_type" and "params".
>> +
>> +While KVM_RUN is active, the shared region is under control of KVM.  Its
>> +contents are undefined, and any modification by userspace results in
>> +boundedly undefined behavior.
>> +
>> +On return from KVM_RUN, the shared region will reflect the current state of
>> +the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
>> +to tell KVM which entries have been changed, prior to calling KVM_RUN again
>> +on this vcpu.
> 
> We already have another mechanism for such shared memory,
> mmap(vcpu_fd).  x86 uses it for the coalesced mmio region as well as the
> traditional kvm_run area.  Please consider using it.

What does it buy us, other than needing a separate codepath in QEMU to
allocate the memory differently based on whether KVM (and this feature)
are being used, since QEMU uses this for its own MMU representation?

This API has been discussed extensively, and the code using it is
already in mainline QEMU.  This aspect of it hasn't changed since the
discussion back in February:

http://www.spinics.net/lists/kvm/msg50102.html

I'd prefer to avoid another round of major overhaul without a really
good reason.

>> +For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
>> + - The "params" field is of type "struct kvm_book3e_206_tlb_params".
>> + - The "array" field points to an array of type "struct
>> +   kvm_book3e_206_tlb_entry".
>> + - The array consists of all entries in the first TLB, followed by all
>> +   entries in the second TLB.
>> + - Within a TLB, entries are ordered first by increasing set number.  Within a
>> +   set, entries are ordered by way (increasing ESEL).
>> + - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
>> +   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
>> + - The tsize field of mas1 shall be set to 4K on TLB0, even though the
>> +   hardware ignores this value for TLB0.
> 
> Holy shit.

You were the one that first suggested we use shared data:
http://www.spinics.net/lists/kvm/msg49802.html

These are the assumptions needed to make such an interface well-defined.

>> @@ -95,6 +90,9 @@ struct kvmppc_vcpu_e500 {
>>  	u32 tlb1cfg;
>>  	u64 mcar;
>>  
>> +	struct page **shared_tlb_pages;
>> +	int num_shared_tlb_pages;
>> +
> 
> I missed the requirement that things be page aligned.

They don't need to be, we'll ignore the data before and after the shared
area.

> If you use mmap(vcpu_fd) this becomes simpler; you can use
> get_free_pages() and have a single pointer.  You can also use vmap() on
> this array (but get_free_pages() is faster).

We do use vmap().  This is just the bookkeeping so we know what pages to
free later.


-Scott

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 08/14] Revert "KVM: PPC: Add support for explicit HIOR setting"
  2011-10-31 13:30     ` [PATCH 08/14] Revert "KVM: PPC: Add support for explicit HIOR Avi Kivity
@ 2011-10-31 23:49       ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-10-31 23:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-ppc, kvm list, Marcelo Tosatti


On 31.10.2011, at 06:30, Avi Kivity <avi@redhat.com> wrote:

> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> This reverts commit 11d7596e18a712dc3bc29d45662ec111fd65946b. It exceeded
>> the padding on the SREGS struct, rendering the ABI backwards-incompatible.
> 
> Can't find the commit hash.  Please use hashes from the Linus tree when
> possible.
> 
> This needs to be backported?  To which trees?

I sent the revert pretty quickly after the original patch, but my last pull request seems to have gotten lost and I realized it way too late.

So worst case it's in 3.0. Maybe 3.1. Could you please check? I'm currently slightly limited on connectivity.

Alex

> 


* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-10-31 20:12       ` Scott Wood
@ 2011-11-01  8:58         ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-11-01  8:58 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 10/31/2011 10:12 PM, Scott Wood wrote:
> >> +4.59 KVM_DIRTY_TLB
> >> +
> >> +Capability: KVM_CAP_SW_TLB
> >> +Architectures: ppc
> >> +Type: vcpu ioctl
> >> +Parameters: struct kvm_dirty_tlb (in)
> >> +Returns: 0 on success, -1 on error
> >> +
> >> +struct kvm_dirty_tlb {
> >> +	__u64 bitmap;
> >> +	__u32 num_dirty;
> >> +};
> > 
> > This is not 32/64 bit safe.  e500 is 32-bit only, yes?
>
> e5500 is 64-bit -- we don't support it with KVM yet, but it's planned.
>
> > but what if someone wants to emulate an e500 on a ppc64?  maybe it's better to add
> > padding here.
>
> What is unsafe about it?  Are you picturing TLBs with more than 4
> billion entries?

sizeof(struct kvm_tlb_dirty) == 12 for 32-bit userspace, but ==  16 for
64-bit userspace and the kernel.  ABI structures must have the same
alignment and size for 32/64 bit userspace, or they need compat handling.

> There shouldn't be any alignment issues.
>
> > Another alternative is to drop the num_dirty field (and let the kernel
> > compute it instead, shouldn't take long?), and have the third argument
> > to ioctl() reference the bitmap directly.
>
> The idea was to make it possible for the kernel to apply a threshold
> above which it would be better to ignore the bitmap entirely and flush
> everything:
>
> http://www.spinics.net/lists/kvm/msg50079.html
>
> Currently we always just flush everything, and QEMU always says
> everything is dirty when it makes a change, but the API is there if needed.

Right, but you don't need num_dirty for it.  There are typically only a
few dozen entries, yes?  It should take a trivial amount of time to
calculate its weight.

> >> +Configures the virtual CPU's TLB array, establishing a shared memory area
> >> +between userspace and KVM.  The "params" and "array" fields are userspace
> >> +addresses of mmu-type-specific data structures.  The "array_len" field is a
> >> +safety mechanism, and should be set to the size in bytes of the memory that
> >> +userspace has reserved for the array.  It must be at least the size dictated
> >> +by "mmu_type" and "params".
> >> +
> >> +While KVM_RUN is active, the shared region is under control of KVM.  Its
> >> +contents are undefined, and any modification by userspace results in
> >> +boundedly undefined behavior.
> >> +
> >> +On return from KVM_RUN, the shared region will reflect the current state of
> >> +the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
> >> +to tell KVM which entries have been changed, prior to calling KVM_RUN again
> >> +on this vcpu.
> > 
> > We already have another mechanism for such shared memory,
> > mmap(vcpu_fd).  x86 uses it for the coalesced mmio region as well as the
> > traditional kvm_run area.  Please consider using it.
>
> What does it buy us, other than needing a separate codepath in QEMU to
> allocate the memory differently based on whether KVM (and this feature)

The ability to use get_free_pages() and ordinary kernel memory directly,
instead of indirection through a struct page ** array.

> are being used, since QEMU uses this for its own MMU representation?
>
> This API has been discussed extensively, and the code using it is
> already in mainline QEMU.  This aspect of it hasn't changed since the
> discussion back in February:
>
> http://www.spinics.net/lists/kvm/msg50102.html
>
> I'd prefer to avoid another round of major overhaul without a really
> good reason.

Me too, but I also prefer not to make ABI choices by inertia.  ABI is
practically the only thing I care about wrt non-x86 (other than
whitespace, of course).  Please involve me in the discussions earlier in
the future.

> >> +For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
> >> + - The "params" field is of type "struct kvm_book3e_206_tlb_params".
> >> + - The "array" field points to an array of type "struct
> >> +   kvm_book3e_206_tlb_entry".
> >> + - The array consists of all entries in the first TLB, followed by all
> >> +   entries in the second TLB.
> >> + - Within a TLB, entries are ordered first by increasing set number.  Within a
> >> +   set, entries are ordered by way (increasing ESEL).
> >> + - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
> >> +   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
> >> + - The tsize field of mas1 shall be set to 4K on TLB0, even though the
> >> +   hardware ignores this value for TLB0.
> > 
> > Holy shit.
>
> You were the one that first suggested we use shared data:
> http://www.spinics.net/lists/kvm/msg49802.html
>
> These are the assumptions needed to make such an interface well-defined.

Just remarking on the complexity, don't take it personally.

> >> @@ -95,6 +90,9 @@ struct kvmppc_vcpu_e500 {
> >>  	u32 tlb1cfg;
> >>  	u64 mcar;
> >>  
> >> +	struct page **shared_tlb_pages;
> >> +	int num_shared_tlb_pages;
> >> +
> > 
> > I missed the requirement that things be page aligned.
>
> They don't need to be, we'll ignore the data before and after the shared
> area.
>
> > If you use mmap(vcpu_fd) this becomes simpler; you can use
> > get_free_pages() and have a single pointer.  You can also use vmap() on
> > this array (but get_free_pages() is faster).
>
> We do use vmap().  This is just the bookkeeping so we know what pages to
> free later.
>

Ah, I missed that (and the pointer).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 01/14] KVM: PPC: e500: don't translate gfn to pfn with preemption disabled
  2011-10-31 18:52       ` [PATCH 01/14] KVM: PPC: e500: don't translate gfn to pfn with Scott Wood
@ 2011-11-01  9:00         ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-11-01  9:00 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 10/31/2011 08:52 PM, Scott Wood wrote:
> On 10/31/2011 07:50 AM, Avi Kivity wrote:
> > On 10/31/2011 09:53 AM, Alexander Graf wrote:
> >> +/* sesel is index into the set, not the whole array */
> >> +static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
> >> +			struct tlbe *gtlbe,
> >> +			struct tlbe *stlbe,
> >> +			int stlbsel, int sesel)
> >> +{
> >> +	int stid;
> >> +
> >> +	preempt_disable();
> >> +	stid = kvmppc_e500_get_sid(vcpu_e500, get_tlb_ts(gtlbe),
> >> +				   get_tlb_tid(gtlbe),
> >> +				   get_cur_pr(&vcpu_e500->vcpu), 0);
> >> +
> >> +	stlbe->mas1 |= MAS1_TID(stid);
> >> +	write_host_tlbe(vcpu_e500, stlbsel, sesel, stlbe);
> >> +	preempt_enable();
> >> +}
> >> +
> >>
> > 
> > This naked preempt_disable() is fishy.  What happens if we're migrated
> > immediately afterwards? we fault again and redo?
>
> Yes, we'll fault again.
>
> We just want to make sure that the sid is still valid when we write the
> TLB entry.  If we migrate, we'll get a new sid and the old TLB entry
> will be irrelevant, even if we migrate back to the same CPU.  The entire
> TLB will be flushed before a sid is reused.
>
> If we don't do the preempt_disable(), we could get the sid on one CPU
> and then get migrated and run it on another CPU where that sid is (or
> will be) valid for a different context.  Or we could run out of sids
> while preempted, making the sid allocated before this possibly valid for
> a different context.
>
>

Makes sense, thanks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-11-01  8:58         ` Avi Kivity
@ 2011-11-01  9:55           ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-11-01  9:55 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 11/01/2011 10:58 AM, Avi Kivity wrote:
> > > We already have another mechanism for such shared memory,
> > > mmap(vcpu_fd).  x86 uses it for the coalesced mmio region as well as the
> > > traditional kvm_run area.  Please consider using it.
> >
> > What does it buy us, other than needing a separate codepath in QEMU to
> > allocate the memory differently based on whether KVM (and this feature)
>
> The ability to use get_free_pages() and ordinary kernel memory directly,
> instead of indirection through a struct page ** array.

Ugh, you use vmap(), so this doesn't hold.

get_free_pages() is faster than vmap(), but not by much.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-11-01  8:58         ` Avi Kivity
@ 2011-11-01 16:16           ` Scott Wood
  -1 siblings, 0 replies; 82+ messages in thread
From: Scott Wood @ 2011-11-01 16:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 11/01/2011 03:58 AM, Avi Kivity wrote:
> On 10/31/2011 10:12 PM, Scott Wood wrote:
>>>> +4.59 KVM_DIRTY_TLB
>>>> +
>>>> +Capability: KVM_CAP_SW_TLB
>>>> +Architectures: ppc
>>>> +Type: vcpu ioctl
>>>> +Parameters: struct kvm_dirty_tlb (in)
>>>> +Returns: 0 on success, -1 on error
>>>> +
>>>> +struct kvm_dirty_tlb {
>>>> +	__u64 bitmap;
>>>> +	__u32 num_dirty;
>>>> +};
>>>
>>> This is not 32/64 bit safe.  e500 is 32-bit only, yes?
>>
>> e5500 is 64-bit -- we don't support it with KVM yet, but it's planned.
>>
>>> but what if someone wants to emulate an e500 on a ppc64?  maybe it's better to add
>>> padding here.
>>
>> What is unsafe about it?  Are you picturing TLBs with more than 4
>> billion entries?
> 
> sizeof(struct kvm_tlb_dirty) == 12 for 32-bit userspace, but ==  16 for
> 64-bit userspace and the kernel.  ABI structures must have the same
> alignment and size for 32/64 bit userspace, or they need compat handling.

The size is 16 on 32-bit ppc -- the alignment of __u64 forces this.  It
looks like this is different in the 32-bit x86 ABI.

We can pad explicitly if you prefer.

>> There shouldn't be any alignment issues.
>>
>>> Another alternative is to drop the num_dirty field (and let the kernel
>>> compute it instead, shouldn't take long?), and have the third argument
>>> to ioctl() reference the bitmap directly.
>>
>> The idea was to make it possible for the kernel to apply a threshold
>> above which it would be better to ignore the bitmap entirely and flush
>> everything:
>>
>> http://www.spinics.net/lists/kvm/msg50079.html
>>
>> Currently we always just flush everything, and QEMU always says
>> everything is dirty when it makes a change, but the API is there if needed.
> 
> Right, but you don't need num_dirty for it.  There are typically only a
> few dozen entries, yes?  It should take a trivial amount of time to
> calculate its weight.

There are over 500 entries currently, and QEMU could make it much larger
if it wants to decrease guest-visible faults on certain workloads.

It's not the most important feature, indeed we currently ignore the
bitmap entirely.  But it could be useful depending on how the API is
used in the future, and I don't think we gain much by dropping it at
this point.  Alex, any thoughts?

>> This API has been discussed extensively, and the code using it is
>> already in mainline QEMU.  This aspect of it hasn't changed since the
>> discussion back in February:
>>
>> http://www.spinics.net/lists/kvm/msg50102.html
>>
>> I'd prefer to avoid another round of major overhaul without a really
>> good reason.
> 
> Me too, but I also prefer not to make ABI choices by inertia.  ABI is
> practically the only thing I care about wrt non-x86 (other than
> whitespace, of course).  Please involve me in the discussions earlier in
> the future.

You participated in that thread. :-)

I apologize for forgetting the main kvm list (rather than just kvm-ppc)
when sending out the most recent batch of patches.

>>>> +For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
>>>> + - The "params" field is of type "struct kvm_book3e_206_tlb_params".
>>>> + - The "array" field points to an array of type "struct
>>>> +   kvm_book3e_206_tlb_entry".
>>>> + - The array consists of all entries in the first TLB, followed by all
>>>> +   entries in the second TLB.
>>>> + - Within a TLB, entries are ordered first by increasing set number.  Within a
>>>> +   set, entries are ordered by way (increasing ESEL).
>>>> + - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
>>>> +   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
>>>> + - The tsize field of mas1 shall be set to 4K on TLB0, even though the
>>>> +   hardware ignores this value for TLB0.
>>>
>>> Holy shit.
>>
>> You were the one that first suggested we use shared data:
>> http://www.spinics.net/lists/kvm/msg49802.html
>>
>> These are the assumptions needed to make such an interface well-defined.
> 
> Just remarking on the complexity, don't take it personally.

:-)

Just wasn't sure whether the implication was that it was too complex.

-scott

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
@ 2011-11-01 16:16           ` Scott Wood
  0 siblings, 0 replies; 82+ messages in thread
From: Scott Wood @ 2011-11-01 16:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 11/01/2011 03:58 AM, Avi Kivity wrote:
> On 10/31/2011 10:12 PM, Scott Wood wrote:
>>>> +4.59 KVM_DIRTY_TLB
>>>> +
>>>> +Capability: KVM_CAP_SW_TLB
>>>> +Architectures: ppc
>>>> +Type: vcpu ioctl
>>>> +Parameters: struct kvm_dirty_tlb (in)
>>>> +Returns: 0 on success, -1 on error
>>>> +
>>>> +struct kvm_dirty_tlb {
>>>> +	__u64 bitmap;
>>>> +	__u32 num_dirty;
>>>> +};
>>>
>>> This is not 32/64 bit safe.  e500 is 32-bit only, yes?
>>
>> e5500 is 64-bit -- we don't support it with KVM yet, but it's planned.
>>
>>> but what if someone wants to emulate an e500 on a ppc64?  maybe it's better to add
>>> padding here.
>>
>> What is unsafe about it?  Are you picturing TLBs with more than 4
>> billion entries?
> 
> sizeof(struct kvm_dirty_tlb) == 12 for 32-bit userspace, but == 16 for
> 64-bit userspace and the kernel.  ABI structures must have the same
> alignment and size for 32/64 bit userspace, or they need compat handling.

The size is 16 on 32-bit ppc -- the alignment of __u64 forces this.  It
looks like this is different in the 32-bit x86 ABI.

We can pad explicitly if you prefer.

>> There shouldn't be any alignment issues.
>>
>>> Another alternative is to drop the num_dirty field (and let the kernel
>>> compute it instead, shouldn't take long?), and have the third argument
>>> to ioctl() reference the bitmap directly.
>>
>> The idea was to make it possible for the kernel to apply a threshold
>> above which it would be better to ignore the bitmap entirely and flush
>> everything:
>>
>> http://www.spinics.net/lists/kvm/msg50079.html
>>
>> Currently we always just flush everything, and QEMU always says
>> everything is dirty when it makes a change, but the API is there if needed.
> 
> Right, but you don't need num_dirty for it.  There are typically only a
> few dozen entries, yes?  It should take a trivial amount of time to
> calculate its weight.

There are over 500 entries currently, and QEMU could make it much larger
if it wants to decrease guest-visible faults on certain workloads.

It's not the most important feature, indeed we currently ignore the
bitmap entirely.  But it could be useful depending on how the API is
used in the future, and I don't think we gain much by dropping it at
this point.  Alex, any thoughts?

>> This API has been discussed extensively, and the code using it is
>> already in mainline QEMU.  This aspect of it hasn't changed since the
>> discussion back in February:
>>
>> http://www.spinics.net/lists/kvm/msg50102.html
>>
>> I'd prefer to avoid another round of major overhaul without a really
>> good reason.
> 
> Me too, but I also prefer not to make ABI choices by inertia.  ABI is
> practically the only thing I care about wrt non-x86 (other than
> whitespace, of course).  Please involve me in the discussions earlier in
> the future.

You participated in that thread. :-)

I apologize for forgetting the main kvm list (rather than just kvm-ppc)
when sending out the most recent batch of patches.

>>>> +For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
>>>> + - The "params" field is of type "struct kvm_book3e_206_tlb_params".
>>>> + - The "array" field points to an array of type "struct
>>>> +   kvm_book3e_206_tlb_entry".
>>>> + - The array consists of all entries in the first TLB, followed by all
>>>> +   entries in the second TLB.
>>>> + - Within a TLB, entries are ordered first by increasing set number.  Within a
>>>> +   set, entries are ordered by way (increasing ESEL).
>>>> + - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
>>>> +   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
>>>> + - The tsize field of mas1 shall be set to 4K on TLB0, even though the
>>>> +   hardware ignores this value for TLB0.
>>>
>>> Holy shit.
>>
>> You were the one that first suggested we use shared data:
>> http://www.spinics.net/lists/kvm/msg49802.html
>>
>> These are the assumptions needed to make such an interface well-defined.
> 
> Just remarking on the complexity, don't take it personally.

:-)

Just wasn't sure whether the implication was that it was too complex.

-scott


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-11-01 16:16           ` Scott Wood
@ 2011-11-02 10:33             ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-11-02 10:33 UTC (permalink / raw)
  To: Scott Wood; +Cc: Alexander Graf, kvm-ppc, kvm list, Marcelo Tosatti

On 11/01/2011 06:16 PM, Scott Wood wrote:
> > 
> > sizeof(struct kvm_dirty_tlb) == 12 for 32-bit userspace, but == 16 for
> > 64-bit userspace and the kernel.  ABI structures must have the same
> > alignment and size for 32/64 bit userspace, or they need compat handling.
>
> The size is 16 on 32-bit ppc -- the alignment of __u64 forces this.  It
> looks like this is different in the 32-bit x86 ABI.

Right, __u64 alignment on i386 is 4.

> We can pad explicitly if you prefer.

No real need - unless it may be reused by another arch?  I think that's
unlikely.

> >> This API has been discussed extensively, and the code using it is
> >> already in mainline QEMU.  This aspect of it hasn't changed since the
> >> discussion back in February:
> >>
> >> http://www.spinics.net/lists/kvm/msg50102.html
> >>
> >> I'd prefer to avoid another round of major overhaul without a really
> >> good reason.
> > 
> > Me too, but I also prefer not to make ABI choices by inertia.  ABI is
> > practically the only thing I care about wrt non-x86 (other than
> > whitespace, of course).  Please involve me in the discussions earlier in
> > the future.
>
> You participated in that thread. :-)

Well, my memory isn't what it used to be, or at least what I seem to
remember it used to be.

> >>
> >> These are the assumptions needed to make such an interface well-defined.
> > 
> > Just remarking on the complexity, don't take it personally.
>
> :-)
>
> Just wasn't sure whether the implication was that it was too complex.
>

It is too complex, but that's entirely the fault of the hardware.  All
we can do is complain and enjoy the guaranteed job security.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-11-10 14:20             ` Alexander Graf
@ 2011-11-10 14:16               ` Avi Kivity
  -1 siblings, 0 replies; 82+ messages in thread
From: Avi Kivity @ 2011-11-10 14:16 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Scott Wood, kvm-ppc, kvm list, Marcelo Tosatti

On 11/10/2011 04:20 PM, Alexander Graf wrote:
>> The size is 16 on 32-bit ppc -- the alignment of __u64 forces this.  It
>> looks like this is different in the 32-bit x86 ABI.
>>
>> We can pad explicitly if you prefer.
>
> I would prefer if we keep this stable :). There's no good reason to
> pad it - ppc64 creates the same struct definition.
>
>> There are over 500 entries currently, and QEMU could make it much larger
>> if it wants to decrease guest-visible faults on certain workloads.
>>
>> It's not the most important feature, indeed we currently ignore the
>> bitmap entirely.  But it could be useful depending on how the API is
>> used in the future, and I don't think we gain much by dropping it at
>> this point.  Alex, any thoughts?
>
> The kernel can always opt in to ignore the field if it chooses to, so
> I don't see the point in dropping it. There shouldn't be an alignment
> problem in the first place :).

Ok.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 04/14] KVM: PPC: e500: MMU API
  2011-11-01 16:16           ` Scott Wood
@ 2011-11-10 14:20             ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-11-10 14:20 UTC (permalink / raw)
  To: Scott Wood; +Cc: Avi Kivity, kvm-ppc, kvm list, Marcelo Tosatti

On 11/01/2011 05:16 PM, Scott Wood wrote:
> On 11/01/2011 03:58 AM, Avi Kivity wrote:
>> On 10/31/2011 10:12 PM, Scott Wood wrote:
>>>>> +4.59 KVM_DIRTY_TLB
>>>>> +
>>>>> +Capability: KVM_CAP_SW_TLB
>>>>> +Architectures: ppc
>>>>> +Type: vcpu ioctl
>>>>> +Parameters: struct kvm_dirty_tlb (in)
>>>>> +Returns: 0 on success, -1 on error
>>>>> +
>>>>> +struct kvm_dirty_tlb {
>>>>> +	__u64 bitmap;
>>>>> +	__u32 num_dirty;
>>>>> +};
>>>> This is not 32/64 bit safe.  e500 is 32-bit only, yes?
>>> e5500 is 64-bit -- we don't support it with KVM yet, but it's planned.
>>>
>>>> but what if someone wants to emulate an e500 on a ppc64?  maybe it's better to add
>>>> padding here.
>>> What is unsafe about it?  Are you picturing TLBs with more than 4
>>> billion entries?
>> sizeof(struct kvm_dirty_tlb) == 12 for 32-bit userspace, but == 16 for
>> 64-bit userspace and the kernel.  ABI structures must have the same
>> alignment and size for 32/64 bit userspace, or they need compat handling.
> The size is 16 on 32-bit ppc -- the alignment of __u64 forces this.  It
> looks like this is different in the 32-bit x86 ABI.
>
> We can pad explicitly if you prefer.

I would prefer if we keep this stable :). There's no good reason to pad 
it - ppc64 creates the same struct definition.

>>> There shouldn't be any alignment issues.
>>>
>>>> Another alternative is to drop the num_dirty field (and let the kernel
>>>> compute it instead, shouldn't take long?), and have the third argument
>>>> to ioctl() reference the bitmap directly.
>>> The idea was to make it possible for the kernel to apply a threshold
>>> above which it would be better to ignore the bitmap entirely and flush
>>> everything:
>>>
>>> http://www.spinics.net/lists/kvm/msg50079.html
>>>
>>> Currently we always just flush everything, and QEMU always says
>>> everything is dirty when it makes a change, but the API is there if needed.
>> Right, but you don't need num_dirty for it.  There are typically only a
>> few dozen entries, yes?  It should take a trivial amount of time to
>> calculate its weight.
> There are over 500 entries currently, and QEMU could make it much larger
> if it wants to decrease guest-visible faults on certain workloads.
>
> It's not the most important feature, indeed we currently ignore the
> bitmap entirely.  But it could be useful depending on how the API is
> used in the future, and I don't think we gain much by dropping it at
> this point.  Alex, any thoughts?

The kernel can always opt in to ignore the field if it chooses to, so I 
don't see the point in dropping it. There shouldn't be an alignment 
problem in the first place :).


Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-10-31 13:36     ` Avi Kivity
@ 2011-11-10 14:22       ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-11-10 14:22 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-ppc, kvm list, Marcelo Tosatti, Jan Kiszka

On 10/31/2011 02:36 PM, Avi Kivity wrote:
> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> Right now we transfer a static struct every time we want to get or set
>> registers. Unfortunately, over time we realize that there are more of
>> these than we thought of before and the extensibility and flexibility of
>> transferring a full struct every time is limited.
>>
>> So this is a new approach to the problem. With these new ioctls, we can
>> get and set a single register that is identified by an ID. This allows for
>> very precise and limited transmittal of data. When we later realize that
>> it's a better idea to shove over multiple registers at once, we can reuse
>> most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
>> interface.
>>
>> The only downside I see to this one is that it needs to pad to 1024 bits
>> (hardware is already on 512 bit registers, so I wanted to leave some room)
>> which is slightly too much for transmitting only 64 bits. But if that's all
>> the tradeoff we have to do for getting an extensible interface, I'd say go
>> for it nevertheless.
> Do we want this for x86 too?  How often do we want just one register?

I'm not sure. Depends on your user space I suppose :). If you want a 
simple debugging tool that exposes register poking directly to user 
space, then it can be handy.

>>
>> +4.64 KVM_SET_ONE_REG
>> +
>> +Capability: KVM_CAP_ONE_REG
>> +Architectures: all
>> +Type: vcpu ioctl
>> +Parameters: struct kvm_one_reg (in)
>> +Returns: 0 on success, negative value on failure
>> +
>> +struct kvm_one_reg {
>> +       __u64 id;
> would be better to have a register set (in x86 terms,
> gpr/x86/sse/cr/xcr/msr/special) and an ID within the set.  __u64 is
> excessive, I hope.

Yeah, we have that in the ID. But since the sets are arch specific I'd 
rather keep the definition of which parts of the ID are used for the set 
and which are used for the actual register id inside that set to the arch.

>> +       union {
>> +               __u8 reg8;
>> +               __u16 reg16;
>> +               __u32 reg32;
>> +               __u64 reg64;
>> +               __u8 reg128[16];
>> +               __u8 reg256[32];
>> +               __u8 reg512[64];
>> +               __u8 reg1024[128];
>> +       } u;
>> +};
>> +
>> +Using this ioctl, a single vcpu register can be set to a specific value
>> +defined by user space with the passed in struct kvm_one_reg. There can
>> +be architecture-agnostic and architecture-specific registers. Each has
>> +its own range of operation and its own constants and width. To keep
>> +track of the implemented registers, find a list below:
>> +
>> +  Arch  |       Register        | Width (bits)
>> +        |                       |
>> +
>>
> One possible issue is that certain registers have mutually exclusive
> values, so you may need to issue multiple calls to get the right
> sequence.  You probably don't have that on ppc.

I'm fairly sure we don't. But even if so, it's the same as running code
inside the guest, so it should come naturally, no?


Alex


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 13/14] KVM: PPC: E500: Support hugetlbfs
  2011-10-31 13:38     ` Avi Kivity
@ 2011-11-10 14:24       ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-11-10 14:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm-ppc, kvm list, Marcelo Tosatti

On 10/31/2011 02:38 PM, Avi Kivity wrote:
> On 10/31/2011 09:53 AM, Alexander Graf wrote:
>> With hugetlbfs support emerging on e500, we should also support KVM
>> backing its guest memory by it.
>>
>> This patch adds support for hugetlbfs into the e500 shadow mmu code.
>>
>>
>> @@ -673,12 +674,31 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>>   				pfn &= ~(tsize_pages - 1);
>>   				break;
>>   			}
>> +		} else if (vma && hva >= vma->vm_start &&
>> +                           (vma->vm_flags & VM_HUGETLB)) {
>> +			unsigned long psize = vma_kernel_pagesize(vma);
>>
> Leading spaces spotted.

Oh no! You really do read whitespace :).


Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-10-31  7:53   ` Alexander Graf
@ 2011-11-10 16:05     ` Marcelo Tosatti
  -1 siblings, 0 replies; 82+ messages in thread
From: Marcelo Tosatti @ 2011-11-10 16:05 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list

On Mon, Oct 31, 2011 at 08:53:11AM +0100, Alexander Graf wrote:
> Right now we transfer a static struct every time we want to get or set
> registers. Unfortunately, over time we realize that there are more of
> these than we thought of before and the extensibility and flexibility of
> transferring a full struct every time is limited.
> 
> So this is a new approach to the problem. With these new ioctls, we can
> get and set a single register that is identified by an ID. This allows for
> very precise and limited transmittal of data. When we later realize that
> it's a better idea to shove over multiple registers at once, we can reuse
> most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
> interface.
> 
> The only downside I see to this one is that it needs to pad to 1024 bits
> (hardware is already on 512 bit registers, so I wanted to leave some room)
> which is slightly too much for transmitting only 64 bits. But if that's all
> the tradeoff we have to do for getting an extensible interface, I'd say go
> for it nevertheless.
> 
> Signed-off-by: Alexander Graf <agraf@suse.de>
> ---
>  Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
>  include/linux/kvm.h               |   32 +++++++++++++++++++++++
>  3 files changed, 130 insertions(+), 0 deletions(-)

I don't see the benefit of this generalization; the current structure,
where context information is hardcoded in the transmitted data, works well.

Avi?

> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index ab1136f..a23fe62 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1482,6 +1482,53 @@ is supported; 2 if the processor requires all virtual machines to have
>  an RMA, or 1 if the processor can use an RMA but doesn't require it,
>  because it supports the Virtual RMA (VRMA) facility.
>  
> +4.64 KVM_SET_ONE_REG
> +
> +Capability: KVM_CAP_ONE_REG
> +Architectures: all
> +Type: vcpu ioctl
> +Parameters: struct kvm_one_reg (in)
> +Returns: 0 on success, negative value on failure
> +
> +struct kvm_one_reg {
> +       __u64 id;
> +       union {
> +               __u8 reg8;
> +               __u16 reg16;
> +               __u32 reg32;
> +               __u64 reg64;
> +               __u8 reg128[16];
> +               __u8 reg256[32];
> +               __u8 reg512[64];
> +               __u8 reg1024[128];
> +       } u;
> +};
> +
> +Using this ioctl, a single vcpu register can be set to a specific value
> +defined by user space with the passed in struct kvm_one_reg. There can
> +be architecture-agnostic and architecture-specific registers. Each has
> +its own range of operation and its own constants and width. To keep
> +track of the implemented registers, find a list below:
> +
> +  Arch  |       Register        | Width (bits)
> +        |                       |
> +
> +4.65 KVM_GET_ONE_REG
> +
> +Capability: KVM_CAP_ONE_REG
> +Architectures: all
> +Type: vcpu ioctl
> +Parameters: struct kvm_one_reg (in and out)
> +Returns: 0 on success, negative value on failure
> +
> +This ioctl allows reading the value of a single register implemented
> +in a vcpu. The register to read is indicated by the "id" field of the
> +kvm_one_reg struct passed in. On success, the register value can be found
> +in the respective width field of the struct after this call.
> +
> +The list of registers accessible using this interface is identical to the
> +list in 4.64.
> +
>  5. The kvm_run structure
>  
>  Application code obtains a pointer to the kvm_run structure by
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index e75c5ac..39cdb3f 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -214,6 +214,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  	case KVM_CAP_PPC_UNSET_IRQ:
>  	case KVM_CAP_PPC_IRQ_LEVEL:
>  	case KVM_CAP_ENABLE_CAP:
> +	case KVM_CAP_ONE_REG:
>  		r = 1;
>  		break;
>  #ifndef CONFIG_KVM_BOOK3S_64_HV
> @@ -627,6 +628,32 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>  	return r;
>  }
>  
> +static int kvm_vcpu_ioctl_get_one_reg(struct kvm_vcpu *vcpu,
> +				      struct kvm_one_reg *reg)
> +{
> +	int r = -EINVAL;
> +
> +	switch (reg->id) {
> +	default:
> +		break;
> +	}
> +
> +	return r;
> +}
> +
> +static int kvm_vcpu_ioctl_set_one_reg(struct kvm_vcpu *vcpu,
> +				      struct kvm_one_reg *reg)
> +{
> +	int r = -EINVAL;
> +
> +	switch (reg->id) {
> +	default:
> +		break;
> +	}
> +
> +	return r;
> +}
> +
>  int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
>                                      struct kvm_mp_state *mp_state)
>  {
> @@ -666,6 +693,30 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  		break;
>  	}
>  
> +	case KVM_GET_ONE_REG:
> +	{
> +		struct kvm_one_reg reg;
> +		r = -EFAULT;
> +		if (copy_from_user(&reg, argp, sizeof(reg)))
> +			goto out;
> +		r = kvm_vcpu_ioctl_get_one_reg(vcpu, &reg);
> +		if (copy_to_user(argp, &reg, sizeof(reg))) {
> +			r = -EFAULT;
> +			goto out;
> +		}
> +		break;
> +	}
> +
> +	case KVM_SET_ONE_REG:
> +	{
> +		struct kvm_one_reg reg;
> +		r = -EFAULT;
> +		if (copy_from_user(&reg, argp, sizeof(reg)))
> +			goto out;
> +		r = kvm_vcpu_ioctl_set_one_reg(vcpu, &reg);
> +		break;
> +	}
> +
>  #ifdef CONFIG_KVM_E500
>  	case KVM_DIRTY_TLB: {
>  		struct kvm_dirty_tlb dirty;
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index a6b1295..e652a7b 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -557,6 +557,7 @@ struct kvm_ppc_pvinfo {
>  #define KVM_CAP_MAX_VCPUS 66       /* returns max vcpus per vm */
>  #define KVM_CAP_PPC_PAPR 68
>  #define KVM_CAP_SW_TLB 69
> +#define KVM_CAP_ONE_REG 70
>  #define KVM_CAP_S390_GMAP 71
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
> @@ -652,6 +653,34 @@ struct kvm_dirty_tlb {
>  	__u32 num_dirty;
>  };
>  
> +/* Available with KVM_CAP_ONE_REG */
> +
> +#define KVM_ONE_REG_GENERIC		0x0000000000000000ULL
> +
> +/*
> + * Architecture specific registers are to be defined in arch headers and
> + * ORed with the arch identifier.
> + */
> +#define KVM_ONE_REG_PPC			0x1000000000000000ULL
> +#define KVM_ONE_REG_X86			0x2000000000000000ULL
> +#define KVM_ONE_REG_IA64		0x3000000000000000ULL
> +#define KVM_ONE_REG_ARM			0x4000000000000000ULL
> +#define KVM_ONE_REG_S390		0x5000000000000000ULL
> +
> +struct kvm_one_reg {
> +	__u64 id;
> +	union {
> +		__u8 reg8;
> +		__u16 reg16;
> +		__u32 reg32;
> +		__u64 reg64;
> +		__u8 reg128[16];
> +		__u8 reg256[32];
> +		__u8 reg512[64];
> +		__u8 reg1024[128];
> +	} u;
> +};
> +
>  /*
>   * ioctls for VM fds
>   */
> @@ -780,6 +809,9 @@ struct kvm_dirty_tlb {
>  #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
>  /* Available with KVM_CAP_SW_TLB */
>  #define KVM_DIRTY_TLB		  _IOW(KVMIO,  0xaa, struct kvm_dirty_tlb)
> +/* Available with KVM_CAP_ONE_REG */
> +#define KVM_GET_ONE_REG		  _IOWR(KVMIO, 0xab, struct kvm_one_reg)
> +#define KVM_SET_ONE_REG		  _IOW(KVMIO,  0xac, struct kvm_one_reg)
>  
>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
>  
> -- 
> 1.6.0.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
@ 2011-11-10 16:05     ` Marcelo Tosatti
  0 siblings, 0 replies; 82+ messages in thread
From: Marcelo Tosatti @ 2011-11-10 16:05 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list

On Mon, Oct 31, 2011 at 08:53:11AM +0100, Alexander Graf wrote:
> Right now we transfer a static struct every time we want to get or set
> registers. Unfortunately, over time we realize that there are more of
> these than we thought of before and the extensibility and flexibility of
> transferring a full struct every time is limited.
> 
> So this is a new approach to the problem. With these new ioctls, we can
> get and set a single register that is identified by an ID. This allows for
> very precise and limited transmittal of data. When we later realize that
> it's a better idea to shove over multiple registers at once, we can reuse
> most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
> interface.
> 
> The only downside I see to this one is that it needs to pad to 1024 bits
> (hardware is already on 512 bit registers, so I wanted to leave some room)
> which is slightly too much for transmitting only 64 bits. But if that's all
> the tradeoff we have to do for getting an extensible interface, I'd say go
> for it nevertheless.
> 
> Signed-off-by: Alexander Graf <agraf@suse.de>
> ---
>  Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
>  include/linux/kvm.h               |   32 +++++++++++++++++++++++
>  3 files changed, 130 insertions(+), 0 deletions(-)

I don't see the benefit of this generalization, the current structure where 
context information is hardcoded in the data transmitted works well.

Avi?


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-11-10 16:05     ` Marcelo Tosatti
@ 2011-11-10 16:49       ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-11-10 16:49 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-ppc, kvm list

On 11/10/2011 05:05 PM, Marcelo Tosatti wrote:
> On Mon, Oct 31, 2011 at 08:53:11AM +0100, Alexander Graf wrote:
>> Right now we transfer a static struct every time we want to get or set
>> registers. Unfortunately, over time we realize that there are more of
>> these than we thought of before and the extensibility and flexibility of
>> transferring a full struct every time is limited.
>>
>> So this is a new approach to the problem. With these new ioctls, we can
>> get and set a single register that is identified by an ID. This allows for
>> very precise and limited transmittal of data. When we later realize that
>> it's a better idea to shove over multiple registers at once, we can reuse
>> most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS
>> interface.
>>
>> The only downpoint I see to this one is that it needs to pad to 1024 bits
>> (hardware is already on 512 bit registers, so I wanted to leave some room)
>> which is slightly too much for transmitting only 64 bits. But if that's all
>> the tradeoff we have to do for getting an extensible interface, I'd say go
>> for it nevertheless.
>>
>> Signed-off-by: Alexander Graf<agraf@suse.de>
>> ---
>>   Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
>>   arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
>>   include/linux/kvm.h               |   32 +++++++++++++++++++++++
>>   3 files changed, 130 insertions(+), 0 deletions(-)
> I don't see the benefit of this generalization, the current structure where
> context information is hardcoded in the data transmitted works well.

Well, unfortunately it doesn't work quite as well for us because we are 
a much more evolving platform. Also, there are a lot of edges and 
corners of the architecture that simply aren't implemented in KVM as of 
now. I want to have something extensible enough so we don't break the 
ABI along the way.


Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-11-10 16:49       ` Alexander Graf
@ 2011-11-10 17:35         ` Marcelo Tosatti
  -1 siblings, 0 replies; 82+ messages in thread
From: Marcelo Tosatti @ 2011-11-10 17:35 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list

On Thu, Nov 10, 2011 at 05:49:42PM +0100, Alexander Graf wrote:
> >>  Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
> >>  include/linux/kvm.h               |   32 +++++++++++++++++++++++
> >>  3 files changed, 130 insertions(+), 0 deletions(-)
> >I don't see the benefit of this generalization, the current structure where
> >context information is hardcoded in the data transmitted works well.
> 
> Well, unfortunately it doesn't work quite as well for us because we
> are a much more evolving platform. Also, there are a lot of edges
> and corners of the architecture that simply aren't implemented in
> KVM as of now. I want to have something extensible enough so we
> don't break the ABI along the way.

You still have to agree on a format between userspace and the kernel, right?
If either party fails to conform to it, you're doomed.

The problem with two interfaces is potential ambiguity: is
register X implemented through KVM_GET_ONE_REG and also through
KVM_GET_XYZ_REGISTER_SET? If it's accessible through two interfaces, what is
the register writeback order? Is there a plan to convert, etc.

If you agree these concerns are valid, perhaps this interface can be PPC
specific.

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-11-10 17:35         ` Marcelo Tosatti
@ 2011-11-15 23:45           ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-11-15 23:45 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-ppc, kvm list


On 10.11.2011, at 18:35, Marcelo Tosatti wrote:

> On Thu, Nov 10, 2011 at 05:49:42PM +0100, Alexander Graf wrote:
>>>> Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
>>>> arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
>>>> include/linux/kvm.h               |   32 +++++++++++++++++++++++
>>>> 3 files changed, 130 insertions(+), 0 deletions(-)
>>> I don't see the benefit of this generalization, the current structure where
>>> context information is hardcoded in the data transmitted works well.
>> 
>> Well, unfortunately it doesn't work quite as well for us because we
>> are a much more evolving platform. Also, there are a lot of edges
>> and corners of the architecture that simply aren't implemented in
>> KVM as of now. I want to have something extensible enough so we
>> don't break the ABI along the way.
> 
> You still have to agree on format between userspace and kernel, right?
> If either party fails to conform to that, you're doomed.

Yes, but we can shove registers back and forth without allocating 8kb of ram each time. If all we need to do is poke one register, we poke one register. If we poke 10, we poke the 10 we need to touch.

> The problem with two interfaces is potential ambiguity: is
> register X implemented through KVM_GET_ONE_REG and also through
> KVM_GET_XYZ_REGISTER_SET ? If its accessible by two interfaces, what is
> the register writeback order? Is there a plan to convert, etc.

Why writeback order? Register modification operations should always happen from the same thread the vCPU would run on at the end of the day, no? Avi even wanted to go as far as making this a syscall interface.

> If you agree these concerns are valid, perhaps this interface can be PPC
> specific.

I can always make it PPC specific, but I believe it would make sense as a generic interface for everyone, similar to how ENABLE_CAP can make sense for any arch.


Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-11-15 23:45           ` Alexander Graf
@ 2011-11-23 12:47             ` Marcelo Tosatti
  -1 siblings, 0 replies; 82+ messages in thread
From: Marcelo Tosatti @ 2011-11-23 12:47 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list

On Wed, Nov 16, 2011 at 12:45:45AM +0100, Alexander Graf wrote:
> 
> On 10.11.2011, at 18:35, Marcelo Tosatti wrote:
> 
> > On Thu, Nov 10, 2011 at 05:49:42PM +0100, Alexander Graf wrote:
> >>>> Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
> >>>> arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
> >>>> include/linux/kvm.h               |   32 +++++++++++++++++++++++
> >>>> 3 files changed, 130 insertions(+), 0 deletions(-)
> >>> I don't see the benefit of this generalization, the current structure where
> >>> context information is hardcoded in the data transmitted works well.
> >> 
> >> Well, unfortunately it doesn't work quite as well for us because we
> >> are a much more evolving platform. Also, there are a lot of edges
> >> and corners of the architecture that simply aren't implemented in
> >> KVM as of now. I want to have something extensible enough so we
> >> don't break the ABI along the way.
> > 
> > You still have to agree on format between userspace and kernel, right?
> > If either party fails to conform to that, you're doomed.
> 
> Yes, but we can shove registers back and forth without allocating 8kb of ram each time. If all we need to do is poke one register, we poke one register. If we poke 10, we poke the 10 we need to touch.
> 
> > The problem with two interfaces is potential ambiguity: is
> > register X implemented through KVM_GET_ONE_REG and also through
> > KVM_GET_XYZ_REGISTER_SET ? If its accessible by two interfaces, what is
> > the register writeback order? Is there a plan to convert, etc.
> 
> Why writeback order? Register modification operations should always happen from the same thread the vCPU would run on at the end of the day, no?

Yes, but there is a specified order in which the registers must be written
back, in case there are dependencies between them (QEMU's x86 code
does its best to document these dependencies).

All i'm saying is that two distinct interfaces make it potentially
confusing for the programmer. That said, its up to Avi to decide.

> Avi wanted to go as far as making this a syscall interface even.

Ok, then the full conversion work must be done.

> > If you agree these concerns are valid, perhaps this interface can be PPC
> > specific.
> 
> I can always make it PPC specific, but I believe it would make sense as a generic interface for everyone, similar to how ENABLE_CAP can make sense for any arch.
> 
> 
> Alex

^ permalink raw reply	[flat|nested] 82+ messages in thread


* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-11-23 12:47             ` Marcelo Tosatti
@ 2011-12-19 12:58               ` Alexander Graf
  -1 siblings, 0 replies; 82+ messages in thread
From: Alexander Graf @ 2011-12-19 12:58 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm-ppc, kvm list, Avi Kivity


On 23.11.2011, at 13:47, Marcelo Tosatti wrote:

> On Wed, Nov 16, 2011 at 12:45:45AM +0100, Alexander Graf wrote:
>> 
>> On 10.11.2011, at 18:35, Marcelo Tosatti wrote:
>> 
>>> On Thu, Nov 10, 2011 at 05:49:42PM +0100, Alexander Graf wrote:
>>>>>> Documentation/virtual/kvm/api.txt |   47 ++++++++++++++++++++++++++++++++++
>>>>>> arch/powerpc/kvm/powerpc.c        |   51 +++++++++++++++++++++++++++++++++++++
>>>>>> include/linux/kvm.h               |   32 +++++++++++++++++++++++
>>>>>> 3 files changed, 130 insertions(+), 0 deletions(-)
>>>>> I don't see the benefit of this generalization, the current structure where
>>>>> context information is hardcoded in the data transmitted works well.
>>>> 
>>>> Well, unfortunately it doesn't work quite as well for us because we
>>>> are a much more evolving platform. Also, there are a lot of edges
>>>> and corners of the architecture that simply aren't implemented in
>>>> KVM as of now. I want to have something extensible enough so we
>>>> don't break the ABI along the way.
>>> 
>>> You still have to agree on format between userspace and kernel, right?
>>> If either party fails to conform to that, you're doomed.
>> 
>> Yes, but we can shove registers back and forth without allocating 8kb of ram each time. If all we need to do is poke one register, we poke one register. If we poke 10, we poke the 10 we need to touch.
>> 
>>> The problem with two interfaces is potential ambiguity: is
>>> register X implemented through KVM_GET_ONE_REG and also through
>>> KVM_GET_XYZ_REGISTER_SET ? If its accessible by two interfaces, what is
>>> the register writeback order? Is there a plan to convert, etc.
>> 
>> Why writeback order? Register modification operations should always happen from the same thread the vCPU would run on at the end of the day, no?
> 
> Yes, but there is a specified order which the registers must be written
> back, in case there are dependencies between them (the QEMU x86's code
> does its best to document these dependencies).
> 
> All I'm saying is that two distinct interfaces make it potentially
> confusing for the programmer. That said, it's up to Avi to decide.

I still don't fully understand. You pass in a list of register modifications; the same would happen from guest code, which is also a stream of register modifications. Both should end up calling the same functions in the kernel, in the same order. If you call XYZ_REGISTER_SET and then GET_ONE_REG, you get the same result the guest would get.

If it's difficult to implement for specific registers then just don't implement those with the ONE_REG interface. You're not forced to implement all registers with either interface - it's mostly a nicely extensible interface for architectures that evolve quite a bit, with people implementing things only partially and then later realizing what's missing :). In other words, it should work great for us ppc folks and I'm fairly sure the ARM guys will appreciate it too. x86 is rather stable and well-explored, so I can see how it doesn't make sense to use it there.


Alex



* Re: [PATCH 09/14] KVM: PPC: Add generic single register ioctls
  2011-12-19 12:58               ` Alexander Graf
@ 2011-12-19 17:29                 ` Marcelo Tosatti
  -1 siblings, 0 replies; 82+ messages in thread
From: Marcelo Tosatti @ 2011-12-19 17:29 UTC (permalink / raw)
  To: Alexander Graf; +Cc: kvm-ppc, kvm list, Avi Kivity

On Mon, Dec 19, 2011 at 01:58:32PM +0100, Alexander Graf wrote:
> I still don't fully understand. You pass in a list of register modifications; the same would happen from guest code, which is also a stream of register modifications. Both should end up calling the same functions in the kernel, in the same order. If you call XYZ_REGISTER_SET and then GET_ONE_REG, you get the same result the guest would get.
> 
> If it's difficult to implement for specific registers then just don't implement those with the ONE_REG interface. You're not forced to implement all registers with either interface - it's mostly a nicely extensible interface for architectures that evolve quite a bit, with people implementing things only partially and then later realizing what's missing :). In other words, it should work great for us ppc folks and I'm fairly sure the ARM guys will appreciate it too. x86 is rather stable and well-explored, so I can see how it doesn't make sense to use it there.
> 

I was picturing a conversion of x86 to use that interface. My bad.



End of thread (newest message: 2011-12-19 17:29 UTC)

Thread overview: 82+ messages
2011-10-31  7:53 [PULL 00/14] ppc patch queue 2011-10-31 Alexander Graf
2011-10-31  7:53 ` [PATCH 01/14] KVM: PPC: e500: don't translate gfn to pfn with preemption disabled Alexander Graf
2011-10-31 12:50   ` Avi Kivity
2011-10-31 18:52     ` Scott Wood
2011-11-01  9:00       ` Avi Kivity
2011-10-31  7:53 ` [PATCH 02/14] KVM: PPC: e500: Eliminate preempt_disable in local_sid_destroy_all Alexander Graf
2011-10-31  7:53 ` [PATCH 03/14] KVM: PPC: e500: clear up confusion between host and guest entries Alexander Graf
2011-10-31  7:53 ` [PATCH 04/14] KVM: PPC: e500: MMU API Alexander Graf
2011-10-31 13:24   ` Avi Kivity
2011-10-31 20:12     ` Scott Wood
2011-11-01  8:58       ` Avi Kivity
2011-11-01  9:55         ` Avi Kivity
2011-11-01 16:16         ` Scott Wood
2011-11-02 10:33           ` Avi Kivity
2011-11-10 14:20           ` Alexander Graf
2011-11-10 14:16             ` Avi Kivity
2011-10-31  7:53 ` [PATCH 05/14] KVM: PPC: e500: tlbsx: fix tlb0 esel Alexander Graf
2011-10-31  7:53 ` [PATCH 06/14] KVM: PPC: e500: Don't hardcode PIR=0 Alexander Graf
2011-10-31 13:27   ` Avi Kivity
2011-10-31  7:53 ` [PATCH 07/14] KVM: PPC: Fix build failure with HV KVM and CBE Alexander Graf
2011-10-31  7:53 ` [PATCH 08/14] Revert "KVM: PPC: Add support for explicit HIOR setting" Alexander Graf
2011-10-31 13:30   ` Avi Kivity
2011-10-31 23:49     ` Alexander Graf
2011-10-31  7:53 ` [PATCH 09/14] KVM: PPC: Add generic single register ioctls Alexander Graf
2011-10-31 13:36   ` Avi Kivity
2011-10-31 17:26     ` Jan Kiszka
2011-11-10 14:22     ` Alexander Graf
2011-11-10 16:05   ` Marcelo Tosatti
2011-11-10 16:49     ` Alexander Graf
2011-11-10 17:35       ` Marcelo Tosatti
2011-11-15 23:45         ` Alexander Graf
2011-11-23 12:47           ` Marcelo Tosatti
2011-12-19 12:58             ` Alexander Graf
2011-12-19 17:29               ` Marcelo Tosatti
2011-10-31  7:53 ` [PATCH 10/14] KVM: PPC: Add support for explicit HIOR setting Alexander Graf
2011-10-31  7:53 ` [PATCH 11/14] KVM: PPC: Whitespace fix for kvm.h Alexander Graf
2011-10-31  7:53 ` [PATCH 12/14] KVM: Fix whitespace in kvm_para.h Alexander Graf
2011-10-31  7:53 ` [PATCH 13/14] KVM: PPC: E500: Support hugetlbfs Alexander Graf
2011-10-31 13:38   ` Avi Kivity
2011-11-10 14:24     ` Alexander Graf
2011-10-31  7:53 ` [PATCH 14/14] PPC: Fix race in mtmsr paravirt implementation Alexander Graf
