* [PATCH 0/5] KVM: PPC: Book3S HV: HPT read/write functions for userspace
From: Paul Mackerras @ 2012-10-16  3:58 UTC
  To: Alexander Graf; +Cc: kvm, kvm-ppc, David Gibson

This series of patches provides an interface by which userspace can
read and write the hashed page table (HPT) of a Book3S HV guest.
The interface is an ioctl that returns a file descriptor, which can
then be accessed with the read() and write() system calls.  The data read
and written is the guest view of the HPT, in which the second
doubleword of each HPTE (HPT entry) contains a guest physical address,
as distinct from the real HPT that the hardware accesses, where the
second doubleword of each HPTE contains a real address.

Because the HPT is divided into groups (HPTEGs) of 8 entries each,
and each HPTEG usually contains only a few valid entries, or none,
the data format run-length encodes the invalid entries, so invalid
entries take up no space in the stream.

The interface also provides for doing multiple passes over the HPT,
where the first pass provides information on all HPTEs, and subsequent
passes only return the HPTEs that have changed since the previous pass.
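
To illustrate, here is a rough sketch of how userspace might drive
the read side (hypothetical code, not part of this series; headers
and most error handling are elided, `vmfd' is assumed to be a VM file
descriptor, and the ioctl and record layout are the ones added in
patch 5/5):

	struct kvm_get_htab_fd ghf = { .flags = 0, .start_index = 0 };
	char buf[65536];
	ssize_t nb;
	int fd;

	fd = ioctl(vmfd, KVM_PPC_GET_HTAB_FD, &ghf);
	if (fd < 0)
		return -1;
	/*
	 * The first pass supplies all (or all bolted) HPTEs; later
	 * passes return only HPTEs modified since they were last read,
	 * and a read that finds no changes returns 0 bytes.
	 */
	while ((nb = read(fd, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (p + sizeof(struct kvm_get_htab_header) <= buf + nb) {
			struct kvm_get_htab_header *hdr = (void *)p;

			/* hdr->n_valid 16-byte HPTEs follow the header;
			   the hdr->n_invalid entries carry no data. */
			p += sizeof(*hdr) + hdr->n_valid * 16;
		}
	}
	close(fd);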

I have implemented a read/write interface rather than an mmap-based
interface because the data is not stored contiguously anywhere in
kernel memory.  Of each 16-byte HPTE, the first 8 bytes come from the
real HPT and the second 8 bytes come from the parallel vmalloc'd array
where we store the guest view of the guest physical address,
permissions, accessed/dirty bits, etc.  Thus an mmap-based interface
would not be practicable (not without doubling the size of the
parallel array, typically requiring an extra 8MB of kernel memory per
guest).  This is also why I have not used the memslot interface for
this.

This implements the interface for HV-style KVM but not for PR-style
KVM.  Userspace does not need any additional interface with PR-style
KVM because userspace maintains the guest HPT already in that case,
and has an image of the guest view of the HPT in its address space.

This series is against the next branch of the kvm tree plus my
recently-posted set of 8 patches ("Various Book3s HV fixes that
haven't been picked up yet").  The overall diffstat is:

 Documentation/virtual/kvm/api.txt        |   53 +++++
 arch/powerpc/include/asm/kvm.h           |   24 ++
 arch/powerpc/include/asm/kvm_book3s.h    |    8 +-
 arch/powerpc/include/asm/kvm_book3s_64.h |   24 ++
 arch/powerpc/include/asm/kvm_host.h      |    1 +
 arch/powerpc/include/asm/kvm_ppc.h       |    2 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  380 +++++++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_hv.c             |   12 -
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |   71 ++++--
 arch/powerpc/kvm/powerpc.c               |   17 ++
 include/linux/kvm.h                      |    3 +
 include/linux/kvm_host.h                 |   11 +-
 12 files changed, 559 insertions(+), 47 deletions(-)

* [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
From: Paul Mackerras @ 2012-10-16  3:59 UTC
  To: Alexander Graf; +Cc: kvm, kvm-ppc, David Gibson

The mmu_notifier_retry() function, used to test whether any page
invalidations are in progress, currently takes a vcpu pointer, though
the code only needs the VM's struct kvm pointer.  Forthcoming patches
to the powerpc Book3S HV code will need to test for retry within a VM
ioctl, where a struct kvm pointer is available but a struct vcpu
pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
implements mmu_notifier_retry in terms of it.
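
The usage pattern is unchanged; here is a sketch of how a caller that
only has a struct kvm pointer would use the new variant, following
the pattern of the existing mmu_notifier_retry() callers:

	unsigned long mmu_seq;

	mmu_seq = kvm->mmu_notifier_seq;
	smp_rmb();	/* order the seq read before the page lookup */

	/* ... look up the host page / Linux PTE ... */

	spin_lock(&kvm->mmu_lock);
	if (kvm_mmu_notifier_retry(kvm, mmu_seq)) {
		/* an invalidation intervened; start over */
		spin_unlock(&kvm->mmu_lock);
		goto retry;
	}
	/* safe to establish the mapping */
	spin_unlock(&kvm->mmu_lock);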

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 include/linux/kvm_host.h |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6afc5be..1cc1e1d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -841,9 +841,9 @@ extern struct kvm_stats_debugfs_item debugfs_entries[];
 extern struct dentry *kvm_debugfs_dir;
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
+static inline int kvm_mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
 {
-	if (unlikely(vcpu->kvm->mmu_notifier_count))
+	if (unlikely(kvm->mmu_notifier_count))
 		return 1;
 	/*
 	 * Ensure the read of mmu_notifier_count happens before the read
@@ -856,10 +856,15 @@ static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_se
 	 * can't rely on kvm->mmu_lock to keep things ordered.
 	 */
 	smp_rmb();
-	if (vcpu->kvm->mmu_notifier_seq != mmu_seq)
+	if (kvm->mmu_notifier_seq != mmu_seq)
 		return 1;
 	return 0;
 }
+
+static inline int mmu_notifier_retry(struct kvm_vcpu *vcpu, unsigned long mmu_seq)
+{
+	return kvm_mmu_notifier_retry(vcpu->kvm, mmu_seq);
+}
 #endif
 
 #ifdef KVM_CAP_IRQ_ROUTING
-- 
1.7.10.4

* [PATCH 2/5] KVM: PPC: Book3S HV: Restructure HPT entry creation code
From: Paul Mackerras @ 2012-10-16  4:00 UTC
  To: Alexander Graf; +Cc: kvm, kvm-ppc, David Gibson

This restructures the code that creates HPT (hashed page table)
entries so that it can be called in situations where we don't have a
struct vcpu pointer, only a struct kvm pointer.  It also fixes a bug
where kvmppc_map_vrma() would corrupt the guest R4 value.

Now, most of the work of kvmppc_virtmode_h_enter is done by a new
function, kvmppc_virtmode_do_h_enter, which itself calls another new
function, kvmppc_do_h_enter, which contains most of the old
kvmppc_h_enter.  The new kvmppc_do_h_enter takes explicit arguments
for the place to return the HPTE index, the Linux page tables to use,
and whether it is being called in real mode, thus removing the need
for it to have the vcpu as an argument.

Currently kvmppc_map_vrma creates the VRMA (virtual real mode area)
HPTEs by calling kvmppc_virtmode_h_enter, which is designed primarily
to handle H_ENTER hcalls from the guest that need to pin a page of
memory.  Since H_ENTER returns the index of the created HPTE in R4,
kvmppc_virtmode_h_enter updates the guest R4, corrupting it when it
gets called from kvmppc_map_vrma on the first KVM_RUN ioctl.  With
this patch, kvmppc_map_vrma instead calls
kvmppc_virtmode_do_h_enter with the address of a dummy word as the
place to store the HPTE index, thus avoiding corrupting the guest R4.
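
For reference, the resulting call structure is:

	guest H_ENTER, real mode:
		kvmppc_h_enter(vcpu)
		  -> kvmppc_do_h_enter(kvm, ..., vcpu->arch.pgdir,
				       realmode = true, &vcpu->arch.gpr[4])

	guest H_ENTER, virtual-mode fallback:
		kvmppc_virtmode_h_enter(vcpu)
		  -> kvmppc_virtmode_do_h_enter(vcpu->kvm, ...,
						&vcpu->arch.gpr[4])
		       -> kvmppc_do_h_enter(kvm, ..., current->mm->pgd,
					    realmode = false, ...)

	VRMA setup:
		kvmppc_map_vrma(vcpu)
		  -> kvmppc_virtmode_do_h_enter(kvm, ..., &idx_ret)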

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_book3s.h |    5 +++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   36 +++++++++++++++++++++++----------
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   27 ++++++++++++++++---------
 3 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index ab73800..199b7fd 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -157,8 +157,9 @@ extern void *kvmppc_pin_guest_page(struct kvm *kvm, unsigned long addr,
 extern void kvmppc_unpin_guest_page(struct kvm *kvm, void *addr);
 extern long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 			long pte_index, unsigned long pteh, unsigned long ptel);
-extern long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
-			long pte_index, unsigned long pteh, unsigned long ptel);
+extern long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
+			long pte_index, unsigned long pteh, unsigned long ptel,
+			pgd_t *pgdir, bool realmode, unsigned long *idx_ret);
 extern long kvmppc_hv_get_dirty_log(struct kvm *kvm,
 			struct kvm_memory_slot *memslot, unsigned long *map);
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7a4aae9..351f2ac 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -41,6 +41,10 @@
 /* Power architecture requires HPT is at least 256kB */
 #define PPC_MIN_HPT_ORDER	18
 
+static long kvmppc_virtmode_do_h_enter(struct kvm *kvm, unsigned long flags,
+				long pte_index, unsigned long pteh,
+				unsigned long ptel, unsigned long *pte_idx_ret);
+
 long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 {
 	unsigned long hpt;
@@ -185,6 +189,7 @@ void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
 	unsigned long addr, hash;
 	unsigned long psize;
 	unsigned long hp0, hp1;
+	unsigned long idx_ret;
 	long ret;
 	struct kvm *kvm = vcpu->kvm;
 
@@ -216,7 +221,8 @@ void kvmppc_map_vrma(struct kvm_vcpu *vcpu, struct kvm_memory_slot *memslot,
 		hash = (hash << 3) + 7;
 		hp_v = hp0 | ((addr >> 16) & ~0x7fUL);
 		hp_r = hp1 | addr;
-		ret = kvmppc_virtmode_h_enter(vcpu, H_EXACT, hash, hp_v, hp_r);
+		ret = kvmppc_virtmode_do_h_enter(kvm, H_EXACT, hash, hp_v, hp_r,
+						 &idx_ret);
 		if (ret != H_SUCCESS) {
 			pr_err("KVM: map_vrma at %lx failed, ret=%ld\n",
 			       addr, ret);
@@ -354,15 +360,10 @@ static long kvmppc_get_guest_page(struct kvm *kvm, unsigned long gfn,
 	return err;
 }
 
-/*
- * We come here on a H_ENTER call from the guest when we are not
- * using mmu notifiers and we don't have the requested page pinned
- * already.
- */
-long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
-			long pte_index, unsigned long pteh, unsigned long ptel)
+long kvmppc_virtmode_do_h_enter(struct kvm *kvm, unsigned long flags,
+				long pte_index, unsigned long pteh,
+				unsigned long ptel, unsigned long *pte_idx_ret)
 {
-	struct kvm *kvm = vcpu->kvm;
 	unsigned long psize, gpa, gfn;
 	struct kvm_memory_slot *memslot;
 	long ret;
@@ -390,8 +391,8 @@ long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
  do_insert:
 	/* Protect linux PTE lookup from page table destruction */
 	rcu_read_lock_sched();	/* this disables preemption too */
-	vcpu->arch.pgdir = current->mm->pgd;
-	ret = kvmppc_h_enter(vcpu, flags, pte_index, pteh, ptel);
+	ret = kvmppc_do_h_enter(kvm, flags, pte_index, pteh, ptel,
+				current->mm->pgd, false, pte_idx_ret);
 	rcu_read_unlock_sched();
 	if (ret == H_TOO_HARD) {
 		/* this can't happen */
@@ -402,6 +403,19 @@ long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 
 }
 
+/*
+ * We come here on a H_ENTER call from the guest when we are not
+ * using mmu notifiers and we don't have the requested page pinned
+ * already.
+ */
+long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
+			     long pte_index, unsigned long pteh,
+			     unsigned long ptel)
+{
+	return kvmppc_virtmode_do_h_enter(vcpu->kvm, flags, pte_index,
+					  pteh, ptel, &vcpu->arch.gpr[4]);
+}
+
 static struct kvmppc_slb *kvmppc_mmu_book3s_hv_find_slbe(struct kvm_vcpu *vcpu,
 							 gva_t eaddr)
 {
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9955216..3233587 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -103,14 +103,14 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 	unlock_rmap(rmap);
 }
 
-static pte_t lookup_linux_pte(struct kvm_vcpu *vcpu, unsigned long hva,
+static pte_t lookup_linux_pte(pgd_t *pgdir, unsigned long hva,
 			      int writing, unsigned long *pte_sizep)
 {
 	pte_t *ptep;
 	unsigned long ps = *pte_sizep;
 	unsigned int shift;
 
-	ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+	ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
 	if (!ptep)
 		return __pte(0);
 	if (shift)
@@ -130,10 +130,10 @@ static inline void unlock_hpte(unsigned long *hpte, unsigned long hpte_v)
 	hpte[0] = hpte_v;
 }
 
-long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
-		    long pte_index, unsigned long pteh, unsigned long ptel)
+long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
+		       long pte_index, unsigned long pteh, unsigned long ptel,
+		       pgd_t *pgdir, bool realmode, unsigned long *pte_idx_ret)
 {
-	struct kvm *kvm = vcpu->kvm;
 	unsigned long i, pa, gpa, gfn, psize;
 	unsigned long slot_fn, hva;
 	unsigned long *hpte;
@@ -147,7 +147,6 @@ long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 	unsigned int writing;
 	unsigned long mmu_seq;
 	unsigned long rcbits;
-	bool realmode = vcpu->arch.vcore->vcore_state == VCORE_RUNNING;
 
 	psize = hpte_page_size(pteh, ptel);
 	if (!psize)
@@ -201,7 +200,7 @@ long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 
 		/* Look up the Linux PTE for the backing page */
 		pte_size = psize;
-		pte = lookup_linux_pte(vcpu, hva, writing, &pte_size);
+		pte = lookup_linux_pte(pgdir, hva, writing, &pte_size);
 		if (pte_present(pte)) {
 			if (writing && !pte_write(pte))
 				/* make the actual HPTE be read-only */
@@ -210,6 +209,7 @@ long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 			pa = pte_pfn(pte) << PAGE_SHIFT;
 		}
 	}
+
 	if (pte_size < psize)
 		return H_PARAMETER;
 	if (pa && pte_size > psize)
@@ -297,7 +297,7 @@ long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 		lock_rmap(rmap);
 		/* Check for pending invalidations under the rmap chain lock */
 		if (kvm->arch.using_mmu_notifiers &&
-		    mmu_notifier_retry(vcpu, mmu_seq)) {
+		    kvm_mmu_notifier_retry(kvm, mmu_seq)) {
 			/* inval in progress, write a non-present HPTE */
 			pteh |= HPTE_V_ABSENT;
 			pteh &= ~HPTE_V_VALID;
@@ -318,10 +318,17 @@ long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 	hpte[0] = pteh;
 	asm volatile("ptesync" : : : "memory");
 
-	vcpu->arch.gpr[4] = pte_index;
+	*pte_idx_ret = pte_index;
 	return H_SUCCESS;
 }
-EXPORT_SYMBOL_GPL(kvmppc_h_enter);
+EXPORT_SYMBOL_GPL(kvmppc_do_h_enter);
+
+long kvmppc_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
+		    long pte_index, unsigned long pteh, unsigned long ptel)
+{
+	return kvmppc_do_h_enter(vcpu->kvm, flags, pte_index, pteh, ptel,
+				 vcpu->arch.pgdir, true, &vcpu->arch.gpr[4]);
+}
 
 #define LOCK_TOKEN	(*(u32 *)(&get_paca()->lock_token))
 
-- 
1.7.10.4


* [PATCH 3/5] KVM: PPC: Book3S HV: Add a mechanism for recording modified HPTEs
From: Paul Mackerras @ 2012-10-16  4:00 UTC
  To: Alexander Graf; +Cc: kvm, kvm-ppc, David Gibson

This uses a bit in our record of the guest view of the HPTE to record
when the HPTE gets modified.  We use a reserved bit for this, and ensure
that this bit is always cleared in HPTE values returned to the guest.
The recording of modified HPTEs is only done if other code indicates
its interest by setting kvm->arch.hpte_mod_interest to a non-zero value.
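
For reference, the pattern an interested consumer (such as the HPT
fd code in patch 5/5) follows looks roughly like this:

	/* declare interest before taking the first snapshot */
	atomic_inc(&kvm->arch.hpte_mod_interest);
	/* make sure all HPTE modifiers see the increment before we
	   start reading (and clearing) the modified bits */
	synchronize_srcu_expedited(&kvm->srcu);

	/* ... read HPTEs, clearing HPTE_GR_MODIFIED as they are
	       reported ... */

	/* withdraw interest when done */
	atomic_dec(&kvm->arch.hpte_mod_interest);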

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_book3s_64.h |    6 ++++++
 arch/powerpc/include/asm/kvm_host.h      |    1 +
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |   25 ++++++++++++++++++++++---
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 1472a5b..4ca4f25 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -50,6 +50,12 @@ extern int kvm_hpt_order;		/* order of preallocated HPTs */
 #define HPTE_V_HVLOCK	0x40UL
 #define HPTE_V_ABSENT	0x20UL
 
+/*
+ * We use this bit in the guest_rpte field of the revmap entry
+ * to indicate a modified HPTE.
+ */
+#define HPTE_GR_MODIFIED	(1ul << 62)
+
 static inline long try_lock_hpte(unsigned long *hpte, unsigned long bits)
 {
 	unsigned long tmp, old;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 3093896..58c7264 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -248,6 +248,7 @@ struct kvm_arch {
 	atomic_t vcpus_running;
 	unsigned long hpt_npte;
 	unsigned long hpt_mask;
+	atomic_t hpte_mod_interest;
 	spinlock_t slot_phys_lock;
 	unsigned short last_vcpu[NR_CPUS];
 	struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 3233587..c83c0ca 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -66,6 +66,18 @@ void kvmppc_add_revmap_chain(struct kvm *kvm, struct revmap_entry *rev,
 }
 EXPORT_SYMBOL_GPL(kvmppc_add_revmap_chain);
 
+/*
+ * Note modification of an HPTE; set the HPTE modified bit
+ * if it wasn't modified before and anyone is interested.
+ */
+static inline void note_hpte_modification(struct kvm *kvm,
+					  struct revmap_entry *rev)
+{
+	if (!(rev->guest_rpte & HPTE_GR_MODIFIED) &&
+	    atomic_read(&kvm->arch.hpte_mod_interest))
+		rev->guest_rpte |= HPTE_GR_MODIFIED;
+}
+
 /* Remove this HPTE from the chain for a real page */
 static void remove_revmap_chain(struct kvm *kvm, long pte_index,
 				struct revmap_entry *rev,
@@ -287,8 +299,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	rev = &kvm->arch.revmap[pte_index];
 	if (realmode)
 		rev = real_vmalloc_addr(rev);
-	if (rev)
+	if (rev) {
 		rev->guest_rpte = g_ptel;
+		note_hpte_modification(kvm, rev);
+	}
 
 	/* Link HPTE into reverse-map chain */
 	if (pteh & HPTE_V_VALID) {
@@ -392,7 +406,8 @@ long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
 		/* Read PTE low word after tlbie to get final R/C values */
 		remove_revmap_chain(kvm, pte_index, rev, v, hpte[1]);
 	}
-	r = rev->guest_rpte;
+	r = rev->guest_rpte & ~HPTE_GR_MODIFIED;
+	note_hpte_modification(kvm, rev);
 	unlock_hpte(hpte, 0);
 
 	vcpu->arch.gpr[4] = v;
@@ -466,6 +481,7 @@ long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
 
 			args[j] = ((0x80 | flags) << 56) + pte_index;
 			rev = real_vmalloc_addr(&kvm->arch.revmap[pte_index]);
+			note_hpte_modification(kvm, rev);
 
 			if (!(hp[0] & HPTE_V_VALID)) {
 				/* insert R and C bits from PTE */
@@ -555,6 +571,7 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
 	if (rev) {
 		r = (rev->guest_rpte & ~mask) | bits;
 		rev->guest_rpte = r;
+		note_hpte_modification(kvm, rev);
 	}
 	r = (hpte[1] & ~mask) | bits;
 
@@ -606,8 +623,10 @@ long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long flags,
 			v &= ~HPTE_V_ABSENT;
 			v |= HPTE_V_VALID;
 		}
-		if (v & HPTE_V_VALID)
+		if (v & HPTE_V_VALID) {
 			r = rev[i].guest_rpte | (r & (HPTE_R_R | HPTE_R_C));
+			r &= ~HPTE_GR_MODIFIED;
+		}
 		vcpu->arch.gpr[4 + i * 2] = v;
 		vcpu->arch.gpr[5 + i * 2] = r;
 	}
-- 
1.7.10.4


* [PATCH 4/5] KVM: PPC: Book3S HV: Make a HPTE removal function available
From: Paul Mackerras @ 2012-10-16  4:01 UTC
  To: Alexander Graf; +Cc: kvm, kvm-ppc, David Gibson

This makes an HPTE removal function, kvmppc_do_h_remove(), available
outside book3s_hv_rm_mmu.c.  This will be used by the HPT writing
code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_book3s.h |    3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   19 +++++++++++++------
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 199b7fd..4ac1c67 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -160,6 +160,9 @@ extern long kvmppc_virtmode_h_enter(struct kvm_vcpu *vcpu, unsigned long flags,
 extern long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 			long pte_index, unsigned long pteh, unsigned long ptel,
 			pgd_t *pgdir, bool realmode, unsigned long *idx_ret);
+extern long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
+			unsigned long pte_index, unsigned long avpn,
+			unsigned long *hpret);
 extern long kvmppc_hv_get_dirty_log(struct kvm *kvm,
 			struct kvm_memory_slot *memslot, unsigned long *map);
 
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index c83c0ca..505548a 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -364,11 +364,10 @@ static inline int try_lock_tlbie(unsigned int *lock)
 	return old == 0;
 }
 
-long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
-		     unsigned long pte_index, unsigned long avpn,
-		     unsigned long va)
+long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
+			unsigned long pte_index, unsigned long avpn,
+			unsigned long *hpret)
 {
-	struct kvm *kvm = vcpu->kvm;
 	unsigned long *hpte;
 	unsigned long v, r, rb;
 	struct revmap_entry *rev;
@@ -410,10 +409,18 @@ long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
 	note_hpte_modification(kvm, rev);
 	unlock_hpte(hpte, 0);
 
-	vcpu->arch.gpr[4] = v;
-	vcpu->arch.gpr[5] = r;
+	hpret[0] = v;
+	hpret[1] = r;
 	return H_SUCCESS;
 }
+EXPORT_SYMBOL_GPL(kvmppc_do_h_remove);
+
+long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long flags,
+		     unsigned long pte_index, unsigned long avpn)
+{
+	return kvmppc_do_h_remove(vcpu->kvm, flags, pte_index, avpn,
+				  &vcpu->arch.gpr[4]);
+}
 
 long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
 {
-- 
1.7.10.4


* [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
From: Paul Mackerras @ 2012-10-16  4:01 UTC
  To: Alexander Graf; +Cc: kvm, kvm-ppc, David Gibson

A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
this fd return the contents of the HPT (hashed page table), while
writes create and/or remove entries in the HPT.  There is a new capability,
KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
takes an argument structure with the index of the first HPT entry to
read out and a set of flags.  The flags indicate whether the user is
intending to read or write the HPT, and whether to return all entries
or only the "bolted" entries (those with the bolted bit, 0x10, set in
the first doubleword).

This is intended for use in implementing qemu's savevm/loadvm and for
live migration.  Therefore, on reads, the first pass returns information
about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
end of the HPT, it returns from the read.  Subsequent reads only return
information about HPTEs that have changed since they were last read.
A read that finds no changed HPTEs in the HPT following where the last
read finished will return 0 bytes.
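
For the write (restore) direction, a rough sketch of the intended
userspace usage (hypothetical code; error handling is elided, `vmfd'
is the destination VM file descriptor, and buf/len hold a stream
previously saved via the read side):

	struct kvm_get_htab_fd ghf = {
		.flags = KVM_GET_HTAB_WRITE,
		.start_index = 0,	/* ignored for writes */
	};
	int fd = ioctl(vmfd, KVM_PPC_GET_HTAB_FD, &ghf);

	/*
	 * Each record is a kvm_get_htab_header followed by n_valid
	 * HPTEs to create starting at hdr->index, then n_invalid
	 * entries that the kernel will ensure are invalid.
	 */
	if (write(fd, buf, len) != (ssize_t)len)
		return -1;
	close(fd);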

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 Documentation/virtual/kvm/api.txt        |   53 +++++
 arch/powerpc/include/asm/kvm.h           |   24 +++
 arch/powerpc/include/asm/kvm_book3s_64.h |   18 ++
 arch/powerpc/include/asm/kvm_ppc.h       |    2 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c      |  344 ++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv.c             |   12 --
 arch/powerpc/kvm/powerpc.c               |   17 ++
 include/linux/kvm.h                      |    3 +
 8 files changed, 461 insertions(+), 12 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 4258180..8df3e53 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2071,6 +2071,59 @@ KVM_S390_INT_EXTERNAL_CALL (vcpu) - sigp external call; source cpu in parm
 
 Note that the vcpu ioctl is asynchronous to vcpu execution.
 
+4.78 KVM_PPC_GET_HTAB_FD
+
+Capability: KVM_CAP_PPC_HTAB_FD
+Architectures: powerpc
+Type: vm ioctl
+Parameters: Pointer to struct kvm_get_htab_fd (in)
+Returns: file descriptor number (>= 0) on success, -1 on error
+
+This returns a file descriptor that can be used either to read out the
+entries in the guest's hashed page table (HPT), or to write entries to
+initialize the HPT.  The returned fd can only be written to if the
+KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
+can only be read if that bit is clear.  The argument struct looks like
+this:
+
+/* For KVM_PPC_GET_HTAB_FD */
+struct kvm_get_htab_fd {
+	__u64	flags;
+	__u64	start_index;
+};
+
+/* Values for kvm_get_htab_fd.flags */
+#define KVM_GET_HTAB_BOLTED_ONLY	((__u64)0x1)
+#define KVM_GET_HTAB_WRITE		((__u64)0x2)
+
+The `start_index' field gives the index in the HPT of the entry at
+which to start reading.  It is ignored when writing.
+
+Reads on the fd will initially supply information about all
+"interesting" HPT entries.  Interesting entries are those with the
+bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
+all entries.  When the end of the HPT is reached, the read() will
+return.  If read() is called again on the fd, it will start again from
+the beginning of the HPT, but will only return HPT entries that have
+changed since they were last read.
+
+Data read or written is structured as a header (8 bytes) followed by a
+series of valid HPT entries (16 bytes each).  The header indicates how
+many valid HPT entries there are and how many invalid entries follow
+the valid entries.  The invalid entries are not represented explicitly
+in the stream.  The header format is:
+
+struct kvm_get_htab_header {
+	__u32	index;
+	__u16	n_valid;
+	__u16	n_invalid;
+};
+
+Writes to the fd create HPT entries starting at the index given in the
+header; first `n_valid' valid entries with contents from the data
+written, then `n_invalid' invalid entries, invalidating any previously
+valid entries found.
+
 
 5. The kvm_run structure
 ------------------------
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index b89ae4d..6518e38 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -331,6 +331,30 @@ struct kvm_book3e_206_tlb_params {
 	__u32 reserved[8];
 };
 
+/* For KVM_PPC_GET_HTAB_FD */
+struct kvm_get_htab_fd {
+	__u64	flags;
+	__u64	start_index;
+};
+
+/* Values for kvm_get_htab_fd.flags */
+#define KVM_GET_HTAB_BOLTED_ONLY	((__u64)0x1)
+#define KVM_GET_HTAB_WRITE		((__u64)0x2)
+
+/*
+ * Data read on the file descriptor is formatted as a series of
+ * records, each consisting of a header followed by a series of
+ * `n_valid' HPTEs (16 bytes each), which are all valid.  Following 
+ * those valid HPTEs there are `n_invalid' invalid HPTEs, which
+ * are not represented explicitly in the stream.  The same format
+ * is used for writing.
+ */
+struct kvm_get_htab_header {
+	__u32	index;
+	__u16	n_valid;
+	__u16	n_invalid;
+};
+
 #define KVM_REG_PPC_HIOR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x1)
 #define KVM_REG_PPC_IAC1	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x2)
 #define KVM_REG_PPC_IAC2	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x3)
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 4ca4f25..dc0a78d 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -243,4 +243,22 @@ static inline bool slot_is_aligned(struct kvm_memory_slot *memslot,
 	return !(memslot->base_gfn & mask) && !(memslot->npages & mask);
 }
 
+static inline unsigned long slb_pgsize_encoding(unsigned long psize)
+{
+	unsigned long senc = 0;
+
+	if (psize > 0x1000) {
+		senc = SLB_VSID_L;
+		if (psize == 0x10000)
+			senc |= SLB_VSID_LP_01;
+	}
+	return senc;
+}
+
+static inline int is_vrma_hpte(unsigned long hpte_v)
+{
+	return (hpte_v & ~0xffffffUL) ==
+		(HPTE_V_1TB_SEG | (VRMA_VSID << (40 - 16)));
+}
+
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 609cca3..1ca31e9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern void kvmppc_bookehv_exit(void);
 
 extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
+extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 351f2ac..3c47f61 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -25,6 +25,8 @@
 #include <linux/hugetlb.h>
 #include <linux/vmalloc.h>
 #include <linux/srcu.h>
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
 
 #include <asm/tlbflush.h>
 #include <asm/kvm_ppc.h>
@@ -1145,6 +1147,348 @@ void kvmppc_unpin_guest_page(struct kvm *kvm, void *va)
 	put_page(page);
 }
 
+/*
+ * Functions for reading and writing the hash table via reads and
+ * writes on a file descriptor.
+ *
+ * Reads return the guest view of the hash table, which has to be
+ * pieced together from the real hash table and the guest_rpte
+ * values in the revmap array.
+ *
+ * On writes, each HPTE written is considered in turn, and if it
+ * is valid, it is written to the HPT as if an H_ENTER with the
+ * exact flag set was done.  When the invalid count is non-zero
+ * in the header written to the stream, the kernel will make
+ * sure that that many HPTEs are invalid, and invalidate them
+ * if not.
+ */
+
+struct kvm_htab_ctx {
+	unsigned long	index;
+	unsigned long	flags;
+	struct kvm	*kvm;
+	int		first_pass;
+};
+
+#define HPTE_SIZE	(2 * sizeof(unsigned long))
+
+static long record_hpte(unsigned long flags, unsigned long *hptp,
+			unsigned long *hpte, struct revmap_entry *revp,
+			int want_valid, int first_pass)
+{
+	unsigned long v, r;
+	int ok = 1;
+	int valid, dirty;
+
+	/* Unmodified entries are uninteresting except on the first pass */
+	dirty = !!(revp->guest_rpte & HPTE_GR_MODIFIED);
+	if (!first_pass && !dirty)
+		return 0;
+
+	valid = 0;
+	if (hptp[0] & (HPTE_V_VALID | HPTE_V_ABSENT)) {
+		valid = 1;
+		if ((flags & KVM_GET_HTAB_BOLTED_ONLY) &&
+		    !(hptp[0] & HPTE_V_BOLTED))
+			valid = 0;
+	}
+	if (valid != want_valid)
+		return 0;
+
+	v = r = 0;
+	if (valid || dirty) {
+		/* lock the HPTE so it's stable and read it */
+		preempt_disable();
+		while (!try_lock_hpte(hptp, HPTE_V_HVLOCK))
+			cpu_relax();
+		v = hptp[0];
+		if (v & HPTE_V_ABSENT) {
+			v &= ~HPTE_V_ABSENT;
+			v |= HPTE_V_VALID;
+		}
+		/* re-evaluate valid and dirty from synchronized HPTE value */
+		valid = !!(v & HPTE_V_VALID);
+		if ((flags & KVM_GET_HTAB_BOLTED_ONLY) && !(v & HPTE_V_BOLTED))
+			valid = 0;
+		r = revp->guest_rpte | (hptp[1] & (HPTE_R_R | HPTE_R_C));
+		dirty = !!(revp->guest_rpte & HPTE_GR_MODIFIED);
+		/* only clear modified if this is the right sort of entry */
+		if (valid == want_valid && dirty) {
+			r &= ~HPTE_GR_MODIFIED;
+			revp->guest_rpte = r;
+		}
+		asm volatile(PPC_RELEASE_BARRIER "" : : : "memory");
+		hptp[0] &= ~HPTE_V_HVLOCK;
+		preempt_enable();
+		if (!(valid == want_valid && (first_pass || dirty)))
+			ok = 0;
+	}
+	hpte[0] = v;
+	hpte[1] = r;
+	return ok;
+}
+
+static ssize_t kvm_htab_read(struct file *file, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	struct kvm_htab_ctx *ctx = file->private_data;
+	struct kvm *kvm = ctx->kvm;
+	struct kvm_get_htab_header hdr;
+	unsigned long *hptp;
+	struct revmap_entry *revp;
+	unsigned long i, nb, nw;
+	unsigned long __user *lbuf;
+	struct kvm_get_htab_header __user *hptr;
+	unsigned long flags;
+	int first_pass;
+	unsigned long hpte[2];
+
+	if (!access_ok(VERIFY_WRITE, buf, count))
+		return -EFAULT;
+
+	first_pass = ctx->first_pass;
+	flags = ctx->flags;
+
+	i = ctx->index;
+	hptp = (unsigned long *)(kvm->arch.hpt_virt + (i * HPTE_SIZE));
+	revp = kvm->arch.revmap + i;
+	lbuf = (unsigned long __user *)buf;
+
+	nb = 0;
+	while (nb + sizeof(hdr) + HPTE_SIZE < count) {
+		/* Initialize header */
+		hptr = (struct kvm_get_htab_header __user *)buf;
+		hdr.index = i;
+		hdr.n_valid = 0;
+		hdr.n_invalid = 0;
+		nw = nb;
+		nb += sizeof(hdr);
+		lbuf = (unsigned long __user *)(buf + sizeof(hdr));
+
+		/* Skip uninteresting entries, i.e. clean on not-first pass */
+		if (!first_pass) {
+			while (i < kvm->arch.hpt_npte &&
+			       !(revp->guest_rpte & HPTE_GR_MODIFIED)) {
+				++i;
+				hptp += 2;
+				++revp;
+			}
+		}
+
+		/* Grab a series of valid entries */
+		while (i < kvm->arch.hpt_npte &&
+		       hdr.n_valid < 0xffff &&
+		       nb + HPTE_SIZE < count &&
+		       record_hpte(flags, hptp, hpte, revp, 1, first_pass)) {
+			/* valid entry, write it out */
+			++hdr.n_valid;
+			if (__put_user(hpte[0], lbuf) ||
+			    __put_user(hpte[1], lbuf + 1))
+				return -EFAULT;
+			nb += HPTE_SIZE;
+			lbuf += 2;
+			++i;
+			hptp += 2;
+			++revp;
+		}
+		/* Now skip invalid entries while we can */
+		while (i < kvm->arch.hpt_npte &&
+		       hdr.n_invalid < 0xffff &&
+		       record_hpte(flags, hptp, hpte, revp, 0, first_pass)) {
+			/* found an invalid entry */
+			++hdr.n_invalid;
+			++i;
+			hptp += 2;
+			++revp;
+		}
+ 
+		if (hdr.n_valid || hdr.n_invalid) {
+			/* write back the header */
+			if (__copy_to_user(hptr, &hdr, sizeof(hdr)))
+				return -EFAULT;
+			nw = nb;
+			buf = (char __user *)lbuf;
+		} else {
+			nb = nw;
+		}
+
+		/* Check if we've wrapped around the hash table */
+		if (i >= kvm->arch.hpt_npte) {
+			i = 0;
+			ctx->first_pass = 0;
+			break;
+		}
+	}
+
+	ctx->index = i;
+
+	return nb;
+}
+
+static ssize_t kvm_htab_write(struct file *file, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct kvm_htab_ctx *ctx = file->private_data;
+	struct kvm *kvm = ctx->kvm;
+	struct kvm_get_htab_header hdr;
+	unsigned long i, j;
+	unsigned long v, r;
+	unsigned long __user *lbuf;
+	unsigned long *hptp;
+	unsigned long tmp[2];
+	ssize_t nb;
+	long int err, ret;
+	int rma_setup;
+
+	if (!access_ok(VERIFY_READ, buf, count))
+		return -EFAULT;
+
+	/* lock out vcpus from running while we're doing this */
+	mutex_lock(&kvm->lock);
+	rma_setup = kvm->arch.rma_setup_done;
+	if (rma_setup) {
+		kvm->arch.rma_setup_done = 0;	/* temporarily */
+		/* order rma_setup_done vs. vcpus_running */
+		smp_mb();
+		if (atomic_read(&kvm->arch.vcpus_running)) {
+			kvm->arch.rma_setup_done = 1;
+			mutex_unlock(&kvm->lock);
+			return -EBUSY;
+		}
+	}
+
+	err = 0;
+	for (nb = 0; nb + sizeof(hdr) <= count; ) {
+		err = -EFAULT;
+		if (__copy_from_user(&hdr, buf, sizeof(hdr)))
+			break;
+
+		err = 0;
+		if (nb + hdr.n_valid * HPTE_SIZE > count)
+			break;
+
+		nb += sizeof(hdr);
+		buf += sizeof(hdr);
+
+		err = -EINVAL;
+		i = hdr.index;
+		if (i >= kvm->arch.hpt_npte ||
+		    i + hdr.n_valid + hdr.n_invalid > kvm->arch.hpt_npte)
+			break;
+
+		hptp = (unsigned long *)(kvm->arch.hpt_virt + (i * HPTE_SIZE));
+		lbuf = (unsigned long __user *)buf;
+		for (j = 0; j < hdr.n_valid; ++j) {
+			err = -EFAULT;
+			if (__get_user(v, lbuf) || __get_user(r, lbuf + 1))
+				goto out;
+			err = -EINVAL;
+			if (!(v & HPTE_V_VALID))
+				goto out;
+			lbuf += 2;
+			nb += HPTE_SIZE;
+
+			if (hptp[0] & (HPTE_V_VALID | HPTE_V_ABSENT))
+				kvmppc_do_h_remove(kvm, 0, i, 0, tmp);
+			err = -EIO;
+			ret = kvmppc_virtmode_do_h_enter(kvm, H_EXACT, i, v, r,
+							 tmp);
+			if (ret != H_SUCCESS) {
+				pr_err("kvm_htab_write ret %ld i=%ld v=%lx "
+				       "r=%lx\n", ret, i, v, r);
+				goto out;
+			}
+			if (!rma_setup && is_vrma_hpte(v)) {
+				unsigned long psize = hpte_page_size(v, r);
+				unsigned long senc = slb_pgsize_encoding(psize);
+				unsigned long lpcr;
+
+				kvm->arch.vrma_slb_v = senc | SLB_VSID_B_1T |
+					(VRMA_VSID << SLB_VSID_SHIFT_1T);
+				lpcr = kvm->arch.lpcr & ~LPCR_VRMASD;
+				lpcr |= senc << (LPCR_VRMASD_SH - 4);
+				kvm->arch.lpcr = lpcr;
+				rma_setup = 1;
+			}
+			++i;
+			hptp += 2;
+		}
+
+		for (j = 0; j < hdr.n_invalid; ++j) {
+			if (hptp[0] & (HPTE_V_VALID | HPTE_V_ABSENT))
+				kvmppc_do_h_remove(kvm, 0, i, 0, tmp);
+			++i;
+			hptp += 2;
+		}
+		err = 0;
+	}
+
+ out:
+	/* Order HPTE updates vs. rma_setup_done */
+	smp_wmb();
+	kvm->arch.rma_setup_done = rma_setup;
+	mutex_unlock(&kvm->lock);
+
+	if (err)
+		return err;
+	return nb;
+}
+
+static int kvm_htab_release(struct inode *inode, struct file *filp)
+{
+	struct kvm_htab_ctx *ctx = filp->private_data;
+
+	filp->private_data = NULL;
+	if (!(ctx->flags & KVM_GET_HTAB_WRITE))
+		atomic_dec(&ctx->kvm->arch.hpte_mod_interest);
+	kvm_put_kvm(ctx->kvm);
+	kfree(ctx);
+	return 0;
+}
+
+static struct file_operations kvm_htab_fops = {
+	.read		= kvm_htab_read,
+	.write		= kvm_htab_write,
+	.llseek		= default_llseek,
+	.release	= kvm_htab_release,
+};
+
+int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *ghf)
+{
+	int ret;
+	struct kvm_htab_ctx *ctx;
+	int rwflag;
+
+	/* reject flags we don't recognize */
+	if (ghf->flags & ~(KVM_GET_HTAB_BOLTED_ONLY | KVM_GET_HTAB_WRITE))
+		return -EINVAL;
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+	kvm_get_kvm(kvm);
+	ctx->kvm = kvm;
+	ctx->index = ghf->start_index;
+	ctx->flags = ghf->flags;
+	ctx->first_pass = 1;
+
+	rwflag = (ghf->flags & KVM_GET_HTAB_WRITE) ? O_WRONLY : O_RDONLY;
+	ret = anon_inode_getfd("kvm-htab", &kvm_htab_fops, ctx, rwflag);
+	if (ret < 0) {
+		kvm_put_kvm(kvm);
+		return ret;
+	}
+
+	if (rwflag == O_RDONLY) {
+		mutex_lock(&kvm->slots_lock);
+		atomic_inc(&kvm->arch.hpte_mod_interest);
+		/* make sure kvmppc_do_h_enter etc. see the increment */
+		synchronize_srcu_expedited(&kvm->srcu);
+		mutex_unlock(&kvm->slots_lock);
+	}
+
+	return ret;
+}
+
 void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_mmu *mmu = &vcpu->arch.mmu;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 812764c..df5ae3e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1563,18 +1563,6 @@ out:
 	return r;
 }
 
-static unsigned long slb_pgsize_encoding(unsigned long psize)
-{
-	unsigned long senc = 0;
-
-	if (psize > 0x1000) {
-		senc = SLB_VSID_L;
-		if (psize == 0x10000)
-			senc |= SLB_VSID_LP_01;
-	}
-	return senc;
-}
-
 static void unpin_slot(struct kvm_memory_slot *memslot)
 {
 	unsigned long *physp;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index deb0d59..3aa52d9 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -353,6 +353,12 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = 1;
 #else
 		r = 0;
+		break;
+#endif
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+	case KVM_CAP_PPC_HTAB_FD:
+		r = 1;
+		break;
 #endif
 		break;
 	case KVM_CAP_NR_VCPUS:
@@ -934,6 +940,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+
+	case KVM_PPC_GET_HTAB_FD: {
+		struct kvm *kvm = filp->private_data;
+		struct kvm_get_htab_fd ghf;
+
+		r = -EFAULT;
+		if (copy_from_user(&ghf, argp, sizeof(ghf)))
+			break;
+		r = kvm_vm_ioctl_get_htab_fd(kvm, &ghf);
+		break;
+	}
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 65ad5c6..f1427c3 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -630,6 +630,7 @@ struct kvm_ppc_smmu_info {
 #endif
 #define KVM_CAP_IRQFD_RESAMPLE 82
 #define KVM_CAP_PPC_BOOKE_WATCHDOG 83
+#define KVM_CAP_PPC_HTAB_FD 84
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -855,6 +856,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_CREATE_SPAPR_TCE	  _IOW(KVMIO,  0xa8, struct kvm_create_spapr_tce)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
+/* Available with KVM_CAP_PPC_HTAB_FD */
+#define KVM_PPC_GET_HTAB_FD	  _IOW(KVMIO,  0xaa, struct kvm_get_htab_fd)
 
 /*
  * ioctls for vcpu fds
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
  2012-10-16  3:59   ` Paul Mackerras
@ 2012-10-16  9:44     ` Avi Kivity
  -1 siblings, 0 replies; 30+ messages in thread
From: Avi Kivity @ 2012-10-16  9:44 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Alexander Graf, kvm, kvm-ppc, David Gibson

On 10/16/2012 05:59 AM, Paul Mackerras wrote:
> The mmu_notifier_retry() function, used to test whether any page
> invalidations are in progress, currently takes a vcpu pointer, though
> the code only needs the VM's struct kvm pointer.  Forthcoming patches
> to the powerpc Book3S HV code will need to test for retry within a VM
> ioctl, where a struct kvm pointer is available but a struct vcpu
> pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
> called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
> implements mmu_notifier_retry in terms of it.

Why not change mmu_notifier_retry() and all its callers?


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm
  2012-10-16  9:44     ` Avi Kivity
@ 2012-10-16 10:06       ` Alexander Graf
  -1 siblings, 0 replies; 30+ messages in thread
From: Alexander Graf @ 2012-10-16 10:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Paul Mackerras, kvm, kvm-ppc, David Gibson


On 16.10.2012, at 11:44, Avi Kivity wrote:

> On 10/16/2012 05:59 AM, Paul Mackerras wrote:
>> The mmu_notifier_retry() function, used to test whether any page
>> invalidations are in progress, currently takes a vcpu pointer, though
>> the code only needs the VM's struct kvm pointer.  Forthcoming patches
>> to the powerpc Book3S HV code will need to test for retry within a VM
>> ioctl, where a struct kvm pointer is available but a struct vcpu
>> pointer isn't.  Therefore this creates a variant of mmu_notifier_retry
>> called kvm_mmu_notifier_retry that takes a struct kvm pointer, and
>> implements mmu_notifier_retry in terms of it.
> 
> Why not change mmu_notifier_retry() and all its callers?

Why not use Christoffer's patch? :)

Alex

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
  2012-10-16  4:01   ` Paul Mackerras
@ 2012-10-16 10:06     ` Avi Kivity
  -1 siblings, 0 replies; 30+ messages in thread
From: Avi Kivity @ 2012-10-16 10:06 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alexander Graf, kvm, kvm-ppc, David Gibson, Juan Quintela,
	Orit Wasserman, Anthony Liguori

On 10/16/2012 06:01 AM, Paul Mackerras wrote:
> A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
> this fd return the contents of the HPT (hashed page table), writes
> create and/or remove entries in the HPT.  There is a new capability,
> KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
> takes an argument structure with the index of the first HPT entry to
> read out and a set of flags.  The flags indicate whether the user is
> intending to read or write the HPT, and whether to return all entries
> or only the "bolted" entries (those with the bolted bit, 0x10, set in
> the first doubleword).
> 
> This is intended for use in implementing qemu's savevm/loadvm and for
> live migration.  Therefore, on reads, the first pass returns information
> about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
> end of the HPT, it returns from the read.  Subsequent reads only return
> information about HPTEs that have changed since they were last read.
> A read that finds no changed HPTEs in the HPT following where the last
> read finished will return 0 bytes.

Copying people with interest in migration.

> +4.78 KVM_PPC_GET_HTAB_FD
> +
> +Capability: KVM_CAP_PPC_HTAB_FD
> +Architectures: powerpc
> +Type: vm ioctl
> +Parameters: Pointer to struct kvm_get_htab_fd (in)
> +Returns: file descriptor number (>= 0) on success, -1 on error
> +
> +This returns a file descriptor that can be used either to read out the
> +entries in the guest's hashed page table (HPT), or to write entries to
> +initialize the HPT.  The returned fd can only be written to if the
> +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
> +can only be read if that bit is clear.  The argument struct looks like
> +this:
> +
> +/* For KVM_PPC_GET_HTAB_FD */
> +struct kvm_get_htab_fd {
> +	__u64	flags;
> +	__u64	start_index;
> +};
> +
> +/* Values for kvm_get_htab_fd.flags */
> +#define KVM_GET_HTAB_BOLTED_ONLY	((__u64)0x1)
> +#define KVM_GET_HTAB_WRITE		((__u64)0x2)
> +
> +The `start_index' field gives the index in the HPT of the entry at
> +which to start reading.  It is ignored when writing.
> +
> +Reads on the fd will initially supply information about all
> +"interesting" HPT entries.  Interesting entries are those with the
> +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
> +all entries.  When the end of the HPT is reached, the read() will
> +return.  

What happens if the read buffer is smaller than the HPT size?

What happens if the read buffer size is not a multiple of entry size?

Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
to change).

> If read() is called again on the fd, it will start again from
> +the beginning of the HPT, but will only return HPT entries that have
> +changed since they were last read.
> +
> +Data read or written is structured as a header (8 bytes) followed by a
> +series of valid HPT entries (16 bytes) each.  The header indicates how
> +many valid HPT entries there are and how many invalid entries follow
> +the valid entries.  The invalid entries are not represented explicitly
> +in the stream.  The header format is:
> +
> +struct kvm_get_htab_header {
> +	__u32	index;
> +	__u16	n_valid;
> +	__u16	n_invalid;
> +};

This structure forces the kernel to return entries sequentially.  Will
this block changing the data structure in the future?  Or is the
hardware spec sufficiently strict that such changes are not realistic?

> +
> +Writes to the fd create HPT entries starting at the index given in the
> +header; first `n_valid' valid entries with contents from the data
> +written, then `n_invalid' invalid entries, invalidating any previously
> +valid entries found.

This scheme is a clever, original, and very interesting approach to live
migration.  That doesn't necessarily mean a NAK, we should see if it
makes sense for other migration APIs as well (we currently have
difficulties migrating very large/wide guests).

What is the typical number of entries in the HPT?  Do you have estimates
of the change rate?

Suppose new hardware arrives that supports nesting HPTs, so that kvm is
no longer synchronously aware of the guest HPT (similar to how NPT/EPT
made kvm unaware of guest virtual->physical translations on x86).  How
will we deal with that?  (But I guess this will be a
non-guest-transparent and non-userspace-transparent change, unlike
NPT/EPT, so a userspace ABI addition will be needed anyway).


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
  2012-10-16 10:06     ` Avi Kivity
@ 2012-10-16 11:58       ` Paul Mackerras
  -1 siblings, 0 replies; 30+ messages in thread
From: Paul Mackerras @ 2012-10-16 11:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alexander Graf, kvm, kvm-ppc, David Gibson, Juan Quintela,
	Orit Wasserman, Anthony Liguori

On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
> On 10/16/2012 06:01 AM, Paul Mackerras wrote:
> > +4.78 KVM_PPC_GET_HTAB_FD
> > +
> > +Capability: KVM_CAP_PPC_HTAB_FD
> > +Architectures: powerpc
> > +Type: vm ioctl
> > +Parameters: Pointer to struct kvm_get_htab_fd (in)
> > +Returns: file descriptor number (>= 0) on success, -1 on error
> > +
> > +This returns a file descriptor that can be used either to read out the
> > +entries in the guest's hashed page table (HPT), or to write entries to
> > +initialize the HPT.  The returned fd can only be written to if the
> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
> > +can only be read if that bit is clear.  The argument struct looks like
> > +this:
> > +
> > +/* For KVM_PPC_GET_HTAB_FD */
> > +struct kvm_get_htab_fd {
> > +	__u64	flags;
> > +	__u64	start_index;
> > +};
> > +
> > +/* Values for kvm_get_htab_fd.flags */
> > +#define KVM_GET_HTAB_BOLTED_ONLY	((__u64)0x1)
> > +#define KVM_GET_HTAB_WRITE		((__u64)0x2)
> > +
> > +The `start_index' field gives the index in the HPT of the entry at
> > +which to start reading.  It is ignored when writing.
> > +
> > +Reads on the fd will initially supply information about all
> > +"interesting" HPT entries.  Interesting entries are those with the
> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
> > +all entries.  When the end of the HPT is reached, the read() will
> > +return.  
> 
> What happens if the read buffer is smaller than the HPT size?

That's fine; the read stops when it has filled the buffer and a
subsequent read will continue from where the previous one finished.

> What happens if the read buffer size is not a multiple of entry size?

Then we don't use the last few bytes of the buffer.  The read() call
returns the number of bytes that were filled in, of course.  In any
case, the header size is 8 bytes and the HPT entry size is 16 bytes,
so the number of bytes filled in won't necessarily be a multiple of 16
bytes.
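
Just to make that concrete, here is a minimal sketch of a consumer of
the stream (not part of the patch; save_hpte() is a stand-in for
whatever userspace does with each entry, and I am assuming the
kvm_get_htab_header definition from the patch is visible via
<linux/kvm.h>):

	#include <stdio.h>
	#include <unistd.h>
	#include <linux/kvm.h>

	/* stand-in for whatever userspace does with each entry */
	static void save_hpte(unsigned long idx, __u64 v, __u64 r)
	{
		printf("%lu: %016llx %016llx\n", idx,
		       (unsigned long long)v, (unsigned long long)r);
	}

	static void drain_htab(int fd)
	{
		unsigned long long buf[2048];	/* keeps records aligned */
		ssize_t nb;

		while ((nb = read(fd, buf, sizeof(buf))) > 0) {
			char *p = (char *)buf;

			/* read() only returns whole records */
			while (p + sizeof(struct kvm_get_htab_header) <=
			       (char *)buf + nb) {
				struct kvm_get_htab_header *hdr = (void *)p;
				__u64 *hpte = (__u64 *)(hdr + 1);
				unsigned int k;

				for (k = 0; k < hdr->n_valid; ++k)
					save_hpte(hdr->index + k,
						  hpte[2 * k], hpte[2 * k + 1]);
				/* the n_invalid entries occupy no space
				 * in the stream, so just skip past the
				 * valid ones */
				p = (char *)(hpte + 2 * hdr->n_valid);
			}
		}
	}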

> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
> to change).

No.

> > If read() is called again on the fd, it will start again from
> > +the beginning of the HPT, but will only return HPT entries that have
> > +changed since they were last read.
> > +
> > +Data read or written is structured as a header (8 bytes) followed by a
> > +series of valid HPT entries (16 bytes) each.  The header indicates how
> > +many valid HPT entries there are and how many invalid entries follow
> > +the valid entries.  The invalid entries are not represented explicitly
> > +in the stream.  The header format is:
> > +
> > +struct kvm_get_htab_header {
> > +	__u32	index;
> > +	__u16	n_valid;
> > +	__u16	n_invalid;
> > +};
> 
> This structure forces the kernel to return entries sequentially.  Will
> this block changing the data structure in the future?  Or is the
> hardware spec sufficiently strict that such changes are not realistic?

By "data structure", do you mean the stream format on the file
descriptor, or the HPT structure?  If we want a new stream format,
then we would define a bit in the flags field of struct
kvm_get_htab_fd to mean "I want the new stream format".  The code
fails the ioctl if any unknown flag bits are set, so a new userspace
that wants to use the new format could then detect that it is running
on an old kernel and fall back to the old format.

The HPT entry format is very unlikely to change in size or basic
layout (though the architects do redefine some of the bits
occasionally).
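
Concretely, the fallback could look like this (KVM_GET_HTAB_NEW_FORMAT
is an invented flag, shown purely to illustrate the probe, and vmfd is
the VM file descriptor):

	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* hypothetical future flag, for illustration only */
	#define KVM_GET_HTAB_NEW_FORMAT	((__u64)0x4)

	static int open_htab_fd(int vmfd)
	{
		struct kvm_get_htab_fd ghf = {
			.flags = KVM_GET_HTAB_NEW_FORMAT,
			.start_index = 0,
		};
		int fd = ioctl(vmfd, KVM_PPC_GET_HTAB_FD, &ghf);

		if (fd < 0 && errno == EINVAL) {
			/* old kernel: it rejects unknown flag bits,
			 * so fall back to the original stream format */
			ghf.flags = 0;
			fd = ioctl(vmfd, KVM_PPC_GET_HTAB_FD, &ghf);
		}
		return fd;
	}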

> > +
> > +Writes to the fd create HPT entries starting at the index given in the
> > +header; first `n_valid' valid entries with contents from the data
> > +written, then `n_invalid' invalid entries, invalidating any previously
> > +valid entries found.
> 
> This scheme is a clever, original, and very interesting approach to live
> migration.  That doesn't necessarily mean a NAK, we should see if it
> makes sense for other migration APIs as well (we currently have
> difficulties migrating very large/wide guests).
> 
> What is the typical number of entries in the HPT?  Do you have estimates
> of the change rate?

Typically the HPT would have about a million entries, i.e. it would be
16MiB in size.  The usual guideline is to make it about 1/64 of the
maximum amount of RAM the guest could ever have, rounded up to a power
of two, although we often run with less, say 1/128 or even 1/256.
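
As a sketch of the arithmetic only (the 256kB floor is an arbitrary
choice for the example):

	/* guideline: HPT bytes ~ max_ram / 64, rounded up to a power of 2 */
	static unsigned long hpt_bytes_for(unsigned long max_ram)
	{
		unsigned long size = 1UL << 18;	/* arbitrary floor */

		while (size < max_ram / 64)
			size <<= 1;
		return size;	/* HPTE count = size / 16 */
	}

	/* e.g. max_ram = 1GiB: 1GiB / 64 = 16MiB, i.e. about a million
	 * 16-byte entries, matching the figures above */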

Because it is a hash table, updates tend to be scattered throughout
the whole table, which is another reason why per-page dirty tracking
and updates would be pretty inefficient.

As for the change rate, it depends on the application of course, but
basically every time the guest changes a PTE in its Linux page tables
we do the corresponding change to the corresponding HPT entry, so the
rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
exec, etc. have a high rate of HPT updates.

> Suppose new hardware arrives that supports nesting HPTs, so that kvm is
> no longer synchronously aware of the guest HPT (similar to how NPT/EPT
> made kvm unaware of guest virtual->physical translations on x86).  How
> will we deal with that?  (But I guess this will be a
> non-guest-transparent and non-userspace-transparent change, unlike
> NPT/EPT, so a userspace ABI addition will be needed anyway).

Nested HPTs or other changes to the MMU architecture would certainly
need new guest kernels and new support in KVM.  With a nested
approach, the guest-side MMU data structures (HPT or whatever) would
presumably be in guest memory and thus be handled along with all the
other guest memory, while the host-side MMU data structures would not
need to be saved, so from the migration point of view that would make
it all a lot simpler.

Paul.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
  2012-10-16 11:58       ` Paul Mackerras
@ 2012-10-16 13:06         ` Avi Kivity
  -1 siblings, 0 replies; 30+ messages in thread
From: Avi Kivity @ 2012-10-16 13:06 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alexander Graf, kvm, kvm-ppc, David Gibson, Juan Quintela,
	Orit Wasserman, Anthony Liguori

On 10/16/2012 01:58 PM, Paul Mackerras wrote:
> On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
>> On 10/16/2012 06:01 AM, Paul Mackerras wrote:
>> > +4.78 KVM_PPC_GET_HTAB_FD
>> > +
>> > +Capability: KVM_CAP_PPC_HTAB_FD
>> > +Architectures: powerpc
>> > +Type: vm ioctl
>> > +Parameters: Pointer to struct kvm_get_htab_fd (in)
>> > +Returns: file descriptor number (>= 0) on success, -1 on error
>> > +
>> > +This returns a file descriptor that can be used either to read out the
>> > +entries in the guest's hashed page table (HPT), or to write entries to
>> > +initialize the HPT.  The returned fd can only be written to if the
>> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
>> > +can only be read if that bit is clear.  The argument struct looks like
>> > +this:
>> > +
>> > +/* For KVM_PPC_GET_HTAB_FD */
>> > +struct kvm_get_htab_fd {
>> > +	__u64	flags;
>> > +	__u64	start_index;
>> > +};
>> > +
>> > +/* Values for kvm_get_htab_fd.flags */
>> > +#define KVM_GET_HTAB_BOLTED_ONLY	((__u64)0x1)
>> > +#define KVM_GET_HTAB_WRITE		((__u64)0x2)
>> > +
>> > +The `start_index' field gives the index in the HPT of the entry at
>> > +which to start reading.  It is ignored when writing.
>> > +
>> > +Reads on the fd will initially supply information about all
>> > +"interesting" HPT entries.  Interesting entries are those with the
>> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
>> > +all entries.  When the end of the HPT is reached, the read() will
>> > +return.  
>> 
>> What happens if the read buffer is smaller than the HPT size?
> 
> That's fine; the read stops when it has filled the buffer and a
> subsequent read will continue from where the previous one finished.
> 
>> What happens if the read buffer size is not a multiple of entry size?
> 
> Then we don't use the last few bytes of the buffer.  The read() call
> returns the number of bytes that were filled in, of course.  In any
> case, the header size is 8 bytes and the HPT entry size is 16 bytes,
> so the number of bytes filled in won't necessarily be a multiple of 16
> bytes.

That's sane and expected, but it should be documented.

> 
>> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
>> to change).
> 
> No.

This forces userspace to dedicate a thread for the HPT.

> 
>> > If read() is called again on the fd, it will start again from
>> > +the beginning of the HPT, but will only return HPT entries that have
>> > +changed since they were last read.
>> > +
>> > +Data read or written is structured as a header (8 bytes) followed by a
>> > +series of valid HPT entries (16 bytes) each.  The header indicates how
>> > +many valid HPT entries there are and how many invalid entries follow
>> > +the valid entries.  The invalid entries are not represented explicitly
>> > +in the stream.  The header format is:
>> > +
>> > +struct kvm_get_htab_header {
>> > +	__u32	index;
>> > +	__u16	n_valid;
>> > +	__u16	n_invalid;
>> > +};
>> 
>> This structure forces the kernel to return entries sequentially.  Will
>> this block changing the data structure in the future?  Or is the
>> hardware spec sufficiently strict that such changes are not realistic?
> 
> By "data structure", do you mean the stream format on the file
> descriptor, or the HPT structure?  If we want a new stream format,
> then we would define a bit in the flags field of struct
> kvm_get_htab_fd to mean "I want the new stream format".  The code
> fails the ioctl if any unknown flag bits are set, so a new userspace
> that wants to use the new format could then detect that it is running
> on an old kernel and fall back to the old format.
> 
> The HPT entry format is very unlikely to change in size or basic
> layout (though the architects do redefine some of the bits
> occasionally).

I meant the internal data structure that holds HPT entries.

I guess I don't understand the index.  Do we expect changes to be in
contiguous ranges?  And invalid entries to be contiguous as well?  That
doesn't fit with how hash tables work.  Does the index represent the
position of the entry within the table, or something else?


> 
>> > +
>> > +Writes to the fd create HPT entries starting at the index given in the
>> > +header; first `n_valid' valid entries with contents from the data
>> > +written, then `n_invalid' invalid entries, invalidating any previously
>> > +valid entries found.
>> 
>> This scheme is a clever, original, and very interesting approach to live
>> migration.  That doesn't necessarily mean a NAK, we should see if it
>> makes sense for other migration APIs as well (we currently have
>> difficulties migrating very large/wide guests).
>> 
>> What is the typical number of entries in the HPT?  Do you have estimates
>> of the change rate?
> 
> Typically the HPT would have about a million entries, i.e. it would be
> 16MiB in size.  The usual guideline is to make it about 1/64 of the
> maximum amount of RAM the guest could ever have, rounded up to a power
> of two, although we often run with less, say 1/128 or even 1/256.

16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
it warrant a live migration protocol?

> Because it is a hash table, updates tend to be scattered throughout
> the whole table, which is another reason why per-page dirty tracking
> and updates would be pretty inefficient.

This suggests a stream format that includes the index in every entry.

> 
> As for the change rate, it depends on the application of course, but
> basically every time the guest changes a PTE in its Linux page tables
> we do the corresponding change to the corresponding HPT entry, so the
> rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
> exec, etc. have a high rate of HPT updates.

If the rate is high enough, then there's no point in a live update.

> 
>> Suppose new hardware arrives that supports nesting HPTs, so that kvm is
>> no longer synchronously aware of the guest HPT (similar to how NPT/EPT
>> made kvm unaware of guest virtual->physical translations on x86).  How
>> will we deal with that?  (But I guess this will be a
>> non-guest-transparent and non-userspace-transparent change, unlike
>> NPT/EPT, so a userspace ABI addition will be needed anyway.
> 
> Nested HPTs or other changes to the MMU architecture would certainly
> need new guest kernels and new support in KVM.  With a nested
> approach, the guest-side MMU data structures (HPT or whatever) would
> presumably be in guest memory and thus be handled along with all the
> other guest memory, while the host-side MMU data structures would not
> need to be saved, so from the migration point of view that would make
> it all a lot simpler.

Yeah.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
  2012-10-16 13:06         ` Avi Kivity
@ 2012-10-16 20:03           ` Anthony Liguori
  -1 siblings, 0 replies; 30+ messages in thread
From: Anthony Liguori @ 2012-10-16 20:03 UTC (permalink / raw)
  To: Avi Kivity, Paul Mackerras
  Cc: Alexander Graf, kvm, kvm-ppc, David Gibson, Juan Quintela,
	Orit Wasserman

Avi Kivity <avi@redhat.com> writes:

> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
>> On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
>>> On 10/16/2012 06:01 AM, Paul Mackerras wrote:
>>> > +4.78 KVM_PPC_GET_HTAB_FD
>>> > +
>>> > +Capability: KVM_CAP_PPC_HTAB_FD
>>> > +Architectures: powerpc
>>> > +Type: vm ioctl
>>> > +Parameters: Pointer to struct kvm_get_htab_fd (in)
>>> > +Returns: file descriptor number (>= 0) on success, -1 on error
>>> > +
>>> > +This returns a file descriptor that can be used either to read out the
>>> > +entries in the guest's hashed page table (HPT), or to write entries to
>>> > +initialize the HPT.  The returned fd can only be written to if the
>>> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
>>> > +can only be read if that bit is clear.  The argument struct looks like
>>> > +this:
>>> > +
>>> > +/* For KVM_PPC_GET_HTAB_FD */
>>> > +struct kvm_get_htab_fd {
>>> > +	__u64	flags;
>>> > +	__u64	start_index;
>>> > +};
>>> > +
>>> > +/* Values for kvm_get_htab_fd.flags */
>>> > +#define KVM_GET_HTAB_BOLTED_ONLY	((__u64)0x1)
>>> > +#define KVM_GET_HTAB_WRITE		((__u64)0x2)
>>> > +
>>> > +The `start_index' field gives the index in the HPT of the entry at
>>> > +which to start reading.  It is ignored when writing.
>>> > +
>>> > +Reads on the fd will initially supply information about all
>>> > +"interesting" HPT entries.  Interesting entries are those with the
>>> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
>>> > +all entries.  When the end of the HPT is reached, the read() will
>>> > +return.  
>>> 
>>> What happens if the read buffer is smaller than the HPT size?
>> 
>> That's fine; the read stops when it has filled the buffer and a
>> subsequent read will continue from where the previous one finished.
>> 
>>> What happens if the read buffer size is not a multiple of entry size?
>> 
>> Then we don't use the last few bytes of the buffer.  The read() call
>> returns the number of bytes that were filled in, of course.  In any
>> case, the header size is 8 bytes and the HPT entry size is 16 bytes,
>> so the number of bytes filled in won't necessarily be a multiple of 16
>> bytes.
>
> That's sane and expected, but it should be documented.
>
>> 
>>> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
>>> to change).
>> 
>> No.
>
> This forces userspace to dedicate a thread for the HPT.

If no changes are available, does read return a size > 0?  I don't think
it's necessary to support polling.  The kernel should always be able to
respond to userspace here.  The only catch is whether read() should
return a nonzero size when there are no changes.

In any case, I can't see why a dedicated thread is needed.  QEMU is
going to poll the HPT based on how fast we can send data over the wire.
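
To make that concrete, here is a minimal sketch of such a loop, assuming
the ioctl and structures proposed in this series; vm_fd, send_to_dest()
and the buffer size are placeholders, not part of the patch:

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

extern void send_to_dest(const void *buf, size_t len);

static int open_htab_fd(int vm_fd)
{
        struct kvm_get_htab_fd ghf = {
                .flags = 0,             /* read mode, all entries */
                .start_index = 0,
        };

        return ioctl(vm_fd, KVM_PPC_GET_HTAB_FD, &ghf);
}

/* One pass over the HPT.  read() returns 0 at the end of a pass;
 * calling this again on the same fd transfers only the entries that
 * changed since they were last read. */
static ssize_t send_htab_pass(int htab_fd)
{
        char buf[65536];        /* arbitrary size; kernel fills whole records */
        ssize_t n, total = 0;

        while ((n = read(htab_fd, buf, sizeof(buf))) > 0) {
                send_to_dest(buf, n);
                total += n;
        }

        return n < 0 ? n : total;
}

The caller would repeat send_htab_pass() until the returned byte count
is small enough to justify stopping the guest for a final pass.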

>>> > If read() is called again on the fd, it will start again from
>>> > +the beginning of the HPT, but will only return HPT entries that have
>>> > +changed since they were last read.
>>> > +
>>> > +Data read or written is structured as a header (8 bytes) followed by a
>>> > +series of valid HPT entries (16 bytes) each.  The header indicates how
>>> > +many valid HPT entries there are and how many invalid entries follow
>>> > +the valid entries.  The invalid entries are not represented explicitly
>>> > +in the stream.  The header format is:
>>> > +
>>> > +struct kvm_get_htab_header {
>>> > +	__u32	index;
>>> > +	__u16	n_valid;
>>> > +	__u16	n_invalid;
>>> > +};
>>> 
>>> This structure forces the kernel to return entries sequentially.  Will
>>> this block changing the data structure in the future?  Or is the
>>> hardware spec sufficiently strict that such changes are not realistic?
>> 
>> By "data structure", do you mean the stream format on the file
>> descriptor, or the HPT structure?  If we want a new stream format,
>> then we would define a bit in the flags field of struct
>> kvm_get_htab_fd to mean "I want the new stream format".  The code
>> fails the ioctl if any unknown flag bits are set, so a new userspace
>> that wants to use the new format could then detect that it is running
>> on an old kernel and fall back to the old format.
>> 
>> The HPT entry format is very unlikely to change in size or basic
>> layout (though the architects do redefine some of the bits
>> occasionally).
>
> I meant the internal data structure that holds HPT entries.
>
> I guess I don't understand the index.  Do we expect changes to be in
> contiguous ranges?  And invalid entries to be contiguous as well?  That
> doesn't fit with how hash tables work.  Does the index represent the
> position of the entry within the table, or something else?
>
>
>> 
>>> > +
>>> > +Writes to the fd create HPT entries starting at the index given in the
>>> > +header; first `n_valid' valid entries with contents from the data
>>> > +written, then `n_invalid' invalid entries, invalidating any previously
>>> > +valid entries found.
>>> 
>>> This scheme is a clever, original, and very interesting approach to live
>>> migration.  That doesn't necessarily mean a NAK, we should see if it
>>> makes sense for other migration APIs as well (we currently have
>>> difficulties migrating very large/wide guests).
>>> 
>>> What is the typical number of entries in the HPT?  Do you have estimates
>>> of the change rate?
>> 
>> Typically the HPT would have about a million entries, i.e. it would be
>> 16MiB in size.  The usual guideline is to make it about 1/64 of the
>> maximum amount of RAM the guest could ever have, rounded up to a power
>> of two, although we often run with less, say 1/128 or even 1/256.
>
> 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
> it warrant a live migration protocol?

0.15 sec == 150ms.  The typical downtime window is 30ms.  So yeah, I
think it does.

>> Because it is a hash table, updates tend to be scattered throughout
>> the whole table, which is another reason why per-page dirty tracking
>> and updates would be pretty inefficient.
>
> This suggests a stream format that includes the index in every entry.
>
>> 
>> As for the change rate, it depends on the application of course, but
>> basically every time the guest changes a PTE in its Linux page tables
>> we do the corresponding change to the corresponding HPT entry, so the
>> rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
>> exec, etc. have a high rate of HPT updates.
>
> If the rate is high enough, then there's no point in a live update.

Do we have practical data here?

Regards,

Anthony Liguori

>
>> 
>>> Suppose new hardware arrives that supports nesting HPTs, so that kvm is
>>> no longer synchronously aware of the guest HPT (similar to how NPT/EPT
>>> made kvm unaware of guest virtual->physical translations on x86).  How
>>> will we deal with that?  But I guess this will be a
>>> non-guest-transparent and non-userspace-transparent change, unlike
>>> NPT/EPT, so a userspace ABI addition will be needed anyway.
>> 
>> Nested HPTs or other changes to the MMU architecture would certainly
>> need new guest kernels and new support in KVM.  With a nested
>> approach, the guest-side MMU data structures (HPT or whatever) would
>> presumably be in guest memory and thus be handled along with all the
>> other guest memory, while the host-side MMU data structures would not
>> need to be saved, so from the migration point of view that would make
>> it all a lot simpler.
>
> Yeah.
>
>
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
  2012-10-16 13:06         ` Avi Kivity
@ 2012-10-16 21:52           ` Paul Mackerras
  -1 siblings, 0 replies; 30+ messages in thread
From: Paul Mackerras @ 2012-10-16 21:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alexander Graf, kvm, kvm-ppc, David Gibson, Juan Quintela,
	Orit Wasserman, Anthony Liguori

On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
> > On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
> >> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
> >> to change).
> > 
> > No.
> 
> This forces userspace to dedicate a thread for the HPT.

Why? Reads never block in any case.

> >> > If read() is called again on the fd, it will start again from
> >> > +the beginning of the HPT, but will only return HPT entries that have
> >> > +changed since they were last read.
> >> > +
> >> > +Data read or written is structured as a header (8 bytes) followed by a
> >> > +series of valid HPT entries (16 bytes) each.  The header indicates how
> >> > +many valid HPT entries there are and how many invalid entries follow
> >> > +the valid entries.  The invalid entries are not represented explicitly
> >> > +in the stream.  The header format is:
> >> > +
> >> > +struct kvm_get_htab_header {
> >> > +	__u32	index;
> >> > +	__u16	n_valid;
> >> > +	__u16	n_invalid;
> >> > +};
> >> 
> >> This structure forces the kernel to return entries sequentially.  Will
> >> this block changing the data structure in the future?  Or is the
> >> hardware spec sufficiently strict that such changes are not realistic?
> > 
> > By "data structure", do you mean the stream format on the file
> > descriptor, or the HPT structure?  If we want a new stream format,
> > then we would define a bit in the flags field of struct
> > kvm_get_htab_fd to mean "I want the new stream format".  The code
> > fails the ioctl if any unknown flag bits are set, so a new userspace
> > that wants to use the new format could then detect that it is running
> > on an old kernel and fall back to the old format.
> > 
> > The HPT entry format is very unlikely to change in size or basic
> > layout (though the architects do redefine some of the bits
> > occasionally).
> 
> I meant the internal data structure that holds HPT entries.

Oh, that's just an array, and userspace already knows how big it is.

> I guess I don't understand the index.  Do we expect changes to be in
> contiguous ranges?  And invalid entries to be contiguous as well?  That
> doesn't fit with how hash tables work.  Does the index represent the
> position of the entry within the table, or something else?

The index is just the position in the array.  Typically, in each group
of 8 it will tend to be the low-numbered ones that are valid, since
creating an entry usually uses the first empty slot.  So I expect that
on the first pass, most of the records will represent 8 HPTEs.  On
subsequent passes, probably most records will represent a single HPTE.
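
For illustration, a consumer of this stream might apply one read()
buffer to an in-memory copy of the table like this (the header layout
is from the patch; the hpte type, apply_htab_buf() and the zero-fill
convention for invalid entries are my own shorthand):

#include <stdint.h>
#include <string.h>

struct kvm_get_htab_header {
        uint32_t index;
        uint16_t n_valid;
        uint16_t n_invalid;
};

struct hpte {
        uint64_t v, r;          /* the two 8-byte doublewords of an entry */
};

static void apply_htab_buf(struct hpte *hpt, size_t nslots,
                           const char *buf, size_t len)
{
        while (len >= sizeof(struct kvm_get_htab_header)) {
                struct kvm_get_htab_header hdr;
                uint32_t i;

                memcpy(&hdr, buf, sizeof(hdr));
                buf += sizeof(hdr);
                len -= sizeof(hdr);

                if (hdr.index + hdr.n_valid + hdr.n_invalid > nslots)
                        break;          /* malformed record */

                /* n_valid entries carry 16 bytes of data each ... */
                for (i = 0; i < hdr.n_valid && len >= sizeof(struct hpte); i++) {
                        memcpy(&hpt[hdr.index + i], buf, sizeof(struct hpte));
                        buf += sizeof(struct hpte);
                        len -= sizeof(struct hpte);
                }
                /* ... while the n_invalid entries that follow take no
                 * space in the stream; represent them as all-zeroes. */
                for (; i < hdr.n_valid + hdr.n_invalid; i++)
                        memset(&hpt[hdr.index + i], 0, sizeof(struct hpte));
        }
}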

> >> > +
> >> > +Writes to the fd create HPT entries starting at the index given in the
> >> > +header; first `n_valid' valid entries with contents from the data
> >> > +written, then `n_invalid' invalid entries, invalidating any previously
> >> > +valid entries found.
> >> 
> >> This scheme is a clever, original, and very interesting approach to live
> >> migration.  That doesn't necessarily mean a NAK, we should see if it
> >> makes sense for other migration APIs as well (we currently have
> >> difficulties migrating very large/wide guests).
> >> 
> >> What is the typical number of entries in the HPT?  Do you have estimates
> >> of the change rate?
> > 
> > Typically the HPT would have about a million entries, i.e. it would be
> > 16MiB in size.  The usual guideline is to make it about 1/64 of the
> > maximum amount of RAM the guest could ever have, rounded up to a power
> > of two, although we often run with less, say 1/128 or even 1/256.
> 
> 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
> it warrant a live migration protocol?

The qemu people I talked to seemed to think so.

> > Because it is a hash table, updates tend to be scattered throughout
> > the whole table, which is another reason why per-page dirty tracking
> > and updates would be pretty inefficient.
> 
> This suggests a stream format that includes the index in every entry.

That would amount to dropping the n_valid and n_invalid fields from
the current header format.  That would be less efficient for the
initial pass (assuming we achieve an average n_valid of at least 2 on
the initial pass), and probably less efficient for the incremental
updates, since a newly-invalidated entry would have to be represented
as 16 zero bytes rather than just an 8-byte header with n_valid=0 and
n_invalid=1.  I'm assuming here that the initial pass would omit
invalid entries.
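
As a quick check of that break-even point (my arithmetic, not a
measurement: the alternative format is assumed to spend an 8-byte index
alongside each 16-byte entry, i.e. 24 bytes per HPTE):

#include <stdio.h>

int main(void)
{
        int n_valid;

        for (n_valid = 1; n_valid <= 8; n_valid++) {
                /* current format: one 8-byte header per n_valid entries */
                double current = (8.0 + 16.0 * n_valid) / n_valid;
                printf("avg n_valid=%d: %4.1f vs 24.0 bytes/HPTE\n",
                       n_valid, current);
        }
        return 0;
}

The two formats tie at an average n_valid of 1, and the header-based
format wins from n_valid=2 on, matching the assumption above.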

> > 
> > As for the change rate, it depends on the application of course, but
> > basically every time the guest changes a PTE in its Linux page tables
> > we do the corresponding change to the corresponding HPT entry, so the
> > rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
> > exec, etc. have a high rate of HPT updates.
> 
> If the rate is high enough, then there's no point in a live update.

True, but doesn't that argument apply to memory pages as well?

Paul.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
  2012-10-16 20:03           ` Anthony Liguori
@ 2012-10-17 10:27             ` Avi Kivity
  -1 siblings, 0 replies; 30+ messages in thread
From: Avi Kivity @ 2012-10-17 10:27 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Paul Mackerras, Alexander Graf, kvm, kvm-ppc, David Gibson,
	Juan Quintela, Orit Wasserman

On 10/16/2012 10:03 PM, Anthony Liguori wrote:
>>
>> This forces userspace to dedicate a thread for the HPT.
> 
> If no changes are available, does read return a size > 0?  I don't think
> it's necessary to support polling.  The kernel should always be able to
> respond to userspace here.  The only catch is whether to return !0 read
> sizes when there are no changes.
> 
> At any case, I can't see why a dedicated thread is needed.  QEMU is
> going to poll HPT based on how fast we can send data over the wire.

That means spinning if we can send the data faster than we dirty it.
But we do that anyway for memory.



-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT
  2012-10-16 21:52           ` Paul Mackerras
@ 2012-10-17 10:31             ` Avi Kivity
  -1 siblings, 0 replies; 30+ messages in thread
From: Avi Kivity @ 2012-10-17 10:31 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Alexander Graf, kvm, kvm-ppc, David Gibson, Juan Quintela,
	Orit Wasserman, Anthony Liguori

On 10/16/2012 11:52 PM, Paul Mackerras wrote:
> On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
>> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
>> > On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
>> >> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
>> >> to change).
>> > 
>> > No.
>> 
>> This forces userspace to dedicate a thread for the HPT.
> 
> Why? Reads never block in any case.

Ok.  This parallels KVM_GET_DIRTY_LOG.

>> 
>> I meant the internal data structure that holds HPT entries.
> 
> Oh, that's just an array, and userspace already knows how big it is.
> 
>> I guess I don't understand the index.  Do we expect changes to be in
>> contiguous ranges?  And invalid entries to be contiguous as well?  That
>> doesn't fit with how hash tables work.  Does the index represent the
>> position of the entry within the table, or something else?
> 
> The index is just the position in the array.  Typically, in each group
> of 8 it will tend to be the low-numbered ones that are valid, since
> creating an entry usually uses the first empty slot.  So I expect that
> on the first pass, most of the records will represent 8 HPTEs.  On
> subsequent passes, probably most records will represent a single HPTE.

So it's a form of RLE compression.  Ok.
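
An invented example of that encoding: a group at index 512 whose first
three slots are valid and whose remaining five are empty would appear
in the stream as

#include <stdint.h>

struct kvm_get_htab_header {
        uint32_t index;
        uint16_t n_valid;
        uint16_t n_invalid;
};

static const struct kvm_get_htab_header example = {
        .index     = 512,       /* first slot of the group */
        .n_valid   = 3,         /* three 16-byte HPTEs follow */
        .n_invalid = 5,         /* implied empty slots, no data bytes */
};
/* 8-byte header + 3 * 16 bytes = 56 bytes for the whole 8-slot group */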

>> 
>> 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
>> it warrant a live migration protocol?
> 
> The qemu people I talked to seemed to think so.
> 
>> > Because it is a hash table, updates tend to be scattered throughout
>> > the whole table, which is another reason why per-page dirty tracking
>> > and updates would be pretty inefficient.
>> 
>> This suggests a stream format that includes the index in every entry.
> 
> That would amount to dropping the n_valid and n_invalid fields from
> the current header format.  That would be less efficient for the
> initial pass (assuming we achieve an average n_valid of at least 2 on
> the initial pass), and probably less efficient for the incremental
> updates, since a newly-invalidated entry would have to be represented
> as 16 zero bytes rather than just an 8-byte header with n_valid=0 and
> n_invalid=1.  I'm assuming here that the initial pass would omit
> invalid entries.

I agree.  But let's have some measurements to make sure.

> 
>> > 
>> > As for the change rate, it depends on the application of course, but
>> > basically every time the guest changes a PTE in its Linux page tables
>> > we do the corresponding change to the corresponding HPT entry, so the
>> > rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
>> > exec, etc. have a high rate of HPT updates.
>> 
>> If the rate is high enough, then there's no point in a live update.
> 
> True, but doesn't that argument apply to memory pages as well?

In some cases it does.  The question is what happens in practice.  If
you migrate a kernel build, how many entries are sent in the guest
stopped phase?


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-10-17 10:32 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-16  3:58 [PATCH 0/5] KVM: PPC: Book3S HV: HPT read/write functions for userspace Paul Mackerras
2012-10-16  3:59 ` [PATCH 1/5] KVM: Provide mmu notifier retry test based on struct kvm Paul Mackerras
2012-10-16  9:44   ` Avi Kivity
2012-10-16 10:06     ` Alexander Graf
2012-10-16  4:00 ` [PATCH 2/5] KVM: PPC: Book3S HV: Restructure HPT entry creation code Paul Mackerras
2012-10-16  4:00 ` [PATCH 3/5] KVM: PPC: Book3S HV: Add a mechanism for recording modified HPTEs Paul Mackerras
2012-10-16  4:01 ` [PATCH 4/5] KVM: PPC: Book3S HV: Make a HPTE removal function available Paul Mackerras
2012-10-16  4:01 ` [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT Paul Mackerras
2012-10-16 10:06   ` Avi Kivity
2012-10-16 11:58     ` Paul Mackerras
2012-10-16 13:06       ` Avi Kivity
2012-10-16 20:03         ` Anthony Liguori
2012-10-17 10:27           ` Avi Kivity
2012-10-16 21:52         ` Paul Mackerras
2012-10-17 10:31           ` Avi Kivity
