linuxppc-dev.lists.ozlabs.org archive mirror
* [RFC PATCH 0/2] KVM: PPC: Book3S HV: Report stolen time to guests
@ 2011-12-20 10:21 Paul Mackerras
  2011-12-20 10:22 ` [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust Paul Mackerras
  2011-12-20 10:37 ` [KVM PATCH 2/2] KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log Paul Mackerras
  0 siblings, 2 replies; 9+ messages in thread
From: Paul Mackerras @ 2011-12-20 10:21 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc

Under pHyp, recent kernels use the dispatch trace log (DTL) to measure
stolen time.  The DTL is a ring buffer containing 48-byte entries,
where the hypervisor creates an entry each time a virtual cpu is
dispatched.  The entries contain a couple of fields that the kernel
interprets as stolen time, measured in timebase ticks.
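
For reference, the 48-byte layout is that of the kernel's struct
dtl_entry (field names as in arch/powerpc/include/asm/lppaca.h of this
era; an illustrative sketch, not authoritative):

	struct dtl_entry {
		u8	dispatch_reason;
		u8	preempt_reason;
		u16	processor_id;
		u32	enqueue_to_dispatch_time;	/* read by the guest as stolen time */
		u32	ready_to_enqueue_time;
		u32	waiting_to_ready_time;
		u64	timebase;
		u64	fault_addr;
		u64	srr0;
		u64	srr1;
	};	/* 1+1+2+4+4+4+8+8+8+8 = 48 bytes */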

Although this is not an ideal interface, it is one that our guest
kernels already support.  So this series of patches adds code to
Book3S HV KVM to measure stolen time and report it to the guest via a
dispatch trace log.

Stolen time is measured per virtual core (a set of 4 vcpus on POWER7)
as all the time when no vcpu thread is executing inside
kvmppc_run_core(), plus any time when the vcpu thread running the
virtual core is preempted.
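
Condensed to its essentials, the accounting added in patch 2 amounts
to this (a sketch, not the literal code):

	/* virtual core stops running: remember when */
	vc->preempt_tb = mftb();
	...
	/* virtual core runs again: bank the gap as stolen time */
	vc->stolen_tb += mftb() - vc->preempt_tb;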

The first patch fixes some potential races with the registration and
unregistration of the DTL and the other per-virtual-processor areas,
since the guest can (un)register a per-virtual-processor area for one
vcpu in a call to the H_REGISTER_VPA hypercall on another vcpu, and
hence potentially while KVM is using a previously-registered area.

The second patch adds the machinery for measuring stolen time and for
creating DTL entries.

Paul.


* [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust
  2011-12-20 10:21 [RFC PATCH 0/2] KVM: PPC: Book3S HV: Report stolen time to guests Paul Mackerras
@ 2011-12-20 10:22 ` Paul Mackerras
  2012-01-16 13:04   ` Alexander Graf
  2011-12-20 10:37 ` [KVM PATCH 2/2] KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log Paul Mackerras
  1 sibling, 1 reply; 9+ messages in thread
From: Paul Mackerras @ 2011-12-20 10:22 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc

The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU.  Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration.  If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.

This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields aren't
being used.  These are updated from a set of 'next_*' fields, which
are protected by a spinlock.  (We could have just taken the spinlock
when using the vpa, slb_shadow or dtl fields, but that would mean
taking the spinlock on every guest entry and exit.)

The code in do_h_register_vpa now takes the spinlock and updates the
'next_*' fields.  There is also a set of '*_pending' flags to indicate
that an update is pending.
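
Condensed from the diff below, the handoff between the two sides looks
like this:

	/* registering vcpu, in do_h_register_vpa(), under vpa_update_lock: */
	free_va = tvcpu->arch.next_vpa;	/* discard any not-yet-applied update */
	tvcpu->arch.next_vpa = va;
	tvcpu->arch.vpa_pending = 1;

	/* target vcpu, before guest entry, in kvmppc_update_vpas(): */
	if (vcpu->arch.vpa_pending) {
		if (vcpu->arch.vpa)
			kvmppc_unpin_guest_page(kvm, vcpu->arch.vpa);
		vcpu->arch.vpa = vcpu->arch.next_vpa;
		vcpu->arch.next_vpa = NULL;
		vcpu->arch.vpa_pending = 0;
	}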

This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_host.h |   15 +++-
 arch/powerpc/kvm/book3s_hv.c        |  167 +++++++++++++++++++++++++----------
 2 files changed, 131 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 1cb6e52..b1126c1 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -82,7 +82,7 @@ struct kvm_vcpu;
 
 struct lppaca;
 struct slb_shadow;
-struct dtl;
+struct dtl_entry;
 
 struct kvm_vm_stat {
 	u32 remote_tlb_flush;
@@ -449,9 +449,18 @@ struct kvm_vcpu_arch {
 	u32 last_inst;
 
 	struct lppaca *vpa;
+	struct lppaca *next_vpa;
 	struct slb_shadow *slb_shadow;
-	struct dtl *dtl;
-	struct dtl *dtl_end;
+	struct slb_shadow *next_slb_shadow;
+	struct dtl_entry *dtl;
+	struct dtl_entry *dtl_end;
+	struct dtl_entry *dtl_ptr;
+	struct dtl_entry *next_dtl;
+	struct dtl_entry *next_dtl_end;
+	u8 vpa_pending;
+	u8 slb_shadow_pending;
+	u8 dtl_pending;
+	spinlock_t vpa_update_lock;
 
 	wait_queue_head_t *wqp;
 	struct kvmppc_vcore *vcore;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index c11d960..6f6e88d 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -140,7 +140,7 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
 {
 	struct kvm *kvm = vcpu->kvm;
 	unsigned long len, nb;
-	void *va;
+	void *va, *free_va, *tvpa, *dtl, *ss;
 	struct kvm_vcpu *tvcpu;
 	int err = H_PARAMETER;
 
@@ -152,6 +152,8 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
 	flags &= 7;
 	if (flags == 0 || flags == 4)
 		return H_PARAMETER;
+	free_va = va = NULL;
+	len = 0;
 	if (flags < 4) {
 		if (vpa & 0x7f)
 			return H_PARAMETER;
@@ -165,65 +167,122 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
 			len = *(unsigned short *)(va + 4);
 		else
 			len = *(unsigned int *)(va + 4);
+		free_va = va;
 		if (len > nb)
 			goto out_unpin;
-		switch (flags) {
-		case 1:		/* register VPA */
-			if (len < 640)
-				goto out_unpin;
-			if (tvcpu->arch.vpa)
-				kvmppc_unpin_guest_page(kvm, vcpu->arch.vpa);
-			tvcpu->arch.vpa = va;
-			init_vpa(vcpu, va);
-			break;
-		case 2:		/* register DTL */
-			if (len < 48)
-				goto out_unpin;
-			len -= len % 48;
-			if (tvcpu->arch.dtl)
-				kvmppc_unpin_guest_page(kvm, vcpu->arch.dtl);
-			tvcpu->arch.dtl = va;
-			tvcpu->arch.dtl_end = va + len;
+	}
+
+	spin_lock(&tvcpu->arch.vpa_update_lock);
+
+	switch (flags) {
+	case 1:		/* register VPA */
+		if (len < 640)
 			break;
-		case 3:		/* register SLB shadow buffer */
-			if (len < 16)
-				goto out_unpin;
-			if (tvcpu->arch.slb_shadow)
-				kvmppc_unpin_guest_page(kvm, vcpu->arch.slb_shadow);
-			tvcpu->arch.slb_shadow = va;
+		free_va = tvcpu->arch.next_vpa;
+		tvcpu->arch.next_vpa = va;
+		tvcpu->arch.vpa_pending = 1;
+		init_vpa(tvcpu, va);
+		err = 0;
+		break;
+	case 2:		/* register DTL */
+		if (len < 48)
 			break;
+		len -= len % 48;
+		tvpa = tvcpu->arch.vpa;
+		if (tvcpu->arch.vpa_pending)
+			tvpa = tvcpu->arch.next_vpa;
+		err = H_RESOURCE;
+		if (tvpa) {
+			free_va = tvcpu->arch.next_dtl;
+			tvcpu->arch.next_dtl = va;
+			tvcpu->arch.next_dtl_end = va + len;
+			tvcpu->arch.dtl_pending = 1;
+			err = 0;
 		}
-	} else {
-		switch (flags) {
-		case 5:		/* unregister VPA */
-			if (tvcpu->arch.slb_shadow || tvcpu->arch.dtl)
-				return H_RESOURCE;
-			if (!tvcpu->arch.vpa)
-				break;
-			kvmppc_unpin_guest_page(kvm, tvcpu->arch.vpa);
-			tvcpu->arch.vpa = NULL;
-			break;
-		case 6:		/* unregister DTL */
-			if (!tvcpu->arch.dtl)
-				break;
-			kvmppc_unpin_guest_page(kvm, tvcpu->arch.dtl);
-			tvcpu->arch.dtl = NULL;
-			break;
-		case 7:		/* unregister SLB shadow buffer */
-			if (!tvcpu->arch.slb_shadow)
-				break;
-			kvmppc_unpin_guest_page(kvm, tvcpu->arch.slb_shadow);
-			tvcpu->arch.slb_shadow = NULL;
+		break;
+	case 3:		/* register SLB shadow buffer */
+		if (len < 16)
 			break;
+		tvpa = tvcpu->arch.vpa;
+		if (tvcpu->arch.vpa_pending)
+			tvpa = tvcpu->arch.next_vpa;
+		err = H_RESOURCE;
+		if (tvpa) {
+			free_va = tvcpu->arch.next_slb_shadow;
+			tvcpu->arch.next_slb_shadow = va;
+			tvcpu->arch.slb_shadow_pending = 1;
+			err = 0;
+		}
+		break;
+
+	case 5:		/* unregister VPA */
+		dtl = tvcpu->arch.dtl;
+		if (tvcpu->arch.dtl_pending)
+			dtl = tvcpu->arch.next_dtl;
+		ss = tvcpu->arch.slb_shadow;
+		if (tvcpu->arch.slb_shadow_pending)
+			ss = tvcpu->arch.next_slb_shadow;
+		err = H_RESOURCE;
+		if (!dtl && !ss) {
+			free_va = tvcpu->arch.next_vpa;
+			tvcpu->arch.next_vpa = NULL;
+			tvcpu->arch.vpa_pending = 1;
+			err = 0;
 		}
+		break;
+	case 6:		/* unregister DTL */
+		free_va = tvcpu->arch.next_dtl;
+		tvcpu->arch.next_dtl = NULL;
+		tvcpu->arch.dtl_pending = 1;
+		err = 0;
+		break;
+	case 7:		/* unregister SLB shadow buffer */
+		free_va = tvcpu->arch.next_slb_shadow;
+		tvcpu->arch.next_slb_shadow = NULL;
+		tvcpu->arch.slb_shadow_pending = 1;
+		err = 0;
+		break;
 	}
-	return H_SUCCESS;
 
+	spin_unlock(&tvcpu->arch.vpa_update_lock);
  out_unpin:
-	kvmppc_unpin_guest_page(kvm, va);
+	if (free_va)
+		kvmppc_unpin_guest_page(kvm, free_va);
 	return err;
 }
 
+static void kvmppc_update_vpas(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	spin_lock(&vcpu->arch.vpa_update_lock);
+	if (vcpu->arch.vpa_pending) {
+		if (vcpu->arch.vpa)
+			kvmppc_unpin_guest_page(kvm, vcpu->arch.vpa);
+		vcpu->arch.vpa = vcpu->arch.next_vpa;
+		vcpu->arch.next_vpa = NULL;
+		vcpu->arch.vpa_pending = 0;
+	}
+	if (vcpu->arch.slb_shadow_pending) {
+		if (vcpu->arch.slb_shadow)
+			kvmppc_unpin_guest_page(kvm, vcpu->arch.slb_shadow);
+		vcpu->arch.slb_shadow = vcpu->arch.next_slb_shadow;
+		vcpu->arch.next_slb_shadow = NULL;
+		vcpu->arch.slb_shadow_pending = 0;
+	}
+	if (vcpu->arch.dtl_pending) {
+		if (vcpu->arch.dtl)
+			kvmppc_unpin_guest_page(kvm, vcpu->arch.dtl);
+		vcpu->arch.dtl = vcpu->arch.dtl_ptr = vcpu->arch.next_dtl;
+		vcpu->arch.dtl_end = vcpu->arch.next_dtl_end;
+		vcpu->arch.next_dtl = NULL;
+		vcpu->arch.dtl_pending = 0;
+		if (vcpu->arch.vpa)	/* (should always be non-NULL) */
+			vcpu->arch.vpa->dtl_idx = 0;
+	}
+	spin_unlock(&vcpu->arch.vpa_update_lock);
+}
+
 int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 {
 	unsigned long req = kvmppc_get_gpr(vcpu, 3);
@@ -509,12 +568,20 @@ out:
 
 void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
 {
+	spin_lock(&vcpu->arch.vpa_update_lock);
 	if (vcpu->arch.dtl)
 		kvmppc_unpin_guest_page(vcpu->kvm, vcpu->arch.dtl);
+	if (vcpu->arch.dtl_pending && vcpu->arch.next_dtl)
+		kvmppc_unpin_guest_page(vcpu->kvm, vcpu->arch.next_dtl);
 	if (vcpu->arch.slb_shadow)
 		kvmppc_unpin_guest_page(vcpu->kvm, vcpu->arch.slb_shadow);
+	if (vcpu->arch.slb_shadow_pending && vcpu->arch.next_slb_shadow)
+		kvmppc_unpin_guest_page(vcpu->kvm, vcpu->arch.next_slb_shadow);
 	if (vcpu->arch.vpa)
 		kvmppc_unpin_guest_page(vcpu->kvm, vcpu->arch.vpa);
+	if (vcpu->arch.vpa_pending && vcpu->arch.next_vpa)
+		kvmppc_unpin_guest_page(vcpu->kvm, vcpu->arch.next_vpa);
+	spin_unlock(&vcpu->arch.vpa_update_lock);
 	kvm_vcpu_uninit(vcpu);
 	kfree(vcpu);
 }
@@ -681,8 +748,12 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
 	vc->in_guest = 0;
 	vc->pcpu = smp_processor_id();
 	vc->napping_threads = 0;
-	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list)
+	list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
 		kvmppc_start_thread(vcpu);
+		if (vcpu->arch.vpa_pending || vcpu->arch.slb_shadow_pending ||
+		    vcpu->arch.dtl_pending)
+			kvmppc_update_vpas(vcpu);
+	}
 
 	preempt_disable();
 	spin_unlock(&vc->lock);
-- 
1.7.7.3


* [KVM PATCH 2/2] KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log
  2011-12-20 10:21 [RFC PATCH 0/2] KVM: PPC: Book3S HV: Report stolen time to guests Paul Mackerras
  2011-12-20 10:22 ` [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust Paul Mackerras
@ 2011-12-20 10:37 ` Paul Mackerras
  2012-01-16 13:11   ` Alexander Graf
  1 sibling, 1 reply; 9+ messages in thread
From: Paul Mackerras @ 2011-12-20 10:37 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc

This adds code to measure "stolen" time per virtual core in units of
timebase ticks, and to report the stolen time to the guest using the
dispatch trace log (DTL).  The guest can register an area of memory
for the DTL for a given vcpu.  The DTL is a ring buffer where KVM
fills in one entry every time it enters the guest for that vcpu.

Stolen time is measured as time when the virtual core is not running,
either because the vcore is not runnable (e.g. some of its vcpus are
executing elsewhere in the kernel or in userspace), or when the vcpu
thread that is running the vcore is preempted.  This includes time
when all the vcpus are idle (i.e. have executed the H_CEDE hypercall),
which is OK because the guest accounts stolen time while idle as idle
time.

Each vcpu keeps a record of how much stolen time has been reported to
the guest for that vcpu so far.  When we are about to enter the guest,
we create a new DTL entry (if the guest vcpu has a DTL) and report the
difference between total stolen time for the vcore and stolen time
reported so far for the vcpu as the "enqueue to dispatch" time in the
DTL entry.
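
In code terms, each dispatch reports the delta and remembers what has
been reported so far (extracted from the diff below):

	dt->enqueue_to_dispatch_time = vc->stolen_tb - vcpu->arch.stolen_logged;
	vcpu->arch.stolen_logged = vc->stolen_tb;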

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/kvm_host.h |    4 +++
 arch/powerpc/kvm/book3s_hv.c        |   43 ++++++++++++++++++++++++++++++++++-
 2 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index b1126c1..3c5ec79 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -258,6 +258,9 @@ struct kvmppc_vcore {
 	struct list_head runnable_threads;
 	spinlock_t lock;
 	wait_queue_head_t wq;
+	u64 stolen_tb;
+	u64 preempt_tb;
+	struct kvm_vcpu *runner;
 };
 
 #define VCORE_ENTRY_COUNT(vc)	((vc)->entry_exit_count & 0xff)
@@ -461,6 +464,7 @@ struct kvm_vcpu_arch {
 	u8 slb_shadow_pending;
 	u8 dtl_pending;
 	spinlock_t vpa_update_lock;
+	u64 stolen_logged;
 
 	wait_queue_head_t *wqp;
 	struct kvmppc_vcore *vcore;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6f6e88d..b835df7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -60,12 +60,20 @@ static int kvmppc_hv_setup_rma(struct kvm_vcpu *vcpu);
 
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+	struct kvmppc_vcore *vc = vcpu->arch.vcore;
+
 	local_paca->kvm_hstate.kvm_vcpu = vcpu;
-	local_paca->kvm_hstate.kvm_vcore = vcpu->arch.vcore;
+	local_paca->kvm_hstate.kvm_vcore = vc;
+	if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE)
+		vc->stolen_tb += mftb() - vc->preempt_tb;
 }
 
 void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	struct kvmppc_vcore *vc = vcpu->arch.vcore;
+
+	if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE)
+		vc->preempt_tb = mftb();
 }
 
 void kvmppc_set_msr(struct kvm_vcpu *vcpu, u64 msr)
@@ -283,6 +291,31 @@ static void kvmppc_update_vpas(struct kvm_vcpu *vcpu)
 	spin_unlock(&vcpu->arch.vpa_update_lock);
 }
 
+static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
+				    struct kvmppc_vcore *vc)
+{
+	struct dtl_entry *dt;
+	struct lppaca *vpa;
+
+	dt = vcpu->arch.dtl_ptr;
+	vpa = vcpu->arch.vpa;
+	if (!dt || !vpa)
+		return;
+	memset(dt, 0, sizeof(struct dtl_entry));
+	dt->dispatch_reason = 7;
+	dt->processor_id = vc->pcpu + vcpu->arch.ptid;
+	dt->timebase = mftb();
+	dt->enqueue_to_dispatch_time = vc->stolen_tb - vcpu->arch.stolen_logged;
+	dt->srr0 = kvmppc_get_pc(vcpu);
+	dt->srr1 = vcpu->arch.shregs.msr;
+	vcpu->arch.stolen_logged = vc->stolen_tb;
+	++dt;
+	if (dt == vcpu->arch.dtl_end)
+		dt = vcpu->arch.dtl;
+	vcpu->arch.dtl_ptr = dt;
+	++vpa->dtl_idx;
+}
+
 int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 {
 	unsigned long req = kvmppc_get_gpr(vcpu, 3);
@@ -542,6 +575,7 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 			INIT_LIST_HEAD(&vcore->runnable_threads);
 			spin_lock_init(&vcore->lock);
 			init_waitqueue_head(&vcore->wq);
+			vcore->preempt_tb = mftb();
 		}
 		kvm->arch.vcores[core] = vcore;
 	}
@@ -554,6 +588,7 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
 	++vcore->num_threads;
 	spin_unlock(&vcore->lock);
 	vcpu->arch.vcore = vcore;
+	vcpu->arch.stolen_logged = vcore->stolen_tb;
 
 	vcpu->arch.cpu_type = KVM_CPU_3S_64;
 	kvmppc_sanity_check(vcpu);
@@ -745,6 +780,7 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
 	vc->nap_count = 0;
 	vc->entry_exit_count = 0;
 	vc->vcore_state = VCORE_RUNNING;
+	vc->stolen_tb += mftb() - vc->preempt_tb;
 	vc->in_guest = 0;
 	vc->pcpu = smp_processor_id();
 	vc->napping_threads = 0;
@@ -753,6 +789,8 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
 		if (vcpu->arch.vpa_pending || vcpu->arch.slb_shadow_pending ||
 		    vcpu->arch.dtl_pending)
 			kvmppc_update_vpas(vcpu);
+		if (vcpu->arch.dtl_ptr)
+			kvmppc_create_dtl_entry(vcpu, vc);
 	}
 
 	preempt_disable();
@@ -805,6 +843,7 @@ static int kvmppc_run_core(struct kvmppc_vcore *vc)
 	spin_lock(&vc->lock);
  out:
 	vc->vcore_state = VCORE_INACTIVE;
+	vc->preempt_tb = mftb();
 	list_for_each_entry_safe(vcpu, vnext, &vc->runnable_threads,
 				 arch.run_list) {
 		if (vcpu->arch.ret != RESUME_GUEST) {
@@ -903,6 +942,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 			spin_lock(&vc->lock);
 			continue;
 		}
+		vc->runner = vcpu;
 		n_ceded = 0;
 		list_for_each_entry(v, &vc->runnable_threads, arch.run_list)
 			n_ceded += v->arch.ceded;
@@ -922,6 +962,7 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 				wake_up(&v->arch.cpu_run);
 			}
 		}
+		vc->runner = NULL;
 	}
 
 	if (signal_pending(current)) {
-- 
1.7.7.3


* Re: [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust
  2011-12-20 10:22 ` [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust Paul Mackerras
@ 2012-01-16 13:04   ` Alexander Graf
  2012-01-17  5:56     ` Paul Mackerras
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Graf @ 2012-01-16 13:04 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc


On 20.12.2011, at 11:22, Paul Mackerras wrote:

> The PAPR API allows three sorts of per-virtual-processor areas to be
> registered (VPA, SLB shadow buffer, and dispatch trace log), and
> furthermore, these can be registered and unregistered for another
> virtual CPU.  Currently we just update the vcpu fields pointing to
> these areas at the time of registration or unregistration.  If this
> is done on another vcpu, there is the possibility that the target vcpu
> is using those fields at the time and could end up using a bogus
> pointer and corrupting memory.
> 
> This fixes the race by making the target cpu itself do the update, so
> we can be sure that the update happens at a time when the fields aren't
> being used.  These are updated from a set of 'next_*' fields, which
> are protected by a spinlock.  (We could have just taken the spinlock
> when using the vpa, slb_shadow or dtl fields, but that would mean
> taking the spinlock on every guest entry and exit.)
> 
> The code in do_h_register_vpa now takes the spinlock and updates the
> 'next_*' fields.  There is also a set of '*_pending' flags to indicate
> that an update is pending.
> 
> This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
> which is what the rest of the kernel uses.
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/include/asm/kvm_host.h |   15 +++-
> arch/powerpc/kvm/book3s_hv.c        |  167 +++++++++++++++++++++++++----------
> 2 files changed, 131 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 1cb6e52..b1126c1 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -82,7 +82,7 @@ struct kvm_vcpu;
> 
> struct lppaca;
> struct slb_shadow;
> -struct dtl;
> +struct dtl_entry;
> 
> struct kvm_vm_stat {
> 	u32 remote_tlb_flush;
> @@ -449,9 +449,18 @@ struct kvm_vcpu_arch {
> 	u32 last_inst;
> 
> 	struct lppaca *vpa;
> +	struct lppaca *next_vpa;
> 	struct slb_shadow *slb_shadow;
> -	struct dtl *dtl;
> -	struct dtl *dtl_end;
> +	struct slb_shadow *next_slb_shadow;
> +	struct dtl_entry *dtl;
> +	struct dtl_entry *dtl_end;
> +	struct dtl_entry *dtl_ptr;
> +	struct dtl_entry *next_dtl;
> +	struct dtl_entry *next_dtl_end;
> +	u8 vpa_pending;
> +	u8 slb_shadow_pending;
> +	u8 dtl_pending;
> +	spinlock_t vpa_update_lock;
> 
> 	wait_queue_head_t *wqp;
> 	struct kvmppc_vcore *vcore;
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index c11d960..6f6e88d 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -140,7 +140,7 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
> {
> 	struct kvm *kvm = vcpu->kvm;
> 	unsigned long len, nb;
> -	void *va;
> +	void *va, *free_va, *tvpa, *dtl, *ss;
> 	struct kvm_vcpu *tvcpu;
> 	int err = H_PARAMETER;
> 
> @@ -152,6 +152,8 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
> 	flags &= 7;
> 	if (flags == 0 || flags == 4)

This could probably use a new variable name. Also, what do 0 and 4 mean? Constant defines would be nice here.

> 		return H_PARAMETER;
> +	free_va = va = NULL;
> +	len = 0;
> 	if (flags < 4) {
> 		if (vpa & 0x7f)
> 			return H_PARAMETER;
> @@ -165,65 +167,122 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,

[pasted from real source]
>                 va = kvmppc_pin_guest_page(kvm, vpa, &nb);

Here you're pinning the page, setting va to that temporarily available address.

[...]

> 			len = *(unsigned short *)(va + 4);

This is a condition on (flags <= 1). We bail out on flags == 0 a few lines up. Just move this whole thing into the respective function handlers.

> 		else
> 			len = *(unsigned int *)(va + 4);

va + 4 isn't really descriptive. Is this a defined struct? Why not actually define one which you can just read data from? Or at least make this a define too. Reading random numbers in code is barely readable.

> +		free_va = va;

Now free_va is the temporarily available address.

> 		if (len > nb)
> 			goto out_unpin;
> -		switch (flags) {
> -		case 1:		/* register VPA */
> -			if (len < 640)
> -				goto out_unpin;
> -			if (tvcpu->arch.vpa)
> -				kvmppc_unpin_guest_page(kvm, vcpu->arch.vpa);
> -			tvcpu->arch.vpa = va;
> -			init_vpa(vcpu, va);
> -			break;
> -		case 2:		/* register DTL */
> -			if (len < 48)
> -				goto out_unpin;
> -			len -= len % 48;
> -			if (tvcpu->arch.dtl)
> -				kvmppc_unpin_guest_page(kvm, vcpu->arch.dtl);
> -			tvcpu->arch.dtl = va;
> -			tvcpu->arch.dtl_end = va + len;
> +	}
> +
> +	spin_lock(&tvcpu->arch.vpa_update_lock);
> +
> +	switch (flags) {
> +	case 1:		/* register VPA */

Yeah, these could also use defines :)

> +		if (len < 640)
> 			break;
> -		case 3:		/* register SLB shadow buffer */
> -			if (len < 16)
> -				goto out_unpin;
> -			if (tvcpu->arch.slb_shadow)
> -				kvmppc_unpin_guest_page(kvm, vcpu->arch.slb_shadow);
> -			tvcpu->arch.slb_shadow = va;
> +		free_va = tvcpu->arch.next_vpa;
> +		tvcpu->arch.next_vpa = va;

Now you're setting next_vpa to this temporarily available address? But next_vpa will be used after va is getting free'd, no? Or is that why you have free_va?

Wouldn't it be easier to just map it every time we actually use it and only shove the GPA around? We could basically save ourselves a lot of the logic here.


Alex


* Re: [KVM PATCH 2/2] KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log
  2011-12-20 10:37 ` [KVM PATCH 2/2] KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log Paul Mackerras
@ 2012-01-16 13:11   ` Alexander Graf
  0 siblings, 0 replies; 9+ messages in thread
From: Alexander Graf @ 2012-01-16 13:11 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc


On 20.12.2011, at 11:37, Paul Mackerras wrote:

> This adds code to measure "stolen" time per virtual core in units of
> timebase ticks, and to report the stolen time to the guest using the
> dispatch trace log (DTL).  The guest can register an area of memory
> for the DTL for a given vcpu.  The DTL is a ring buffer where KVM
> fills in one entry every time it enters the guest for that vcpu.
> 
> Stolen time is measured as time when the virtual core is not running,
> either because the vcore is not runnable (e.g. some of its vcpus are
> executing elsewhere in the kernel or in userspace), or when the vcpu
> thread that is running the vcore is preempted.  This includes time
> when all the vcpus are idle (i.e. have executed the H_CEDE hypercall),
> which is OK because the guest accounts stolen time while idle as idle
> time.
> 
> Each vcpu keeps a record of how much stolen time has been reported to
> the guest for that vcpu so far.  When we are about to enter the guest,
> we create a new DTL entry (if the guest vcpu has a DTL) and report the
> difference between total stolen time for the vcore and stolen time
> reported so far for the vcpu as the "enqueue to dispatch" time in the
> DTL entry.
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>

This patch makes sense and looks good to me :)


Alex


* Re: [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust
  2012-01-16 13:04   ` Alexander Graf
@ 2012-01-17  5:56     ` Paul Mackerras
  2012-01-17  9:27       ` Alexander Graf
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Mackerras @ 2012-01-17  5:56 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc

On Mon, Jan 16, 2012 at 02:04:29PM +0100, Alexander Graf wrote:
> 
> On 20.12.2011, at 11:22, Paul Mackerras wrote:

> > @@ -152,6 +152,8 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
> > 	flags &= 7;
> > 	if (flags == 0 || flags == 4)
> 
> This could probably use a new variable name. Also, what do 0 and 4 mean? Constant defines would be nice here.

Those constants are defined in PAPR as being a subfunction code
indicating what sort of area and whether it is to be registered or
unregistered.  I'll make up some names for them.
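
Something along these lines, perhaps (hypothetical names; the values
are PAPR's subfunction codes):

	#define H_VPA_REG_VPA	1	/* register VPA */
	#define H_VPA_REG_DTL	2	/* register DTL */
	#define H_VPA_REG_SLB	3	/* register SLB shadow buffer */
	#define H_VPA_DEREG_VPA	5	/* deregister VPA */
	#define H_VPA_DEREG_DTL	6	/* deregister DTL */
	#define H_VPA_DEREG_SLB	7	/* deregister SLB shadow buffer */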

> [pasted from real source]
> >                 va = kvmppc_pin_guest_page(kvm, vpa, &nb);
> 
> Here you're pinning the page, setting va to that temporarily available address.

Well, it's not just temporarily available, it's available until we
unpin it, since we increment the page count, which inhibits migration.

> > 			len = *(unsigned int *)(va + 4);
> 
> va + 4 isn't really descriptive. Is this a defined struct? Why not actually define one which you can just read data from? Or at least make this a define too. Reading random numbers in code is barely readable.

It's not really a struct, at least not one that is used for anything
else.  PAPR defines that the length of the buffer has to be placed in
the second 32-bit word at registration time.

> 
> > +		free_va = va;
> 
> Now free_va is the temporarily available address.
...
> > +		free_va = tvcpu->arch.next_vpa;
> > +		tvcpu->arch.next_vpa = va;
> 
> Now you're setting next_vpa to this temporarily available address? But next_vpa will be used after va is getting free'd, no? Or is that why you have free_va?

Yes; here we are freeing any previously-set value of next_vpa.  The
idea of free_va is that it is initially set to va so that we correctly
unpin va if any error occurs.  But if there is no error, va gets put
into next_vpa and we free anything that was previously in next_vpa
instead.
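
Condensed, the free_va lifecycle in the patch is:

	free_va = va = NULL;
	...
	va = kvmppc_pin_guest_page(kvm, vpa, &nb);
	free_va = va;		/* error paths unpin the freshly pinned page */
	...
	/* success: keep va, unpin whatever was queued in next_vpa before */
	free_va = tvcpu->arch.next_vpa;
	tvcpu->arch.next_vpa = va;
	...
 out_unpin:
	if (free_va)
		kvmppc_unpin_guest_page(kvm, free_va);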

> 
> Wouldn't it be easier to just map it every time we actually use it and only shove the GPA around? We could basically save ourselves a lot of the logic here.

There are fields in the VPA that we really want to be able to access
from real mode, for instance the fields that indicate whether we need
to save the FPR and/or VR values.  As far as the DTL is concerned, we
could in fact use copy_to_user to access it, so it doesn't strictly
need to be pinned.  We don't currently use the slb_shadow buffer, but
if we did we would need to access it from real mode, since we would be
reading it in order to set up guest SLB entries.

The other thing is that the VPA registration/unregistration is only
done a few times in the life of the guest, whereas we use the VPAs
constantly while the guest is running.  So it is more efficient to do
more of the work at registration time to make it quicker to access the
VPAs.

I'll send revised patches.  There's a small change I want to make to
patch 2 to avoid putting a very large stolen time value in the first
entry that gets put in after the DTL is registered, which can happen
currently if the DTL gets registered some time after the guest started
running.

Paul.


* Re: [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust
  2012-01-17  5:56     ` Paul Mackerras
@ 2012-01-17  9:27       ` Alexander Graf
  2012-01-17 11:31         ` Paul Mackerras
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Graf @ 2012-01-17  9:27 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc



On 17.01.2012, at 06:56, Paul Mackerras <paulus@samba.org> wrote:

> On Mon, Jan 16, 2012 at 02:04:29PM +0100, Alexander Graf wrote:
>> 
>> On 20.12.2011, at 11:22, Paul Mackerras wrote:
> 
>>> @@ -152,6 +152,8 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
>>>    flags &= 7;
>>>    if (flags == 0 || flags == 4)
>> 
>> This could probably use a new variable name. Also, what do 0 and 4 mean? Constant defines would be nice here.
> 
> Those constants are defined in PAPR as being a subfunction code
> indicating what sort of area and whether it is to be registered or
> unregistered.  I'll make up some names for them.
> 
>> [pasted from real source]
>>>                va = kvmppc_pin_guest_page(kvm, vpa, &nb);
>> 
>> Here you're pinning the page, setting va to that temporarily available address.
> 
> Well, it's not just temporarily available, it's available until we
> unpin it, since we increment the page count, which inhibits migration.
> 
>>>            len = *(unsigned int *)(va + 4);
>> 
>> va + 4 isn't really descriptive. Is this a defined struct? Why not actually define one which you can just read data from? Or at least make this a define too. Reading random numbers in code is barely readable.
> 
> It's not really a struct, at least not one that is used for anything
> else.  PAPR defines that the length of the buffer has to be placed in
> the second 32-bit word at registration time.
> 
>> 
>>> +        free_va = va;
>> 
>> Now free_va is the temporarily available address.
> ...
>>> +        free_va = tvcpu->arch.next_vpa;
>>> +        tvcpu->arch.next_vpa = va;
>> 
>> Now you're setting next_vpa to this temporarily available address? But next_vpa will be used after va is getting free'd, no? Or is that why you have free_va?
> 
> Yes; here we are freeing any previously-set value of next_vpa.  The
> idea of free_va is that it is initially set to va so that we correctly
> unpin va if any error occurs.  But if there is no error, va gets put
> into next_vpa and we free anything that was previously in next_vpa
> instead.
> 
>> 
>> Wouldn't it be easier to just map it every time we actually use it and only shove the GPA around? We could basically save ourselves a lot of the logic here.
> 
> There are fields in the VPA that we really want to be able to access
> from real mode, for instance the fields that indicate whether we need
> to save the FPR and/or VR values.  As far as the DTL is concerned, we
> could in fact use copy_to_user to access it, so it doesn't strictly
> need to be pinned.  We don't currently use the slb_shadow buffer, but
> if we did we would need to access it from real mode, since we would be
> reading it in order to set up guest SLB entries.
> 
> The other thing is that the VPA registration/unregistration is only
> done a few times in the life of the guest, whereas we use the VPAs
> constantly while the guest is running.  So it is more efficient to do
> more of the work at registration time to make it quicker to access the
> VPAs.

The thing I was getting at was not the map during the lifetime, but the
map during registration. Currently we have:

1) Set VPA to x
2) Assign feature y to VPA
3) Use VPA

1 and 2 are the slow path, 3 occurs more frequently. So we want 3 to be
fast. 1 and 2 don't matter that much wrt performance.

You are currently mapping the VPA at step 1, which gets you into this
map/unmap mess trying to free the previous mapping. If you moved the
map to step 2 and only stored the GPA at step 1, all map+unmap
operations except for final unmaps would be in one spot, so you
wouldn't need to construct this big complex state machine.
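
Roughly like this (hypothetical field name vpa_gpa, just to illustrate
the shape):

	/* step 1: H_REGISTER_VPA only validates and records the GPA */
	tvcpu->arch.vpa_gpa = vpa;

	/* step 2: the target vcpu pins and maps it when activating the area */
	vcpu->arch.vpa = kvmppc_pin_guest_page(kvm, vcpu->arch.vpa_gpa, &nb);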

I hope that makes it more clear :)

Alex

> 
> I'll send revised patches.  There's a small change I want to make to
> patch 2 to avoid putting a very large stolen time value in the first
> entry that gets put in after the DTL is registered, which can happen
> currently if the DTL gets registered some time after the guest started
> running.
> 
> Paul.


* Re: [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust
  2012-01-17  9:27       ` Alexander Graf
@ 2012-01-17 11:31         ` Paul Mackerras
  2012-01-17 12:19           ` Alexander Graf
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Mackerras @ 2012-01-17 11:31 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc

On Tue, Jan 17, 2012 at 10:27:26AM +0100, Alexander Graf wrote:

> The thing I was getting at was not the map during the lifetime, but
> the map during registration. Currently we have:
> 
> 1) Set VPA to x
> 2) Assign feature y to VPA
> 3) Use VPA
> 
> 1 and 2 are the slow path, 3 occurs more frequently. So we want 3 to
> be fast. 1 and 2 don't matter that much wrt performance.
> 
> You are currently mapping the VPA at /, which gets you into this
> map/unmap mess trying to free the previous mapping. If you moved the
> map to step 2 and only stored the GPA at step 1, all map+unmap
> operations except for final unmaps would be in one spot, so you
> wouldn't need to construct this big complex state machine.

That might simplify things - I'll try it and see.  The worry with
doing the map/pin at 2 is that if anything goes wrong we no longer
have the opportunity to return an error for the H_REGISTER_VPA call,
so I'll have to at least do some checking in 1, leading to possibly
more code overall.

Paul.


* Re: [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust
  2012-01-17 11:31         ` Paul Mackerras
@ 2012-01-17 12:19           ` Alexander Graf
  0 siblings, 0 replies; 9+ messages in thread
From: Alexander Graf @ 2012-01-17 12:19 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc


On 17.01.2012, at 12:31, Paul Mackerras wrote:

> On Tue, Jan 17, 2012 at 10:27:26AM +0100, Alexander Graf wrote:
> 
>> The thing I was getting at was not the map during the lifetime, but
>> the map during registration. Currently we have:
>> 
>> 1) Set VPA to x
>> 2) Assign feature y to VPA
>> 3) Use VPA
>> 
>> 1 and 2 are the slow path, 3 occurs more frequently. So we want 3 to
>> be fast. 1 and 2 don't matter that much wrt performance.
>> 
>> You are currently mapping the VPA at step 1, which gets you into this
>> map/unmap mess trying to free the previous mapping. If you moved the
>> map to step 2 and only stored the GPA at step 1, all map+unmap
>> operations except for final unmaps would be in one spot, so you
>> wouldn't need to construct this big complex state machine.
> 
> That might simplify things - I'll try it and see.  The worry with
> doing the map/pin at 2 is that if anything goes wrong we no longer
> have the opportunity to return an error for the H_REGISTER_VPA call,
> so I'll have to at least do some checking in 1, leading to possibly
> more code overall.

Well, then map and unmap it in step 1 and map it in step 2 again. We're
in the slow path so performance isn't critical. Readability and
maintainability however are :)


Alex


Thread overview: 9+ messages
2011-12-20 10:21 [RFC PATCH 0/2] KVM: PPC: Book3S HV: Report stolen time to guests Paul Mackerras
2011-12-20 10:22 ` [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust Paul Mackerras
2012-01-16 13:04   ` Alexander Graf
2012-01-17  5:56     ` Paul Mackerras
2012-01-17  9:27       ` Alexander Graf
2012-01-17 11:31         ` Paul Mackerras
2012-01-17 12:19           ` Alexander Graf
2011-12-20 10:37 ` [KVM PATCH 2/2] KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log Paul Mackerras
2012-01-16 13:11   ` Alexander Graf
