linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm
@ 2011-11-25  2:35 Liu Ping Fan
  2011-11-25  2:35 ` [PATCH 1/2] kvm: make vcpu life cycle separated from kvm instance Liu Ping Fan
                   ` (8 more replies)
  0 siblings, 9 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-25  2:35 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh

A series of patches from kvm, qemu to guest. These patches will finally enable vcpu destruction in kvm instance and let vcpu thread exit in qemu.
 
Currently, the vcpu online feature enables the dynamical creation of vcpu and vcpu thread, while the offline feature can not destruct the vcpu and let vcpu thread exit, it just halt in kvm. Because currently, the vcpu will only be destructed when kvm instance is destroyed. We can 
change vcpu as an refer of kvm instance, and then vcpu's destruction MUST and CAN come before kvm's destruction.

These patches use guest driver to notify the CPU_DEAD event to qemu, and later qemu asks kvm to release the dead vcpu and finally exit the 
thread. 
The usage is: 
	qemu$cpu_set n online
	qemu$cpu_set n zap   ------------ This will destroy the vcpu-n in kvm and let vcpu thread exit
     OR	
	qemu$cpu_set n offline 	--------- This will just block vcpu-n in kvm

Any comment and suggestion are welcome.


Patches include:
|-- guest
|   `-- 0001-virtio-add-a-pci-driver-to-notify-host-the-CPU_DEAD-.patch
|-- kvm
|   |-- 0001-kvm-make-vcpu-life-cycle-separated-from-kvm-instance.patch
|   `-- 0002-kvm-exit-to-userspace-with-reason-KVM_EXIT_VCPU_DEAD.patch
`-- qemu
    |-- 0001-Add-cpu_phyid_to_cpu-to-map-cpu-phyid-to-CPUState.patch
    |-- 0002-Add-cpu_free-to-support-arch-related-CPUState-releas.patch
    |-- 0003-Introduce-a-pci-device-cpustate-to-get-CPU_DEAD-even.patch
    |-- 0004-Release-vcpu-and-finally-exit-vcpu-thread-safely.patch
    `-- 0005-tmp-patches-for-linux-header-files.patch


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 1/2] kvm: make vcpu life cycle separated from kvm instance
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
@ 2011-11-25  2:35 ` Liu Ping Fan
  2011-11-27 10:36   ` Avi Kivity
  2011-11-25 17:54 ` [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Jan Kiszka
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-25  2:35 UTC (permalink / raw)
  To: kvm, qemu-devel
  Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh, Liu Ping Fan

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Currently, vcpu can be destructed only when kvm instance destroyed.
Change this to vcpu as a refer to kvm, and then vcpu MUST and CAN be
destroyed before kvm's destroy. Qemu will take advantage of this to
exit the vcpu thread if the thread is no longer in use by guest.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/x86/kvm/x86.c       |   28 ++++++++--------------------
 include/linux/kvm_host.h |    2 ++
 virt/kvm/kvm_main.c      |   31 +++++++++++++++++++++++++++++--
 3 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c38efd7..ea2315a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6560,27 +6560,16 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
+void kvm_arch_vcpu_zap(struct kref *ref)
 {
-	unsigned int i;
-	struct kvm_vcpu *vcpu;
-
-	/*
-	 * Unpin any mmu pages first.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_unload_vcpu_mmu(vcpu);
-	}
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_arch_vcpu_free(vcpu);
-
-	mutex_lock(&kvm->lock);
-	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-		kvm->vcpus[i] = NULL;
+	struct kvm_vcpu *vcpu = container_of(ref, struct kvm_vcpu, refcount);
+	struct kvm *kvm = vcpu->kvm;
 
-	atomic_set(&kvm->online_vcpus, 0);
-	mutex_unlock(&kvm->lock);
+	printk(KERN_INFO "%s, zap vcpu:0x%x\n", __func__, vcpu->vcpu_id);
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+	kvm_put_kvm(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
@@ -6594,7 +6583,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d526231..fe35078 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -113,6 +113,7 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
+	struct kref refcount;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
@@ -460,6 +461,7 @@ void kvm_arch_exit(void);
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
+void kvm_arch_vcpu_zap(struct kref *ref);
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..f166bc8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -580,6 +580,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_free_vm(kvm);
 	hardware_disable_all();
 	mmdrop(mm);
+	printk(KERN_INFO "%s finished\n", __func__);
 }
 
 void kvm_get_kvm(struct kvm *kvm)
@@ -1503,6 +1504,16 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 	mark_page_dirty_in_slot(kvm, memslot, gfn);
 }
 
+void kvm_vcpu_get(struct kvm_vcpu *vcpu)
+{
+	kref_get(&vcpu->refcount);
+}
+
+void kvm_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	kref_put(&vcpu->refcount, kvm_arch_vcpu_zap);
+}
+
 /*
  * The vCPU has executed a HLT instruction with in-kernel mode enabled.
  */
@@ -1623,8 +1634,13 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
+	struct kvm *kvm = vcpu->kvm;
 
-	kvm_put_kvm(vcpu->kvm);
+	filp->private_data = NULL;
+	mutex_lock(&kvm->lock);
+	atomic_sub(1, &kvm->online_vcpus);
+	mutex_unlock(&kvm->lock);
+	kvm_vcpu_put(vcpu);
 	return 0;
 }
 
@@ -1646,6 +1662,17 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
 	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
 }
 
+static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
+{
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_arch_vcpu_create(kvm, id);
+	if (IS_ERR(vcpu))
+		return vcpu;
+
+	kref_init(&vcpu->refcount);
+	return vcpu;
+}
+
 /*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
@@ -1654,7 +1681,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	int r;
 	struct kvm_vcpu *vcpu, *v;
 
-	vcpu = kvm_arch_vcpu_create(kvm, id);
+	vcpu = kvm_vcpu_create(kvm, id);
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
  2011-11-25  2:35 ` [PATCH 1/2] kvm: make vcpu life cycle separated from kvm instance Liu Ping Fan
@ 2011-11-25 17:54 ` Jan Kiszka
  2011-11-27  3:07   ` Liu ping fan
  2011-11-27  2:42 ` [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD Liu Ping Fan
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 78+ messages in thread
From: Jan Kiszka @ 2011-11-25 17:54 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, qemu-devel, linux-kernel, avi, aliguori, ryanh

[-- Attachment #1: Type: text/plain, Size: 2029 bytes --]

On 2011-11-25 00:35, Liu Ping Fan wrote:
> A series of patches from kvm, qemu to guest. These patches will finally enable vcpu destruction in kvm instance and let vcpu thread exit in qemu.
>  
> Currently, the vcpu online feature enables the dynamical creation of vcpu and vcpu thread, while the offline feature can not destruct the vcpu and let vcpu thread exit, it just halt in kvm. Because currently, the vcpu will only be destructed when kvm instance is destroyed. We can 
> change vcpu as an refer of kvm instance, and then vcpu's destruction MUST and CAN come before kvm's destruction.
> 
> These patches use guest driver to notify the CPU_DEAD event to qemu, and later qemu asks kvm to release the dead vcpu and finally exit the 
> thread. 
> The usage is: 
> 	qemu$cpu_set n online
> 	qemu$cpu_set n zap   ------------ This will destroy the vcpu-n in kvm and let vcpu thread exit
>      OR	
> 	qemu$cpu_set n offline 	--------- This will just block vcpu-n in kvm
> 
> Any comment and suggestion are welcome.

The cpu_set command will probably not make it to QEMU upstream
(device_add/delete is the way to go - IMHO). So I would refrain from
adding anything to qemu-kvm at this point anyway. Also, what would be
the advantage of 'zap' from user perspective?

> 
> 
> Patches include:
> |-- guest
> |   `-- 0001-virtio-add-a-pci-driver-to-notify-host-the-CPU_DEAD-.patch
> |-- kvm
> |   |-- 0001-kvm-make-vcpu-life-cycle-separated-from-kvm-instance.patch
> |   `-- 0002-kvm-exit-to-userspace-with-reason-KVM_EXIT_VCPU_DEAD.patch
> `-- qemu
>     |-- 0001-Add-cpu_phyid_to_cpu-to-map-cpu-phyid-to-CPUState.patch
>     |-- 0002-Add-cpu_free-to-support-arch-related-CPUState-releas.patch
>     |-- 0003-Introduce-a-pci-device-cpustate-to-get-CPU_DEAD-even.patch
>     |-- 0004-Release-vcpu-and-finally-exit-vcpu-thread-safely.patch
>     `-- 0005-tmp-patches-for-linux-header-files.patch
> 

I only found kvm patch 0001 so far. Something probably went wrong with
your postings.

Jan


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
  2011-11-25  2:35 ` [PATCH 1/2] kvm: make vcpu life cycle separated from kvm instance Liu Ping Fan
  2011-11-25 17:54 ` [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Jan Kiszka
@ 2011-11-27  2:42 ` Liu Ping Fan
  2011-11-27 10:36   ` Avi Kivity
  2011-11-27  2:45 ` [PATCH 1/5] QEMU Add cpu_phyid_to_cpu() to map cpu phyid to CPUState Liu Ping Fan
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-27  2:42 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

The vcpu can be safely released when
--1.guest tells us that the vcpu is not needed any longer.
--2.vcpu hits the last instruction _halt_

If both of the conditions are satisfied, kvm exits to userspace
with the reason vcpu dead. So the user thread can exit safely.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/x86/kvm/x86.c       |   16 ++++++++++++++++
 include/linux/kvm.h      |   11 +++++++++++
 include/linux/kvm_host.h |    1 +
 3 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ea2315a..7948eaf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5825,11 +5825,27 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
 		    !vcpu->arch.apf.halted)
 			r = vcpu_enter_guest(vcpu);
 		else {
+retry:
+			if  (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
+				/*1st check whether guest notify CPU_DEAD*/
+				if (vcpu->state == KVM_VCPU_STATE_DYING) {
+					vcpu->state = KVM_VCPU_STATE_DEAD;
+					vcpu->run->exit_reason = KVM_EXIT_VCPU_DEAD;
+					break;
+				}
+			}
 			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 			kvm_vcpu_block(vcpu);
 			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
 			if (kvm_check_request(KVM_REQ_UNHALT, vcpu))
 			{
+				switch (vcpu->state) {
+				case KVM_VCPU_STATE_DYING:
+					r = 1;
+					goto retry;
+				default:
+					break;
+				}
 				switch(vcpu->arch.mp_state) {
 				case KVM_MP_STATE_HALTED:
 					vcpu->arch.mp_state =
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index c3892fc..d5ff3f7 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -162,6 +162,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_INTERNAL_ERROR   17
 #define KVM_EXIT_OSI              18
 #define KVM_EXIT_PAPR_HCALL	  19
+#define KVM_EXIT_VCPU_DEAD              20
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -334,6 +335,12 @@ struct kvm_signal_mask {
 	__u8  sigset[0];
 };
 
+/*for KVM_VCPU_SET_STATE */
+struct kvm_vcpu_state {
+	int vcpu_id;
+	int state;
+};
+
 /* for KVM_TPR_ACCESS_REPORTING */
 struct kvm_tpr_access_ctl {
 	__u32 enabled;
@@ -354,6 +361,9 @@ struct kvm_vapic_addr {
 #define KVM_MP_STATE_HALTED            3
 #define KVM_MP_STATE_SIPI_RECEIVED     4
 
+#define KVM_VCPU_STATE_DYING 1
+#define KVM_VCPU_STATE_DEAD 2
+
 struct kvm_mp_state {
 	__u32 mp_state;
 };
@@ -762,6 +772,7 @@ struct kvm_clock_data {
 #define KVM_CREATE_SPAPR_TCE	  _IOW(KVMIO,  0xa8, struct kvm_create_spapr_tce)
 /* Available with KVM_CAP_RMA */
 #define KVM_ALLOCATE_RMA	  _IOR(KVMIO,  0xa9, struct kvm_allocate_rma)
+#define KVM_SETSTATE_VCPU     _IOW(KVMIO,   0xaa, struct kvm_vcpu_state)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fe35078..6fdf927 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -114,6 +114,7 @@ enum {
 struct kvm_vcpu {
 	struct kvm *kvm;
 	struct kref refcount;
+	int state;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 1/5] QEMU Add cpu_phyid_to_cpu() to map cpu phyid to CPUState
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
                   ` (2 preceding siblings ...)
  2011-11-27  2:42 ` [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD Liu Ping Fan
@ 2011-11-27  2:45 ` Liu Ping Fan
  2011-11-27  2:45 ` [PATCH 2/5] QEMU Add cpu_free() to support arch related CPUState release Liu Ping Fan
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-27  2:45 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

The guest has different cpu logic id from qemu, but they have the
same phyid. When cpu phyid is told by guest, we need to obtain
the corresponding CPUState.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 target-i386/cpu.h    |    2 ++
 target-i386/helper.c |   12 ++++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index abdeb40..251e63b 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -767,6 +767,7 @@ typedef struct CPUX86State {
 } CPUX86State;
 
 CPUX86State *cpu_x86_init(const char *cpu_model);
+CPUX86State *x86_phyid_to_cpu(int phy_id);
 int cpu_x86_exec(CPUX86State *s);
 void cpu_x86_close(CPUX86State *s);
 void x86_cpu_list (FILE *f, fprintf_function cpu_fprintf, const char *optarg);
@@ -1063,4 +1064,5 @@ void svm_check_intercept(CPUState *env1, uint32_t type);
 
 uint32_t cpu_cc_compute_all(CPUState *env1, int op);
 
+#define cpu_phyid_to_cpu  x86_phyid_to_cpu
 #endif /* CPU_I386_H */
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 5df40d4..e35a75e 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1263,6 +1263,18 @@ CPUX86State *cpu_x86_init(const char *cpu_model)
     return env;
 }
 
+CPUX86State *x86_phyid_to_cpu(int phy_id)
+{
+    CPUX86State *env = first_cpu;
+    while (env) {
+        if (env->cpuid_apic_id == phy_id) {
+            break;
+        }
+        env = env->next_cpu;
+    }
+    return env;
+}
+
 #if !defined(CONFIG_USER_ONLY)
 void do_cpu_init(CPUState *env)
 {
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 2/5] QEMU Add cpu_free() to support arch related CPUState release
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
                   ` (3 preceding siblings ...)
  2011-11-27  2:45 ` [PATCH 1/5] QEMU Add cpu_phyid_to_cpu() to map cpu phyid to CPUState Liu Ping Fan
@ 2011-11-27  2:45 ` Liu Ping Fan
  2011-11-27  2:45 ` [PATCH 3/5] QEMU Introduce a pci device "cpustate" to get CPU_DEAD event in guest Liu Ping Fan
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-27  2:45 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

When exiting from vcpu thread, the CPUState must be freed firstly.
And the handling process is an arch related.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 hw/apic.c            |    4 ++++
 target-i386/cpu.h    |    3 +++
 target-i386/helper.c |    8 ++++++++
 3 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 34fa1dd..6472045 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -511,6 +511,10 @@ static void apic_get_delivery_bitmask(uint32_t *deliver_bitmask,
         }
     }
 }
+void apic_free(DeviceState *d)
+{
+    qdev_free(d);
+}
 
 void apic_init_reset(DeviceState *d)
 {
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 251e63b..da07781 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -767,6 +767,7 @@ typedef struct CPUX86State {
 } CPUX86State;
 
 CPUX86State *cpu_x86_init(const char *cpu_model);
+void cpu_x86_free(CPUState *env);
 CPUX86State *x86_phyid_to_cpu(int phy_id);
 int cpu_x86_exec(CPUX86State *s);
 void cpu_x86_close(CPUX86State *s);
@@ -950,6 +951,7 @@ CPUState *pc_new_cpu(const char *cpu_model);
 #define cpu_list_id x86_cpu_list
 #define cpudef_setup	x86_cpudef_setup
 
+#define cpu_free cpu_x86_free
 #define CPU_SAVE_VERSION 12
 
 /* MMU modes definitions */
@@ -1064,5 +1066,6 @@ void svm_check_intercept(CPUState *env1, uint32_t type);
 
 uint32_t cpu_cc_compute_all(CPUState *env1, int op);
 
+void apic_free(DeviceState *d);
 #define cpu_phyid_to_cpu  x86_phyid_to_cpu
 #endif /* CPU_I386_H */
diff --git a/target-i386/helper.c b/target-i386/helper.c
index e35a75e..c9fadc3 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1263,6 +1263,14 @@ CPUX86State *cpu_x86_init(const char *cpu_model)
     return env;
 }
 
+void cpu_x86_free(CPUState *env)
+{
+    if (env->apic_state != NULL) {
+        apic_free(env->apic_state);
+    }
+    g_free(env);
+}
+
 CPUX86State *x86_phyid_to_cpu(int phy_id)
 {
     CPUX86State *env = first_cpu;
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 3/5] QEMU Introduce a pci device "cpustate" to get CPU_DEAD event in guest
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
                   ` (4 preceding siblings ...)
  2011-11-27  2:45 ` [PATCH 2/5] QEMU Add cpu_free() to support arch related CPUState release Liu Ping Fan
@ 2011-11-27  2:45 ` Liu Ping Fan
  2011-11-27 10:56   ` [Qemu-devel] " Gleb Natapov
  2011-11-27  2:45 ` [PATCH 4/5] QEMU Release vcpu and finally exit vcpu thread safely Liu Ping Fan
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-27  2:45 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

This device's driver in guest can get vcpu dead event and notify
qemu through the device.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 Makefile.target   |    1 +
 hw/pc_piix.c      |    1 +
 hw/pci.c          |   22 +++++++++++
 hw/pci.h          |    1 +
 hw/pci_cpustate.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 130 insertions(+), 0 deletions(-)
 create mode 100644 hw/pci_cpustate.c

diff --git a/Makefile.target b/Makefile.target
index 5607c6d..c822f9f 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -242,6 +242,7 @@ obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 obj-i386-y += testdev.o
 obj-i386-y += acpi.o acpi_piix4.o
 obj-i386-y += icc_bus.o
+obj-i386-y += pci_cpustate.o
 
 obj-i386-y += pcspk.o i8254.o
 obj-i386-$(CONFIG_KVM_PIT) += i8254-kvm.o
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 7c6f42d..090d7ba 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -199,6 +199,7 @@ static void pc_init1(MemoryRegion *system_memory,
             pci_nic_init_nofail(nd, "rtl8139", NULL);
     }
 
+    pc_cpustate_init(NULL);
     ide_drive_get(hd, MAX_IDE_BUS);
     if (pci_enabled) {
         PCIDevice *dev;
diff --git a/hw/pci.c b/hw/pci.c
index 5c87a62..74a8975 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1663,6 +1663,28 @@ PCIDevice *pci_nic_init(NICInfo *nd, const char *default_model,
     return pci_dev;
 }
 
+PCIDevice *pc_cpustate_init(const char *default_devaddr)
+{
+    const char *devaddr = default_devaddr;
+    PCIBus *bus;
+    int devfn;
+    PCIDevice *pci_dev;
+    DeviceState *dev;
+    bus = pci_get_bus_devfn(&devfn, devaddr);
+    if (!bus) {
+        error_report("Invalid PCI device address %s for device %s",
+                     devaddr, "pcimmstub");
+        return NULL;
+    }
+
+    pci_dev = pci_create(bus, devfn, "cpustate");
+    dev = &pci_dev->qdev;
+    if (qdev_init(dev) < 0) {
+        return NULL;
+    }
+    return pci_dev;
+}
+
 PCIDevice *pci_nic_init_nofail(NICInfo *nd, const char *default_model,
                                const char *default_devaddr)
 {
diff --git a/hw/pci.h b/hw/pci.h
index 071a044..bbaa013 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -279,6 +279,7 @@ PCIDevice *pci_nic_init(NICInfo *nd, const char *default_model,
                         const char *default_devaddr);
 PCIDevice *pci_nic_init_nofail(NICInfo *nd, const char *default_model,
                                const char *default_devaddr);
+PCIDevice *pc_cpustate_init(const char *default_devaddr);
 int pci_bus_num(PCIBus *s);
 void pci_for_each_device(PCIBus *bus, int bus_num, void (*fn)(PCIBus *bus, PCIDevice *d));
 PCIBus *pci_find_root_bus(int domain);
diff --git a/hw/pci_cpustate.c b/hw/pci_cpustate.c
new file mode 100644
index 0000000..fd31a1f
--- /dev/null
+++ b/hw/pci_cpustate.c
@@ -0,0 +1,105 @@
+/* pci_cpustate.c
+ * emulate a pci device to get guest os CPU_DEAD event
+ *
+ * Copyright IBM, Corp. 2011
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+#include <zlib.h>
+#include "hw.h"
+#include "pci.h"
+#include "qemu-timer.h"
+#include "net.h"
+#include "loader.h"
+#include "sysemu.h"
+#include "iov.h"
+
+#define PCI_DEVICE_ID_CPUSTATE  0x1010
+#define CPUSTATE_REGS_SIZE  0x1000
+
+typedef struct VcpuState VcpuState;
+
+struct VcpuState {
+    PCIDevice dev;
+    MemoryRegion mmio;
+    int mmio_io_addr;
+    int mmio_index;
+    uint32_t cpuid;
+    uint32_t cpu_state;
+};
+
+static const VMStateDescription vmstate_cpustate = {
+    .name = "cpustate",
+    .version_id = 1,
+    .minimum_version_id = 0,
+    .fields      = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static void
+cpustate_mmio_write(void *opaque, target_phys_addr_t addr, uint64_t val,
+                 unsigned size)
+{
+}
+
+static uint64_t
+cpustate_mmio_read(void *opaque, target_phys_addr_t addr, unsigned size)
+{
+    return 0;
+}
+
+static const MemoryRegionOps cpustate_ops = {
+    .read = cpustate_mmio_read,
+    .write = cpustate_mmio_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static int pci_cpustate_init(PCIDevice *dev)
+{
+    uint8_t *pci_cfg = dev->config;
+    VcpuState *s = DO_UPCAST(VcpuState, dev, dev);
+    memory_region_init_io(&s->mmio, &cpustate_ops, s, "cpustate",
+                        CPUSTATE_REGS_SIZE);
+    pci_cfg[PCI_INTERRUPT_PIN] = 1;
+    /* I/O handler for memory-mapped I/O */
+    pci_register_bar(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,  &s->mmio);
+    return 0;
+}
+
+static int pci_cpustate_exit(PCIDevice *dev)
+{
+    return 0;
+}
+
+static PCIDeviceInfo cpustate_info = {
+    .qdev.name  = "cpustate",
+    .qdev.size  = sizeof(VcpuState),
+    .qdev.vmsd  = &vmstate_cpustate,
+    .init       = pci_cpustate_init,
+    .exit       = pci_cpustate_exit,
+    .vendor_id  = PCI_VENDOR_ID_IBM,
+    .device_id  = PCI_DEVICE_ID_CPUSTATE,
+    .revision   = 0x10,
+    .class_id   = PCI_CLASS_SYSTEM_OTHER,
+    .qdev.props = (Property[]) {
+        DEFINE_PROP_END_OF_LIST(),
+    }
+};
+
+static void cpustate_register_devices(void)
+{
+    pci_qdev_register(&cpustate_info);
+}
+device_init(cpustate_register_devices)
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 4/5] QEMU Release vcpu and finally exit vcpu thread safely
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
                   ` (5 preceding siblings ...)
  2011-11-27  2:45 ` [PATCH 3/5] QEMU Introduce a pci device "cpustate" to get CPU_DEAD event in guest Liu Ping Fan
@ 2011-11-27  2:45 ` Liu Ping Fan
  2011-11-27  2:45 ` [PATCH 5/5] QEMU tmp patches for linux-header files Liu Ping Fan
  2011-11-27  2:47 ` [PATCH] virtio: add a pci driver to notify host the CPU_DEAD event Liu Ping Fan
  8 siblings, 0 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-27  2:45 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

When guest driver tell us that the vcpu is no longer needed,
qemu can release the vcpu and finally exit vcpu thread

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 cpu-defs.h        |    5 +++++
 cpus.c            |   21 +++++++++++++++++++++
 hmp-commands.hx   |    2 +-
 hw/acpi_piix4.c   |   19 ++++++++++++++++---
 hw/pci_cpustate.c |   22 ++++++++++++++++++++++
 kvm-all.c         |   11 ++++++++++-
 monitor.c         |   12 +++++++-----
 7 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/cpu-defs.h b/cpu-defs.h
index db48a7a..cb69a07 100644
--- a/cpu-defs.h
+++ b/cpu-defs.h
@@ -153,6 +153,10 @@ typedef struct CPUWatchpoint {
     QTAILQ_ENTRY(CPUWatchpoint) entry;
 } CPUWatchpoint;
 
+#define CPU_STATE_RUNNING 0
+#define CPU_STATE_ZAPREQ 1
+#define CPU_STATE_ZAPPED 2
+
 #define CPU_TEMP_BUF_NLONGS 128
 #define CPU_COMMON                                                      \
     struct TranslationBlock *current_tb; /* currently executing TB  */  \
@@ -210,6 +214,7 @@ typedef struct CPUWatchpoint {
     uint32_t created;                                                   \
     uint32_t stop;   /* Stop request */                                 \
     uint32_t stopped; /* Artificially stopped */                        \
+    uint32_t state; /*state indicator*/                             \
     struct QemuThread *thread;                                          \
     struct QemuCond *halt_cond;                                         \
     int thread_kicked;                                                  \
diff --git a/cpus.c b/cpus.c
index c996ac5..e479476 100644
--- a/cpus.c
+++ b/cpus.c
@@ -33,6 +33,7 @@
 
 #include "qemu-thread.h"
 #include "cpus.h"
+#include "cpu.h"
 
 #ifndef _WIN32
 #include "compatfd.h"
@@ -778,6 +779,7 @@ static void qemu_kvm_wait_io_event(CPUState *env)
 static void *qemu_kvm_cpu_thread_fn(void *arg)
 {
     CPUState *env = arg;
+    CPUState *prev = NULL;
     int r;
 
     qemu_mutex_lock(&qemu_global_mutex);
@@ -808,10 +810,29 @@ static void *qemu_kvm_cpu_thread_fn(void *arg)
                 cpu_handle_guest_debug(env);
             }
         }
+        /*1,try to zap; 2, can safe to destroy*/
+        if (env->state == CPU_STATE_ZAPPED) {
+            goto zapout;
+        }
         qemu_kvm_wait_io_event(env);
     }
 
     return NULL;
+zapout:
+    prev = first_cpu;
+    if (prev == env) {
+        first_cpu = env->next_cpu;
+    } else {
+        while (prev != NULL) {
+            if (prev->next_cpu == env) {
+                break;
+            }
+            prev = prev->next_cpu;
+        }
+        prev->next_cpu = env->next_cpu;
+    }
+    cpu_free(env);
+    return NULL;
 }
 
 static void *qemu_tcg_cpu_thread_fn(void *arg)
diff --git a/hmp-commands.hx b/hmp-commands.hx
index ed5c9b9..b642a34 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1218,7 +1218,7 @@ ETEXI
     {
         .name       = "cpu_set",
         .args_type  = "cpu:i,state:s",
-        .params     = "cpu [online|offline]",
+        .params     = "cpu [online|offline|zap]",
         .help       = "change cpu state",
         .mhandler.cmd  = do_cpu_set_nr,
     },
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index f585226..1f3ed06 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -605,10 +605,23 @@ void qemu_system_cpu_hot_add(int cpu, int state)
         env->cpuid_apic_id = cpu;
     }
 
-    if (state)
-        enable_processor(s, cpu);
-    else
+    switch (state) {
+    /*zap vcpu*/
+    case 0:
+        env = qemu_get_cpu(cpu);
+        /*1 means try to zap*/
+        env->state = CPU_STATE_ZAPREQ;
+        disable_processor(s, cpu);
+        break;
+    /*offline vcpu*/
+    case 1:
         disable_processor(s, cpu);
+        break;
+    /*onine vcpu*/
+    case 2:
+        enable_processor(s, cpu);
+        break;
+    }
 
     pm_update_sci(s);
 }
diff --git a/hw/pci_cpustate.c b/hw/pci_cpustate.c
index fd31a1f..18402cf 100644
--- a/hw/pci_cpustate.c
+++ b/hw/pci_cpustate.c
@@ -24,6 +24,8 @@
 #include "loader.h"
 #include "sysemu.h"
 #include "iov.h"
+#include <linux/kvm.h>
+#include "kvm.h"
 
 #define PCI_DEVICE_ID_CPUSTATE  0x1010
 #define CPUSTATE_REGS_SIZE  0x1000
@@ -52,6 +54,26 @@ static void
 cpustate_mmio_write(void *opaque, target_phys_addr_t addr, uint64_t val,
                  unsigned size)
 {
+    CPUState *env;
+    int ret;
+    struct kvm_vcpu_state state;
+    switch (addr) {
+    /*apic id*/
+    case 0:
+        env = cpu_phyid_to_cpu(val);
+        if (env != NULL) {
+            if (env->state == CPU_STATE_ZAPREQ) {
+                state.vcpu_id = env->cpu_index;
+                state.state = 1;
+                ret = kvm_vm_ioctl(env->kvm_state, KVM_SETSTATE_VCPU, &state);
+            }
+        }
+        break;
+    case 4:
+        break;
+    default:
+        break;
+    }
 }
 
 static uint64_t
diff --git a/kvm-all.c b/kvm-all.c
index 8dd354e..b295262 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -64,6 +64,7 @@ struct KVMState
     int vmfd;
     int coalesced_mmio;
     struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
+    long mmap_size;
     int broken_set_mem_region;
     int migration_log;
     int vcpu_events;
@@ -228,7 +229,7 @@ int kvm_init_vcpu(CPUState *env)
         DPRINTF("KVM_GET_VCPU_MMAP_SIZE failed\n");
         goto err;
     }
-
+    env->kvm_state->mmap_size = mmap_size;
     env->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                         env->kvm_fd, 0);
     if (env->kvm_run == MAP_FAILED) {
@@ -1026,6 +1027,13 @@ int kvm_cpu_exec(CPUState *env)
         case KVM_EXIT_INTERNAL_ERROR:
             ret = kvm_handle_internal_error(env, run);
             break;
+        case KVM_EXIT_VCPU_DEAD:
+            ret = munmap(env->kvm_run, env->kvm_state->mmap_size);
+            ret = close(env->kvm_fd);
+            env->state = CPU_STATE_ZAPPED;
+            qemu_mutex_unlock_iothread();
+            goto out;
+            break;
         default:
             DPRINTF("kvm_arch_handle_exit\n");
             ret = kvm_arch_handle_exit(env, run);
@@ -1033,6 +1041,7 @@ int kvm_cpu_exec(CPUState *env)
         }
     } while (ret == 0);
 
+out:
     if (ret < 0) {
         cpu_dump_state(env, stderr, fprintf, CPU_DUMP_CODE);
         vm_stop(VMSTOP_PANIC);
diff --git a/monitor.c b/monitor.c
index cb485bf..51c8c52 100644
--- a/monitor.c
+++ b/monitor.c
@@ -971,11 +971,13 @@ static void do_cpu_set_nr(Monitor *mon, const QDict *qdict)
     status = qdict_get_str(qdict, "state");
     value = qdict_get_int(qdict, "cpu");
 
-    if (!strcmp(status, "online"))
-       state = 1;
-    else if (!strcmp(status, "offline"))
-       state = 0;
-    else {
+    if (!strcmp(status, "online")) {
+        state = 2;
+    } else if (!strcmp(status, "offline")) {
+        state = 1;
+    } else if (!strcmp(status, "zap")) {
+        state = 0;
+    } else {
         monitor_printf(mon, "invalid status: %s\n", status);
         return;
     }
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 5/5] QEMU tmp patches for linux-header files
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
                   ` (6 preceding siblings ...)
  2011-11-27  2:45 ` [PATCH 4/5] QEMU Release vcpu and finally exit vcpu thread safely Liu Ping Fan
@ 2011-11-27  2:45 ` Liu Ping Fan
  2011-11-27  2:47 ` [PATCH] virtio: add a pci driver to notify host the CPU_DEAD event Liu Ping Fan
  8 siblings, 0 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-27  2:45 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Temporary patch for qemu to compile. Normally the headers should be
copied from kernel.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 kvm/include/linux/kvm.h   |    9 ++++++++-
 linux-headers/linux/kvm.h |    9 +++++++++
 2 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/kvm/include/linux/kvm.h b/kvm/include/linux/kvm.h
index e46729e..a7fe019 100644
--- a/kvm/include/linux/kvm.h
+++ b/kvm/include/linux/kvm.h
@@ -162,6 +162,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_INTERNAL_ERROR   17
 #define KVM_EXIT_OSI              18
 
+#define KVM_EXIT_VCPU_DEAD              20
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
 #define KVM_INTERNAL_ERROR_SIMUL_EX 2
@@ -328,6 +329,12 @@ struct kvm_signal_mask {
 	__u8  sigset[0];
 };
 
+/*for KVM_VCPU_SET_STATE */
+struct kvm_vcpu_state {
+	int vcpu_id;
+	int state;
+};
+
 /* for KVM_TPR_ACCESS_REPORTING */
 struct kvm_tpr_access_ctl {
 	__u32 enabled;
@@ -726,7 +733,7 @@ struct kvm_clock_data {
 /* Available with KVM_CAP_XCRS */
 #define KVM_GET_XCRS		  _IOR(KVMIO,  0xa6, struct kvm_xcrs)
 #define KVM_SET_XCRS		  _IOW(KVMIO,  0xa7, struct kvm_xcrs)
-
+#define KVM_SETSTATE_VCPU     _IOW(KVMIO,   0xaa, struct kvm_vcpu_state)
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
 struct kvm_assigned_pci_dev {
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index fc63b73..4422456 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -161,6 +161,8 @@ struct kvm_pit_config {
 #define KVM_EXIT_NMI              16
 #define KVM_EXIT_INTERNAL_ERROR   17
 #define KVM_EXIT_OSI              18
+#define KVM_EXIT_VCPU_DEAD              20
+
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -328,6 +330,12 @@ struct kvm_signal_mask {
 	__u8  sigset[0];
 };
 
+/*for KVM_VCPU_SET_STATE */
+struct kvm_vcpu_state {
+	int vcpu_id;
+	int state;
+};
+
 /* for KVM_TPR_ACCESS_REPORTING */
 struct kvm_tpr_access_ctl {
 	__u32 enabled;
@@ -747,6 +755,7 @@ struct kvm_clock_data {
 #define KVM_GET_XCRS		  _IOR(KVMIO,  0xa6, struct kvm_xcrs)
 #define KVM_SET_XCRS		  _IOW(KVMIO,  0xa7, struct kvm_xcrs)
 
+#define KVM_SETSTATE_VCPU     _IOW(KVMIO,   0xaa, struct kvm_vcpu_state)
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
 struct kvm_assigned_pci_dev {
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH] virtio: add a pci driver to notify host the CPU_DEAD event
  2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
                   ` (7 preceding siblings ...)
  2011-11-27  2:45 ` [PATCH 5/5] QEMU tmp patches for linux-header files Liu Ping Fan
@ 2011-11-27  2:47 ` Liu Ping Fan
  2011-11-27 11:10   ` [Qemu-devel] " Gleb Natapov
  8 siblings, 1 reply; 78+ messages in thread
From: Liu Ping Fan @ 2011-11-27  2:47 UTC (permalink / raw)
  To: kvm, qemu-devel; +Cc: linux-kernel, avi, aliguori, jan.kiszka, ryanh

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

A driver for qemu device "cpustate". This driver catch the guest
CPU_DEAD event, and notify host.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 drivers/virtio/Kconfig         |    6 ++
 drivers/virtio/Makefile        |    1 +
 drivers/virtio/cpustate_stub.c |  154 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 161 insertions(+), 0 deletions(-)
 create mode 100644 drivers/virtio/cpustate_stub.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 816ed08..96ad253 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -46,4 +46,10 @@ config VIRTIO_BALLOON
 
  	 If unsure, say N.
 
+ config VIRTIO_CPUSTATE
+ 	tristate "Driver to notify host the cpu dead event (EXPERIMENTAL)"
+ 	depends on EXPERIMENTAL
+ 	---help---
+ 	 This drivers provides support to notify host the cpu dead event.
+
 endmenu
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 5a4c63c..06a5ecf 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
 obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
+obj-$(VIRTIO_CPUSTATE) += cpustate_stub.o
diff --git a/drivers/virtio/cpustate_stub.c b/drivers/virtio/cpustate_stub.c
new file mode 100644
index 0000000..614da9d
--- /dev/null
+++ b/drivers/virtio/cpustate_stub.c
@@ -0,0 +1,154 @@
+/*
+ * PCI driver for qemu cpustate device. It notifies host the CPU_DEAD event
+ * in guest.
+ *
+ * Copyright IBM Corp. 2011
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/cpu.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/interrupt.h>
+#include <linux/spinlock.h>
+
+#define PCI_DEVICE_ID_CPUSTATE 0x1010
+
+struct cpustate_stub_regs {
+	unsigned int cpu_phyid;
+	unsigned int event;
+};
+
+struct cpustate_stub {
+	struct work_struct work;
+
+	unsigned int cpu_phyid;
+	unsigned int event;
+
+	struct cpustate_stub_regs __iomem *regs;
+};
+
+static struct cpustate_stub *agent;
+
+static void cpustate_work(struct work_struct *work)
+{
+	struct cpustate_stub *stub = container_of(work,
+					struct cpustate_stub, work);
+	printk(KERN_INFO "%s,cpu_phyid=0x%x, event=0x%x\n",
+		__func__, stub->cpu_phyid, stub->event);
+	stub->regs->cpu_phyid = stub->cpu_phyid;
+	stub->regs->event = stub->event;
+	barrier();
+}
+
+static int cpu_dead_callback(struct notifier_block *b, unsigned long action,
+				 void *data)
+{
+	unsigned long cpu = (unsigned long)data;
+	int cpu_phyid;
+	switch (action) {
+	case CPU_DEAD:{
+		cpu_phyid = per_cpu(x86_cpu_to_apicid, cpu);
+		agent->cpu_phyid = cpu_phyid;
+		agent->event = CPU_DEAD;
+		schedule_work(&agent->work);
+		break;
+	}
+	default:
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block __cpuinitdata cpu_dead_notifier = {
+	.notifier_call	= cpu_dead_callback,
+	.priority	= 10,
+};
+
+static int __devinit cpustate_probe(struct pci_dev *pci_dev,
+	const struct pci_device_id *id)
+{
+	int ret = 0;
+	agent = kzalloc(sizeof(struct cpustate_stub), GFP_KERNEL);
+	if (agent == NULL) {
+		ret = -1;
+		goto fail;
+	}
+	/* enable the device */
+	ret = pci_enable_device(pci_dev);
+	if (ret) {
+		printk(KERN_WARNING "%s, pci_enable_device fail,ret=0x%x\n",
+			__func__, ret);
+		goto fail;
+	}
+
+	ret = pci_request_regions(pci_dev, "cpustate");
+	if (ret) {
+		printk(KERN_WARNING "%s, pci_request_regions fail,ret=0x%x\n",
+			__func__, ret);
+		goto out_enable_device;
+	}
+
+	agent->regs = ioremap(pci_dev->resource[0].start,
+			pci_dev->resource[0].end - pci_dev->resource[0].start);
+	if (agent->regs == NULL) {
+		printk(KERN_WARNING "%s, ioremap fail\n", __func__);
+		goto out_req_regions;
+	}
+
+	INIT_WORK(&agent->work, cpustate_work);
+	register_cpu_notifier(&cpu_dead_notifier);
+	printk(KERN_INFO "%s, success\n", __func__);
+	return 0;
+
+out_req_regions:
+	pci_release_regions(pci_dev);
+out_enable_device:
+	pci_disable_device(pci_dev);
+	kfree(agent);
+	agent = NULL;
+fail:
+	printk(KERN_WARNING "%s fail\n", __func__);
+	return ret;
+}
+
+static void __devexit cpustate_remove(struct pci_dev *pci_dev)
+{
+	unregister_cpu_notifier(&cpu_dead_notifier);
+}
+
+/* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
+static DEFINE_PCI_DEVICE_TABLE(pci_cpustate_id_table) = {
+	{ PCI_VENDOR_ID_IBM, PCI_DEVICE_ID_CPUSTATE,
+		PCI_ANY_ID, PCI_ANY_ID,
+		PCI_CLASS_SYSTEM_OTHER, 0,
+		0 },
+	{ 0 },
+};
+MODULE_DEVICE_TABLE(pci, pci_cpustate_id_table);
+
+static struct pci_driver pci_cpustate_driver = {
+	.name		= "cpustate",
+	.id_table	= pci_cpustate_id_table,
+	.probe		= cpustate_probe,
+	.remove		= __devexit_p(cpustate_remove),
+};
+
+static int __init pci_cpustate_init(void)
+{
+	return pci_register_driver(&pci_cpustate_driver);
+}
+module_init(pci_cpustate_init);
+
+static void __exit pci_cpustate_exit(void)
+{
+	pci_unregister_driver(&pci_cpustate_driver);
+}
+module_exit(pci_cpustate_exit);
+MODULE_DESCRIPTION("cpustate");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("1");
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm
  2011-11-25 17:54 ` [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Jan Kiszka
@ 2011-11-27  3:07   ` Liu ping fan
  0 siblings, 0 replies; 78+ messages in thread
From: Liu ping fan @ 2011-11-27  3:07 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: kvm, qemu-devel, linux-kernel, avi, aliguori, ryanh

On Sat, Nov 26, 2011 at 1:54 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2011-11-25 00:35, Liu Ping Fan wrote:
>> A series of patches from kvm, qemu to guest. These patches will finally enable vcpu destruction in kvm instance and let vcpu thread exit in qemu.
>>
>> Currently, the vcpu online feature enables the dynamical creation of vcpu and vcpu thread, while the offline feature can not destruct the vcpu and let vcpu thread exit, it just halt in kvm. Because currently, the vcpu will only be destructed when kvm instance is destroyed. We can
>> change vcpu as an refer of kvm instance, and then vcpu's destruction MUST and CAN come before kvm's destruction.
>>
>> These patches use guest driver to notify the CPU_DEAD event to qemu, and later qemu asks kvm to release the dead vcpu and finally exit the
>> thread.
>> The usage is:
>>       qemu$cpu_set n online
>>       qemu$cpu_set n zap   ------------ This will destroy the vcpu-n in kvm and let vcpu thread exit
>>      OR
>>       qemu$cpu_set n offline  --------- This will just block vcpu-n in kvm
>>
>> Any comment and suggestion are welcome.
>
> The cpu_set command will probably not make it to QEMU upstream
> (device_add/delete is the way to go - IMHO). So I would refrain from
> adding anything to qemu-kvm at this point anyway.
>
Ok, I will see more details in device_add/delete.
> Also, what would be> the advantage of 'zap' from user perspective?
>
Suppose we increase one user's cpu's utilization by creating more
threads for them (of course, task_group is another choice), later we
decide to reclaim the utilization from this user, so we remove some of
the vcpu from this user's guest OS. But the related vcpu structure are
not released in kernel in current code, and wasted.
>From another viewpoint, if we can dynamically create the vcpu & vcpu
thread, we had better to have the ability to dynamically destroy them.

>>
>>
>> Patches include:
>> |-- guest
>> |   `-- 0001-virtio-add-a-pci-driver-to-notify-host-the-CPU_DEAD-.patch
>> |-- kvm
>> |   |-- 0001-kvm-make-vcpu-life-cycle-separated-from-kvm-instance.patch
>> |   `-- 0002-kvm-exit-to-userspace-with-reason-KVM_EXIT_VCPU_DEAD.patch
>> `-- qemu
>>     |-- 0001-Add-cpu_phyid_to_cpu-to-map-cpu-phyid-to-CPUState.patch
>>     |-- 0002-Add-cpu_free-to-support-arch-related-CPUState-releas.patch
>>     |-- 0003-Introduce-a-pci-device-cpustate-to-get-CPU_DEAD-even.patch
>>     |-- 0004-Release-vcpu-and-finally-exit-vcpu-thread-safely.patch
>>     `-- 0005-tmp-patches-for-linux-header-files.patch
>>
>
> I only found kvm patch 0001 so far. Something probably went wrong with
> your postings.
>
Sorry, I have resent them, pls re-fetch them .

Thanks and regards,
ping fan

> Jan
>
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] kvm: make vcpu life cycle separated from kvm instance
  2011-11-25  2:35 ` [PATCH 1/2] kvm: make vcpu life cycle separated from kvm instance Liu Ping Fan
@ 2011-11-27 10:36   ` Avi Kivity
  2011-12-02  6:26     ` [PATCH] " Liu Ping Fan
  0 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-11-27 10:36 UTC (permalink / raw)
  To: Liu Ping Fan
  Cc: kvm, qemu-devel, linux-kernel, aliguori, jan.kiszka, ryanh, Liu Ping Fan

On 11/25/2011 04:35 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu as a refer to kvm, and then vcpu MUST and CAN be
> destroyed before kvm's destroy. Qemu will take advantage of this to
> exit the vcpu thread if the thread is no longer in use by guest.
>
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/x86.c       |   28 ++++++++--------------------
>  include/linux/kvm_host.h |    2 ++
>  virt/kvm/kvm_main.c      |   31 +++++++++++++++++++++++++++++--
>  3 files changed, 39 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c38efd7..ea2315a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6560,27 +6560,16 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
>  	vcpu_put(vcpu);
>  }
>  
> -static void kvm_free_vcpus(struct kvm *kvm)
> +void kvm_arch_vcpu_zap(struct kref *ref)
>  {
> -	unsigned int i;
> -	struct kvm_vcpu *vcpu;
> -
> -	/*
> -	 * Unpin any mmu pages first.
> -	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> -		kvm_clear_async_pf_completion_queue(vcpu);
> -		kvm_unload_vcpu_mmu(vcpu);
> -	}
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> -		kvm_arch_vcpu_free(vcpu);
> -
> -	mutex_lock(&kvm->lock);
> -	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
> -		kvm->vcpus[i] = NULL;
> +	struct kvm_vcpu *vcpu = container_of(ref, struct kvm_vcpu, refcount);
> +	struct kvm *kvm = vcpu->kvm;
>  
> -	atomic_set(&kvm->online_vcpus, 0);
> -	mutex_unlock(&kvm->lock);
> +	printk(KERN_INFO "%s, zap vcpu:0x%x\n", __func__, vcpu->vcpu_id);
> +	kvm_clear_async_pf_completion_queue(vcpu);
> +	kvm_unload_vcpu_mmu(vcpu);
> +	kvm_arch_vcpu_free(vcpu);
> +	kvm_put_kvm(kvm);
>  }
>  
>  void kvm_arch_sync_events(struct kvm *kvm)
> @@ -6594,7 +6583,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_iommu_unmap_guest(kvm);
>  	kfree(kvm->arch.vpic);
>  	kfree(kvm->arch.vioapic);
> -	kvm_free_vcpus(kvm);
>  	if (kvm->arch.apic_access_page)
>  		put_page(kvm->arch.apic_access_page);
>  	if (kvm->arch.ept_identity_pagetable)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d526231..fe35078 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -113,6 +113,7 @@ enum {
>  
>  struct kvm_vcpu {
>  	struct kvm *kvm;
> +	struct kref refcount;
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	struct preempt_notifier preempt_notifier;
>  #endif
> @@ -460,6 +461,7 @@ void kvm_arch_exit(void);
>  int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
>  
> +void kvm_arch_vcpu_zap(struct kref *ref);
>  void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
>  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d9cfb78..f166bc8 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -580,6 +580,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	kvm_arch_free_vm(kvm);
>  	hardware_disable_all();
>  	mmdrop(mm);
> +	printk(KERN_INFO "%s finished\n", __func__);
>  }
>  
>  void kvm_get_kvm(struct kvm *kvm)
> @@ -1503,6 +1504,16 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
>  	mark_page_dirty_in_slot(kvm, memslot, gfn);
>  }
>  
> +void kvm_vcpu_get(struct kvm_vcpu *vcpu)
> +{
> +	kref_get(&vcpu->refcount);
> +}
> +
> +void kvm_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> +	kref_put(&vcpu->refcount, kvm_arch_vcpu_zap);
> +}
> +
>  /*
>   * The vCPU has executed a HLT instruction with in-kernel mode enabled.
>   */
> @@ -1623,8 +1634,13 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>  {
>  	struct kvm_vcpu *vcpu = filp->private_data;
> +	struct kvm *kvm = vcpu->kvm;
>  
> -	kvm_put_kvm(vcpu->kvm);
> +	filp->private_data = NULL;
> +	mutex_lock(&kvm->lock);
> +	atomic_sub(1, &kvm->online_vcpus);
> +	mutex_unlock(&kvm->lock);
> +	kvm_vcpu_put(vcpu);
>  	return 0;
>  }
>  
> @@ -1646,6 +1662,17 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
>  	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
>  }
>  
> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
> +{
> +	struct kvm_vcpu *vcpu;
> +	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	if (IS_ERR(vcpu))
> +		return vcpu;
> +
> +	kref_init(&vcpu->refcount);
> +	return vcpu;
> +}
> +
>  /*
>   * Creates some virtual cpus.  Good luck creating more than one.
>   */
> @@ -1654,7 +1681,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  	int r;
>  	struct kvm_vcpu *vcpu, *v;
>  
> -	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	vcpu = kvm_vcpu_create(kvm, id);
>  	if (IS_ERR(vcpu))
>  		return PTR_ERR(vcpu);
>  

I don't think this is sufficient to actually remove a vcpu from the vcpu
table.  It may be referred to from other vcpus in the local APIC code. 
Practically the only thing that can accomplish this without a
substantial effort is rcu.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD
  2011-11-27  2:42 ` [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD Liu Ping Fan
@ 2011-11-27 10:36   ` Avi Kivity
  2011-11-27 10:50     ` [Qemu-devel] " Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-11-27 10:36 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, qemu-devel, linux-kernel, aliguori, jan.kiszka, ryanh

On 11/27/2011 04:42 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> The vcpu can be safely released when
> --1.guest tells us that the vcpu is not needed any longer.
> --2.vcpu hits the last instruction _halt_
>
> If both of the conditions are satisfied, kvm exits to userspace
> with the reason vcpu dead. So the user thread can exit safely.
>
>

Seems to be completely unnecessary.  If you want to exit from the vcpu
thread, send it a signal.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD
  2011-11-27 10:36   ` Avi Kivity
@ 2011-11-27 10:50     ` Gleb Natapov
  2011-11-28  7:16       ` Liu ping fan
  0 siblings, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-11-27 10:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Liu Ping Fan, aliguori, kvm, qemu-devel, linux-kernel, ryanh, jan.kiszka

On Sun, Nov 27, 2011 at 12:36:55PM +0200, Avi Kivity wrote:
> On 11/27/2011 04:42 AM, Liu Ping Fan wrote:
> > From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >
> > The vcpu can be safely released when
> > --1.guest tells us that the vcpu is not needed any longer.
> > --2.vcpu hits the last instruction _halt_
> >
> > If both of the conditions are satisfied, kvm exits to userspace
> > with the reason vcpu dead. So the user thread can exit safely.
> >
> >
> 
> Seems to be completely unnecessary.  If you want to exit from the vcpu
> thread, send it a signal.
> 
Also if guest "tells us that the vcpu is not needed any longer" (via
ACPI I presume) and vcpu actually doing something critical instead of
sitting in 1:hlt; jmp 1b loop then it is guest's problem if it stops
working after vcpu destruction.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] QEMU Introduce a pci device "cpustate" to get CPU_DEAD event in guest
  2011-11-27  2:45 ` [PATCH 3/5] QEMU Introduce a pci device "cpustate" to get CPU_DEAD event in guest Liu Ping Fan
@ 2011-11-27 10:56   ` Gleb Natapov
  0 siblings, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-11-27 10:56 UTC (permalink / raw)
  To: Liu Ping Fan
  Cc: kvm, qemu-devel, aliguori, ryanh, jan.kiszka, linux-kernel, avi

On Sun, Nov 27, 2011 at 10:45:35AM +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> This device's driver in guest can get vcpu dead event and notify
> qemu through the device.
> 
This should be done through ACPI device. Look at how PCI hotplug works
in hw/acpi_piix4.c.

> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  Makefile.target   |    1 +
>  hw/pc_piix.c      |    1 +
>  hw/pci.c          |   22 +++++++++++
>  hw/pci.h          |    1 +
>  hw/pci_cpustate.c |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 130 insertions(+), 0 deletions(-)
>  create mode 100644 hw/pci_cpustate.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index 5607c6d..c822f9f 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -242,6 +242,7 @@ obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
>  obj-i386-y += testdev.o
>  obj-i386-y += acpi.o acpi_piix4.o
>  obj-i386-y += icc_bus.o
> +obj-i386-y += pci_cpustate.o
>  
>  obj-i386-y += pcspk.o i8254.o
>  obj-i386-$(CONFIG_KVM_PIT) += i8254-kvm.o
> diff --git a/hw/pc_piix.c b/hw/pc_piix.c
> index 7c6f42d..090d7ba 100644
> --- a/hw/pc_piix.c
> +++ b/hw/pc_piix.c
> @@ -199,6 +199,7 @@ static void pc_init1(MemoryRegion *system_memory,
>              pci_nic_init_nofail(nd, "rtl8139", NULL);
>      }
>  
> +    pc_cpustate_init(NULL);
>      ide_drive_get(hd, MAX_IDE_BUS);
>      if (pci_enabled) {
>          PCIDevice *dev;
> diff --git a/hw/pci.c b/hw/pci.c
> index 5c87a62..74a8975 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -1663,6 +1663,28 @@ PCIDevice *pci_nic_init(NICInfo *nd, const char *default_model,
>      return pci_dev;
>  }
>  
> +PCIDevice *pc_cpustate_init(const char *default_devaddr)
> +{
> +    const char *devaddr = default_devaddr;
> +    PCIBus *bus;
> +    int devfn;
> +    PCIDevice *pci_dev;
> +    DeviceState *dev;
> +    bus = pci_get_bus_devfn(&devfn, devaddr);
> +    if (!bus) {
> +        error_report("Invalid PCI device address %s for device %s",
> +                     devaddr, "pcimmstub");
> +        return NULL;
> +    }
> +
> +    pci_dev = pci_create(bus, devfn, "cpustate");
> +    dev = &pci_dev->qdev;
> +    if (qdev_init(dev) < 0) {
> +        return NULL;
> +    }
> +    return pci_dev;
> +}
> +
>  PCIDevice *pci_nic_init_nofail(NICInfo *nd, const char *default_model,
>                                 const char *default_devaddr)
>  {
> diff --git a/hw/pci.h b/hw/pci.h
> index 071a044..bbaa013 100644
> --- a/hw/pci.h
> +++ b/hw/pci.h
> @@ -279,6 +279,7 @@ PCIDevice *pci_nic_init(NICInfo *nd, const char *default_model,
>                          const char *default_devaddr);
>  PCIDevice *pci_nic_init_nofail(NICInfo *nd, const char *default_model,
>                                 const char *default_devaddr);
> +PCIDevice *pc_cpustate_init(const char *default_devaddr);
>  int pci_bus_num(PCIBus *s);
>  void pci_for_each_device(PCIBus *bus, int bus_num, void (*fn)(PCIBus *bus, PCIDevice *d));
>  PCIBus *pci_find_root_bus(int domain);
> diff --git a/hw/pci_cpustate.c b/hw/pci_cpustate.c
> new file mode 100644
> index 0000000..fd31a1f
> --- /dev/null
> +++ b/hw/pci_cpustate.c
> @@ -0,0 +1,105 @@
> +/* pci_cpustate.c
> + * emulate a pci device to get guest os CPU_DEAD event
> + *
> + * Copyright IBM, Corp. 2011
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>
> + */
> +#include <zlib.h>
> +#include "hw.h"
> +#include "pci.h"
> +#include "qemu-timer.h"
> +#include "net.h"
> +#include "loader.h"
> +#include "sysemu.h"
> +#include "iov.h"
> +
> +#define PCI_DEVICE_ID_CPUSTATE  0x1010
> +#define CPUSTATE_REGS_SIZE  0x1000
> +
> +typedef struct VcpuState VcpuState;
> +
> +struct VcpuState {
> +    PCIDevice dev;
> +    MemoryRegion mmio;
> +    int mmio_io_addr;
> +    int mmio_index;
> +    uint32_t cpuid;
> +    uint32_t cpu_state;
> +};
> +
> +static const VMStateDescription vmstate_cpustate = {
> +    .name = "cpustate",
> +    .version_id = 1,
> +    .minimum_version_id = 0,
> +    .fields      = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static void
> +cpustate_mmio_write(void *opaque, target_phys_addr_t addr, uint64_t val,
> +                 unsigned size)
> +{
> +}
> +
> +static uint64_t
> +cpustate_mmio_read(void *opaque, target_phys_addr_t addr, unsigned size)
> +{
> +    return 0;
> +}
> +
> +static const MemoryRegionOps cpustate_ops = {
> +    .read = cpustate_mmio_read,
> +    .write = cpustate_mmio_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +};
> +
> +static int pci_cpustate_init(PCIDevice *dev)
> +{
> +    uint8_t *pci_cfg = dev->config;
> +    VcpuState *s = DO_UPCAST(VcpuState, dev, dev);
> +    memory_region_init_io(&s->mmio, &cpustate_ops, s, "cpustate",
> +                        CPUSTATE_REGS_SIZE);
> +    pci_cfg[PCI_INTERRUPT_PIN] = 1;
> +    /* I/O handler for memory-mapped I/O */
> +    pci_register_bar(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,  &s->mmio);
> +    return 0;
> +}
> +
> +static int pci_cpustate_exit(PCIDevice *dev)
> +{
> +    return 0;
> +}
> +
> +static PCIDeviceInfo cpustate_info = {
> +    .qdev.name  = "cpustate",
> +    .qdev.size  = sizeof(VcpuState),
> +    .qdev.vmsd  = &vmstate_cpustate,
> +    .init       = pci_cpustate_init,
> +    .exit       = pci_cpustate_exit,
> +    .vendor_id  = PCI_VENDOR_ID_IBM,
> +    .device_id  = PCI_DEVICE_ID_CPUSTATE,
> +    .revision   = 0x10,
> +    .class_id   = PCI_CLASS_SYSTEM_OTHER,
> +    .qdev.props = (Property[]) {
> +        DEFINE_PROP_END_OF_LIST(),
> +    }
> +};
> +
> +static void cpustate_register_devices(void)
> +{
> +    pci_qdev_register(&cpustate_info);
> +}
> +device_init(cpustate_register_devices)
> -- 
> 1.7.4.4
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH] virtio: add a pci driver to notify host the CPU_DEAD event
  2011-11-27  2:47 ` [PATCH] virtio: add a pci driver to notify host the CPU_DEAD event Liu Ping Fan
@ 2011-11-27 11:10   ` Gleb Natapov
  0 siblings, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-11-27 11:10 UTC (permalink / raw)
  To: Liu Ping Fan
  Cc: kvm, qemu-devel, aliguori, ryanh, jan.kiszka, linux-kernel, avi

On Sun, Nov 27, 2011 at 10:47:43AM +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> A driver for qemu device "cpustate". This driver catch the guest
> CPU_DEAD event, and notify host.
> 
And if you do eject properly via ACPI this driver is replaced by 3 lines
of ACPI code and works with older guests too.

> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  drivers/virtio/Kconfig         |    6 ++
>  drivers/virtio/Makefile        |    1 +
>  drivers/virtio/cpustate_stub.c |  154 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 161 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/virtio/cpustate_stub.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 816ed08..96ad253 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -46,4 +46,10 @@ config VIRTIO_BALLOON
>  
>   	 If unsure, say N.
>  
> + config VIRTIO_CPUSTATE
> + 	tristate "Driver to notify host the cpu dead event (EXPERIMENTAL)"
> + 	depends on EXPERIMENTAL
> + 	---help---
> + 	 This drivers provides support to notify host the cpu dead event.
> +
>  endmenu
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 5a4c63c..06a5ecf 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> +obj-$(VIRTIO_CPUSTATE) += cpustate_stub.o
> diff --git a/drivers/virtio/cpustate_stub.c b/drivers/virtio/cpustate_stub.c
> new file mode 100644
> index 0000000..614da9d
> --- /dev/null
> +++ b/drivers/virtio/cpustate_stub.c
> @@ -0,0 +1,154 @@
> +/*
> + * PCI driver for qemu cpustate device. It notifies host the CPU_DEAD event
> + * in guest.
> + *
> + * Copyright IBM Corp. 2011
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +#include <linux/module.h>
> +#include <linux/list.h>
> +#include <linux/cpu.h>
> +#include <linux/pci.h>
> +#include <linux/slab.h>
> +#include <linux/interrupt.h>
> +#include <linux/spinlock.h>
> +
> +#define PCI_DEVICE_ID_CPUSTATE 0x1010
> +
> +struct cpustate_stub_regs {
> +	unsigned int cpu_phyid;
> +	unsigned int event;
> +};
> +
> +struct cpustate_stub {
> +	struct work_struct work;
> +
> +	unsigned int cpu_phyid;
> +	unsigned int event;
> +
> +	struct cpustate_stub_regs __iomem *regs;
> +};
> +
> +static struct cpustate_stub *agent;
> +
> +static void cpustate_work(struct work_struct *work)
> +{
> +	struct cpustate_stub *stub = container_of(work,
> +					struct cpustate_stub, work);
> +	printk(KERN_INFO "%s,cpu_phyid=0x%x, event=0x%x\n",
> +		__func__, stub->cpu_phyid, stub->event);
> +	stub->regs->cpu_phyid = stub->cpu_phyid;
> +	stub->regs->event = stub->event;
> +	barrier();
> +}
> +
> +static int cpu_dead_callback(struct notifier_block *b, unsigned long action,
> +				 void *data)
> +{
> +	unsigned long cpu = (unsigned long)data;
> +	int cpu_phyid;
> +	switch (action) {
> +	case CPU_DEAD:{
> +		cpu_phyid = per_cpu(x86_cpu_to_apicid, cpu);
> +		agent->cpu_phyid = cpu_phyid;
> +		agent->event = CPU_DEAD;
> +		schedule_work(&agent->work);
> +		break;
> +	}
> +	default:
> +		break;
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block __cpuinitdata cpu_dead_notifier = {
> +	.notifier_call	= cpu_dead_callback,
> +	.priority	= 10,
> +};
> +
> +static int __devinit cpustate_probe(struct pci_dev *pci_dev,
> +	const struct pci_device_id *id)
> +{
> +	int ret = 0;
> +	agent = kzalloc(sizeof(struct cpustate_stub), GFP_KERNEL);
> +	if (agent == NULL) {
> +		ret = -1;
> +		goto fail;
> +	}
> +	/* enable the device */
> +	ret = pci_enable_device(pci_dev);
> +	if (ret) {
> +		printk(KERN_WARNING "%s, pci_enable_device fail,ret=0x%x\n",
> +			__func__, ret);
> +		goto fail;
> +	}
> +
> +	ret = pci_request_regions(pci_dev, "cpustate");
> +	if (ret) {
> +		printk(KERN_WARNING "%s, pci_request_regions fail,ret=0x%x\n",
> +			__func__, ret);
> +		goto out_enable_device;
> +	}
> +
> +	agent->regs = ioremap(pci_dev->resource[0].start,
> +			pci_dev->resource[0].end - pci_dev->resource[0].start);
> +	if (agent->regs == NULL) {
> +		printk(KERN_WARNING "%s, ioremap fail\n", __func__);
> +		goto out_req_regions;
> +	}
> +
> +	INIT_WORK(&agent->work, cpustate_work);
> +	register_cpu_notifier(&cpu_dead_notifier);
> +	printk(KERN_INFO "%s, success\n", __func__);
> +	return 0;
> +
> +out_req_regions:
> +	pci_release_regions(pci_dev);
> +out_enable_device:
> +	pci_disable_device(pci_dev);
> +	kfree(agent);
> +	agent = NULL;
> +fail:
> +	printk(KERN_WARNING "%s fail\n", __func__);
> +	return ret;
> +}
> +
> +static void __devexit cpustate_remove(struct pci_dev *pci_dev)
> +{
> +	unregister_cpu_notifier(&cpu_dead_notifier);
> +}
> +
> +/* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
> +static DEFINE_PCI_DEVICE_TABLE(pci_cpustate_id_table) = {
> +	{ PCI_VENDOR_ID_IBM, PCI_DEVICE_ID_CPUSTATE,
> +		PCI_ANY_ID, PCI_ANY_ID,
> +		PCI_CLASS_SYSTEM_OTHER, 0,
> +		0 },
> +	{ 0 },
> +};
> +MODULE_DEVICE_TABLE(pci, pci_cpustate_id_table);
> +
> +static struct pci_driver pci_cpustate_driver = {
> +	.name		= "cpustate",
> +	.id_table	= pci_cpustate_id_table,
> +	.probe		= cpustate_probe,
> +	.remove		= __devexit_p(cpustate_remove),
> +};
> +
> +static int __init pci_cpustate_init(void)
> +{
> +	return pci_register_driver(&pci_cpustate_driver);
> +}
> +module_init(pci_cpustate_init);
> +
> +static void __exit pci_cpustate_exit(void)
> +{
> +	pci_unregister_driver(&pci_cpustate_driver);
> +}
> +module_exit(pci_cpustate_exit);
> +MODULE_DESCRIPTION("cpustate");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("1");
> -- 
> 1.7.4.4
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD
  2011-11-27 10:50     ` [Qemu-devel] " Gleb Natapov
@ 2011-11-28  7:16       ` Liu ping fan
  2011-11-28  8:46         ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-11-28  7:16 UTC (permalink / raw)
  To: Avi Kivity, Gleb Natapov
  Cc: aliguori, kvm, qemu-devel, linux-kernel, ryanh, jan.kiszka

On Sun, Nov 27, 2011 at 6:50 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, Nov 27, 2011 at 12:36:55PM +0200, Avi Kivity wrote:
>> On 11/27/2011 04:42 AM, Liu Ping Fan wrote:
>> > From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> >
>> > The vcpu can be safely released when
>> > --1.guest tells us that the vcpu is not needed any longer.
>> > --2.vcpu hits the last instruction _halt_
>> >
>> > If both of the conditions are satisfied, kvm exits to userspace
>> > with the reason vcpu dead. So the user thread can exit safely.
>> >
>> >
>>
>> Seems to be completely unnecessary.  If you want to exit from the vcpu
>> thread, send it a signal.
>>
Hi Avi and Gleb,

First, I wanted to make sure my assumption is right, so I can grab
your meaning more clearly -:). Could you elaborate it for me, thanks.

I had thought that when a vcpu was being removed from guest, kvm must
satisfy the following conditions to safely remove the vcpu:
--1. The tasks on vcpu in GUEST  have already been migrated to other
vcpus and ONLY idle_task left ---- The CPU_DEAD is the checkpoint.
--2. We must wait the idle task to hit native_halt() in GUEST, till
that time, this vcpu is no needed even by idle_task. In KVM, the vcpu
thread will finally sit on "kvm_vcpu_block(vcpu);"
We CAN NOT suppose the sequence of the two condition because they come
from different threads.  Am I right?

And here comes my question,
--1. I think the signal will make vcpu_run exit to user, but is it
allow vcpu thread to finally call  "kernel/exit.c : void do_exit(long
code)" in current code in kvm or in qemu?
--2. If we got CPU_DEAD event, and then send a signal to vcpu thread,
could we ensure that we have already sit on "kvm_vcpu_block(vcpu);"

Thanks and regards,
ping fan

> Also if guest "tells us that the vcpu is not needed any longer" (via
> ACPI I presume) and vcpu actually doing something critical instead of
> sitting in 1:hlt; jmp 1b loop then it is guest's problem if it stops
> working after vcpu destruction.
>


> --
>                        Gleb.
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD
  2011-11-28  7:16       ` Liu ping fan
@ 2011-11-28  8:46         ` Gleb Natapov
  0 siblings, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-11-28  8:46 UTC (permalink / raw)
  To: Liu ping fan
  Cc: Avi Kivity, aliguori, kvm, qemu-devel, linux-kernel, ryanh, jan.kiszka

On Mon, Nov 28, 2011 at 03:16:01PM +0800, Liu ping fan wrote:
> On Sun, Nov 27, 2011 at 6:50 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sun, Nov 27, 2011 at 12:36:55PM +0200, Avi Kivity wrote:
> >> On 11/27/2011 04:42 AM, Liu Ping Fan wrote:
> >> > From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >> >
> >> > The vcpu can be safely released when
> >> > --1.guest tells us that the vcpu is not needed any longer.
> >> > --2.vcpu hits the last instruction _halt_
> >> >
> >> > If both of the conditions are satisfied, kvm exits to userspace
> >> > with the reason vcpu dead. So the user thread can exit safely.
> >> >
> >> >
> >>
> >> Seems to be completely unnecessary.  If you want to exit from the vcpu
> >> thread, send it a signal.
> >>
> Hi Avi and Gleb,
> 
> First, I wanted to make sure my assumption is right, so I can grab
> your meaning more clearly -:). Could you elaborate it for me, thanks.
> 
> I had thought that when a vcpu was being removed from guest, kvm must
> satisfy the following conditions to safely remove the vcpu:
> --1. The tasks on vcpu in GUEST  have already been migrated to other
> vcpus and ONLY idle_task left ---- The CPU_DEAD is the checkpoint.
> --2. We must wait the idle task to hit native_halt() in GUEST, till
> that time, this vcpu is no needed even by idle_task. In KVM, the vcpu
> thread will finally sit on "kvm_vcpu_block(vcpu);"
> We CAN NOT suppose the sequence of the two condition because they come
> from different threads.  Am I right?
> 
No, KVM can remove vcpu whenever it told to do so (may be not in the
middle of emulated io though). It is a guest responsibility to eject cpu
only when it is safe to do so from guest's point of view.

> And here comes my question,
> --1. I think the signal will make vcpu_run exit to user, but is it
> allow vcpu thread to finally call  "kernel/exit.c : void do_exit(long
> code)" in current code in kvm or in qemu?
Yes. Why not?

> --2. If we got CPU_DEAD event, and then send a signal to vcpu thread,
> could we ensure that we have already sit on "kvm_vcpu_block(vcpu);"
CPU_DEAD event is internal to a guest (one of them). KVM does not care
about it. And to remove vcpu it does not have to sit in kvm_vcpu_block().
And actually since signal kicks vcpu thread out from kernel into userspace
you can be sure it is not sitting in kvm_vcpu_block(). 

> 
> Thanks and regards,
> ping fan
> 
> > Also if guest "tells us that the vcpu is not needed any longer" (via
> > ACPI I presume) and vcpu actually doing something critical instead of
> > sitting in 1:hlt; jmp 1b loop then it is guest's problem if it stops
> > working after vcpu destruction.
> >
> 
> 
> > --
> >                        Gleb.
> >

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-11-27 10:36   ` Avi Kivity
@ 2011-12-02  6:26     ` Liu Ping Fan
  2011-12-02 18:26       ` Jan Kiszka
                         ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-12-02  6:26 UTC (permalink / raw)
  To: avi, kvm; +Cc: linux-kernel, aliguori, gleb, jan.kiszka

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Currently, vcpu can be destructed only when kvm instance destroyed.
Change this to vcpu's destruction taken when its refcnt is zero,
and then vcpu MUST and CAN be destroyed before kvm's destroy.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/x86/kvm/i8254.c     |   14 +++++-
 arch/x86/kvm/i8259.c     |   14 +++++-
 arch/x86/kvm/mmu.c       |   12 ++++-
 arch/x86/kvm/x86.c       |   66 +++++++++++++++++++------------
 include/linux/kvm_host.h |   24 +++++++++--
 virt/kvm/irq_comm.c      |   16 ++++++--
 virt/kvm/kvm_main.c      |   98 ++++++++++++++++++++++++++++++++++++++++------
 7 files changed, 188 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 76e3f1c..36e9943 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -289,7 +289,7 @@ static void pit_do_work(struct work_struct *work)
 	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
 	struct kvm *kvm = pit->kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int i, cnt;
 	struct kvm_kpit_state *ps = &pit->pit_state;
 	int inject = 0;
 
@@ -315,9 +315,17 @@ static void pit_do_work(struct work_struct *work)
 		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 		 * VCPU0, and only if its LVT0 is in EXTINT mode.
 		 */
-		if (kvm->arch.vapics_in_nmi_mode > 0)
-			kvm_for_each_vcpu(i, vcpu, kvm)
+		if (kvm->arch.vapics_in_nmi_mode > 0) {
+			rcu_read_lock();
+			kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
+				vcpu = kvm_get_vcpu(kvm, i);
+				if (vcpu == NULL)
+					continue;
+				cnt++;
 				kvm_apic_nmi_wd_deliver(vcpu);
+			}
+			rcu_read_unlock();
+		}
 	}
 }
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index cac4746..529057c 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -50,25 +50,33 @@ static void pic_unlock(struct kvm_pic *s)
 {
 	bool wakeup = s->wakeup_needed;
 	struct kvm_vcpu *vcpu, *found = NULL;
-	int i;
+	struct kvm *kvm = s->kvm;
+	int i, cnt;
 
 	s->wakeup_needed = false;
 
 	spin_unlock(&s->lock);
 
 	if (wakeup) {
-		kvm_for_each_vcpu(i, vcpu, s->kvm) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
+			vcpu = kvm_get_vcpu(kvm, i);
+			if (vcpu == NULL)
+				continue;
+			cnt++;
 			if (kvm_apic_accept_pic_intr(vcpu)) {
 				found = vcpu;
 				break;
 			}
 		}
-
+		found = kvm_vcpu_get(found);
+		rcu_read_unlock();
 		if (!found)
 			return;
 
 		kvm_make_request(KVM_REQ_EVENT, found);
 		kvm_vcpu_kick(found);
+		kvm_vcpu_put(found);
 	}
 }
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f1b36cf..b9c3a01 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1833,11 +1833,17 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 
 static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
 {
-	int i;
+	int i, cnt;
 	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	rcu_read_lock();
+	kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
+		vcpu = kvm_get_vcpu(kvm, i);
+		if (vcpu == NULL)
+			continue;
+		cnt++;
 		vcpu->arch.last_pte_updated = NULL;
+	}
+	rcu_read_unlock();
 }
 
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c38efd7..5bd8b95 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1830,11 +1830,19 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 
 	switch (msr) {
 	case HV_X64_MSR_VP_INDEX: {
-		int r;
+		int r, cnt;
 		struct kvm_vcpu *v;
-		kvm_for_each_vcpu(r, v, vcpu->kvm)
+		struct kvm *kvm =  vcpu->kvm;
+		rcu_read_lock();
+		kvm_for_each_vcpu(r, cnt, v, kvm) {
+			v = kvm_get_vcpu(kvm, r);
+			if (v == NULL)
+				continue;
+			cnt++;
 			if (v == vcpu)
 				data = r;
+		}
+		rcu_read_unlock();
 		break;
 	}
 	case HV_X64_MSR_EOI:
@@ -4966,7 +4974,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 	struct cpufreq_freqs *freq = data;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i, send_ipi = 0;
+	int i, cnt, send_ipi = 0;
 
 	/*
 	 * We allow guests to temporarily run on slowing clocks,
@@ -5016,13 +5024,20 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 
 	raw_spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+
+		rcu_read_lock();
+		kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
+			vcpu = kvm_get_vcpu(kvm, i);
+			if (vcpu == NULL)
+				continue;
+			cnt++;
 			if (vcpu->cpu != freq->cpu)
 				continue;
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
+		rcu_read_unlock();
 	}
 	raw_spin_unlock(&kvm_lock);
 
@@ -6433,13 +6448,21 @@ int kvm_arch_hardware_enable(void *garbage)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int i, cnt;
 
 	kvm_shared_msr_cpu_online();
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
+			vcpu = kvm_get_vcpu(kvm, i);
+			if (vcpu == NULL)
+				continue;
+			cnt++;
 			if (vcpu->cpu == smp_processor_id())
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		}
+		rcu_read_unlock();
+	}
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
@@ -6560,27 +6583,19 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
-{
-	unsigned int i;
-	struct kvm_vcpu *vcpu;
 
-	/*
-	 * Unpin any mmu pages first.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_unload_vcpu_mmu(vcpu);
-	}
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_arch_vcpu_free(vcpu);
 
-	mutex_lock(&kvm->lock);
-	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-		kvm->vcpus[i] = NULL;
+void kvm_arch_vcpu_zap(struct work_struct *work)
+{
+	struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
+			zap_work);
+	struct kvm *kvm = vcpu->kvm;
 
-	atomic_set(&kvm->online_vcpus, 0);
-	mutex_unlock(&kvm->lock);
+	printk(KERN_INFO "%s, zap vcpu:0x%x\n", __func__, vcpu->vcpu_id);
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+	kvm_put_kvm(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
@@ -6594,7 +6609,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d526231..4d70ff5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -19,6 +19,7 @@
 #include <linux/slab.h>
 #include <linux/rcupdate.h>
 #include <linux/ratelimit.h>
+#include <linux/atomic.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -113,6 +114,9 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
+	atomic_t refcount;
+	struct rcu_head head;
+	struct work_struct zap_work;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
@@ -290,16 +294,26 @@ struct kvm {
 #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
 #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
 
+struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
+void kvm_vcpu_put(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_zap(struct work_struct *work);
+
+/*search vcpu, must be protected by rcu_read_lock*/
 static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 {
+	struct kvm_vcpu *vcpu;
 	smp_rmb();
-	return kvm->vcpus[i];
+	vcpu = kvm->vcpus[i];
+	if (vcpu != NULL && atomic_read(&vcpu->refcount) != 0)
+		return vcpu;
+
+	return NULL;
 }
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-	for (idx = 0; \
-	     idx < atomic_read(&kvm->online_vcpus) && \
-	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
+#define kvm_for_each_vcpu(idx, cnt, vcpup, kvm) \
+	for (idx = 0, cnt = 0; \
+	     cnt < atomic_read(&kvm->online_vcpus) && \
+	     idx < KVM_MAX_VCPUS; \
 	     idx++)
 
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9f614b4..3b805f3 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -81,14 +81,19 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq)
 {
-	int i, r = -1;
+	int i, cnt, r = -1;
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 			kvm_is_dm_lowest_prio(irq))
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
 
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	rcu_read_lock();
+	kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
+		vcpu = kvm_get_vcpu(kvm, i);
+		if (vcpu == NULL)
+			continue;
+		cnt++;
 		if (!kvm_apic_present(vcpu))
 			continue;
 
@@ -107,10 +112,13 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 				lowest = vcpu;
 		}
 	}
+	lowest = kvm_vcpu_get(lowest);
+	rcu_read_unlock();
 
-	if (lowest)
+	if (lowest) {
 		r = kvm_apic_set_irq(lowest, irq);
-
+		kvm_vcpu_put(lowest);
+	}
 	return r;
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..87191bb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -171,7 +171,7 @@ static void ack_flush(void *_completed)
 
 static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
-	int i, cpu, me;
+	int i, cnt, cpu, me;
 	cpumask_var_t cpus;
 	bool called = true;
 	struct kvm_vcpu *vcpu;
@@ -179,7 +179,13 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
 
 	me = get_cpu();
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+
+	rcu_read_lock();
+	kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
+		vcpu = kvm_get_vcpu(kvm, i);
+		if (vcpu == NULL)
+			continue;
+		cnt++;
 		kvm_make_request(req, vcpu);
 		cpu = vcpu->cpu;
 
@@ -190,12 +196,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
 			cpumask_set_cpu(cpu, cpus);
 	}
+
 	if (unlikely(cpus == NULL))
 		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
 	else if (!cpumask_empty(cpus))
 		smp_call_function_many(cpus, ack_flush, NULL, 1);
 	else
 		called = false;
+	rcu_read_unlock();
+
 	put_cpu();
 	free_cpumask_var(cpus);
 	return called;
@@ -580,6 +589,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_free_vm(kvm);
 	hardware_disable_all();
 	mmdrop(mm);
+	printk(KERN_INFO "%s finished\n", __func__);
 }
 
 void kvm_get_kvm(struct kvm *kvm)
@@ -1543,7 +1553,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
 	int pass;
-	int i;
+	int i, cnt;
 
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
@@ -1553,9 +1563,14 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
+			vcpu = kvm_get_vcpu(kvm, i);
+			if (vcpu == NULL)
+				continue;
+			cnt++;
 			if (!pass && i < last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
@@ -1584,6 +1599,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 			put_task_struct(task);
 		}
+		rcu_read_unlock();
 	}
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
@@ -1623,8 +1639,8 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
-
-	kvm_put_kvm(vcpu->kvm);
+	filp->private_data = NULL;
+	kvm_vcpu_put(vcpu);
 	return 0;
 }
 
@@ -1646,15 +1662,57 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
 	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
 }
 
+/*Can not block*/
+void kvm_vcpu_zap(struct rcu_head *rcu)
+{
+	struct kvm_vcpu *vcpu = container_of(rcu, struct kvm_vcpu, head);
+	schedule_work(&vcpu->zap_work);
+}
+
+/*increase refcnt*/
+struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
+{
+	if (vcpu == NULL)
+		return NULL;
+	if (atomic_add_unless(&vcpu->refcount, 1, 0))
+		return vcpu;
+	return NULL;
+}
+
+void kvm_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm;
+	if (atomic_dec_and_test(&vcpu->refcount)) {
+		kvm = vcpu->kvm;
+		mutex_lock(&kvm->lock);
+		kvm->vcpus[vcpu->vcpu_id] = NULL;
+		atomic_dec(&kvm->online_vcpus);
+		mutex_unlock(&kvm->lock);
+		call_rcu(&vcpu->head, kvm_vcpu_zap);
+	}
+}
+
+static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
+{
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_arch_vcpu_create(kvm, id);
+	if (IS_ERR(vcpu))
+		return vcpu;
+
+	atomic_set(&vcpu->refcount, 1);
+	INIT_WORK(&vcpu->zap_work, kvm_arch_vcpu_zap);
+	return vcpu;
+}
+
 /*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
 static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 {
-	int r;
+	int r, cnt;
 	struct kvm_vcpu *vcpu, *v;
 
-	vcpu = kvm_arch_vcpu_create(kvm, id);
+	vcpu = kvm_vcpu_create(kvm, id);
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
@@ -1670,11 +1728,19 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm_for_each_vcpu(r, v, kvm)
+	rcu_read_lock();
+	kvm_for_each_vcpu(r, cnt, v, kvm) {
+		v = kvm_get_vcpu(kvm, r);
+		if (v == NULL)
+			continue;
+		cnt++;
 		if (v->vcpu_id == id) {
+			rcu_read_unlock();
 			r = -EEXIST;
 			goto unlock_vcpu_destroy;
 		}
+	}
+	rcu_read_unlock();
 
 	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
 
@@ -2593,13 +2659,21 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int i, cnt;
 
 	*val = 0;
 	raw_spin_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
+			vcpu = kvm_get_vcpu(kvm, i);
+			if (vcpu == NULL)
+				continue;
+			cnt++;
 			*val += *(u32 *)((void *)vcpu + offset);
+		}
+		rcu_read_unlock();
+	}
 
 	raw_spin_unlock(&kvm_lock);
 	return 0;
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-02  6:26     ` [PATCH] " Liu Ping Fan
@ 2011-12-02 18:26       ` Jan Kiszka
  2011-12-04 11:53         ` Liu ping fan
  2011-12-04 10:23       ` Avi Kivity
  2011-12-09  5:23       ` [PATCH V2] " Liu Ping Fan
  2 siblings, 1 reply; 78+ messages in thread
From: Jan Kiszka @ 2011-12-02 18:26 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: avi, kvm, linux-kernel, aliguori, gleb

On 2011-12-02 07:26, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction taken when its refcnt is zero,
> and then vcpu MUST and CAN be destroyed before kvm's destroy.

I'm lacking the big picture yet (would be good to have in the change log
- at least I'm too lazy to read the code):

What increments the refcnt, what decrements it again? IOW, how does user
space controls the life-cycle of a vcpu after your changes?

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-02  6:26     ` [PATCH] " Liu Ping Fan
  2011-12-02 18:26       ` Jan Kiszka
@ 2011-12-04 10:23       ` Avi Kivity
  2011-12-05  5:29         ` Liu ping fan
  2011-12-09  5:23       ` [PATCH V2] " Liu Ping Fan
  2 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-04 10:23 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, linux-kernel, aliguori, gleb, jan.kiszka

On 12/02/2011 08:26 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction taken when its refcnt is zero,
> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>
>  
> @@ -315,9 +315,17 @@ static void pit_do_work(struct work_struct *work)
>  		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>  		 * VCPU0, and only if its LVT0 is in EXTINT mode.
>  		 */
> -		if (kvm->arch.vapics_in_nmi_mode > 0)
> -			kvm_for_each_vcpu(i, vcpu, kvm)
> +		if (kvm->arch.vapics_in_nmi_mode > 0) {
> +			rcu_read_lock();
> +			kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
> +				vcpu = kvm_get_vcpu(kvm, i);
> +				if (vcpu == NULL)
> +					continue;
> +				cnt++;
>  				kvm_apic_nmi_wd_deliver(vcpu);
> +			}
> +			rcu_read_unlock();
> +		}
>  	}
>  }

This pattern keeps repeating, please fold it into kvm_for_each_vcpu().

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-02 18:26       ` Jan Kiszka
@ 2011-12-04 11:53         ` Liu ping fan
  2011-12-04 12:10           ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-04 11:53 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: avi, kvm, linux-kernel, aliguori, gleb

On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2011-12-02 07:26, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction taken when its refcnt is zero,
>> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>
> I'm lacking the big picture yet (would be good to have in the change log
> - at least I'm too lazy to read the code):
>
> What increments the refcnt, what decrements it again? IOW, how does user
> space controls the life-cycle of a vcpu after your changes?
>
In local APIC mode, delivering IPI to target APIC, target's refcnt is
incremented, and decremented when finished. At other times, using RCU to
protect the vcpu's reference from its destruction.

If kvm_vcpu is not needed by guest, user space can close the
kvm_vcpu's file
descriptors, and then,if the kvm_vcpu has crossed the period of local
APCI mode's reference,it will be destroyed.

Regards,
ping fan

> Thanks,
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-04 11:53         ` Liu ping fan
@ 2011-12-04 12:10           ` Gleb Natapov
  2011-12-05  5:39             ` Liu ping fan
  0 siblings, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-12-04 12:10 UTC (permalink / raw)
  To: Liu ping fan; +Cc: Jan Kiszka, avi, kvm, linux-kernel, aliguori

On Sun, Dec 04, 2011 at 07:53:37PM +0800, Liu ping fan wrote:
> On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> > On 2011-12-02 07:26, Liu Ping Fan wrote:
> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >>
> >> Currently, vcpu can be destructed only when kvm instance destroyed.
> >> Change this to vcpu's destruction taken when its refcnt is zero,
> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> >
> > I'm lacking the big picture yet (would be good to have in the change log
> > - at least I'm too lazy to read the code):
> >
> > What increments the refcnt, what decrements it again? IOW, how does user
> > space controls the life-cycle of a vcpu after your changes?
> >
> In local APIC mode, delivering IPI to target APIC, target's refcnt is
> incremented, and decremented when finished. At other times, using RCU to
Why is this needed?

> protect the vcpu's reference from its destruction.
> 
> If kvm_vcpu is not needed by guest, user space can close the
> kvm_vcpu's file
> descriptors, and then,if the kvm_vcpu has crossed the period of local
> APCI mode's reference,it will be destroyed.
> 
> Regards,
> ping fan
> 
> > Thanks,
> > Jan
> >
> > --
> > Siemens AG, Corporate Technology, CT T DE IT 1
> > Corporate Competence Center Embedded Linux

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-04 10:23       ` Avi Kivity
@ 2011-12-05  5:29         ` Liu ping fan
  2011-12-05  9:30           ` Avi Kivity
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-05  5:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, aliguori, gleb, jan.kiszka

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2375 bytes --]

On Sun, Dec 4, 2011 at 6:23 PM, Avi Kivity <avi@redhat.com> wrote:
> On 12/02/2011 08:26 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction taken when its refcnt is zero,
>> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>>
>>
>> @@ -315,9 +315,17 @@ static void pit_do_work(struct work_struct *work)
>>                * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>>                * VCPU0, and only if its LVT0 is in EXTINT mode.
>>                */
>> -             if (kvm->arch.vapics_in_nmi_mode > 0)
>> -                     kvm_for_each_vcpu(i, vcpu, kvm)
>> +             if (kvm->arch.vapics_in_nmi_mode > 0) {
>> +                     rcu_read_lock();
>> +                     kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
>> +                             vcpu = kvm_get_vcpu(kvm, i);
>> +                             if (vcpu == NULL)
>> +                                     continue;
>> +                             cnt++;
>>                               kvm_apic_nmi_wd_deliver(vcpu);
>> +                     }
>> +                     rcu_read_unlock();
>> +             }
>>       }
>>  }
>
> This pattern keeps repeating, please fold it into kvm_for_each_vcpu().
>
What about folding
kvm_for_each_vcpu(i, cnt, vcpu, kvm) {
              vcpu = kvm_get_vcpu(kvm, i);
              if (vcpu == NULL)
                        continue;
              cnt++;

like this,
#define kvm_for_each_vcpu(idx, cnt, vcpup, kvm) \
	for (idx = 0, cnt = 0, vcpup = kvm_get_vcpu(kvm, idx); \
	     cnt < atomic_read(&kvm->online_vcpus) && \
	     idx < KVM_MAX_VCPUS; \
	     idx++, (vcpup == NULL)?:cnt++, vcpup = kvm_get_vcpu(kvm, idx)) \
	     if (vcpup == NULL) \
	          continue; \
	     else


A little ugly, but have not thought a better way out :-)

Thanks,
ping fan
> --
> error compiling committee.c: too many arguments to function
>
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-04 12:10           ` Gleb Natapov
@ 2011-12-05  5:39             ` Liu ping fan
  2011-12-05  8:41               ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-05  5:39 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, avi, kvm, linux-kernel, aliguori

On Sun, Dec 4, 2011 at 8:10 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sun, Dec 04, 2011 at 07:53:37PM +0800, Liu ping fan wrote:
>> On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> > On 2011-12-02 07:26, Liu Ping Fan wrote:
>> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> >>
>> >> Currently, vcpu can be destructed only when kvm instance destroyed.
>> >> Change this to vcpu's destruction taken when its refcnt is zero,
>> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>> >
>> > I'm lacking the big picture yet (would be good to have in the change log
>> > - at least I'm too lazy to read the code):
>> >
>> > What increments the refcnt, what decrements it again? IOW, how does user
>> > space controls the life-cycle of a vcpu after your changes?
>> >
>> In local APIC mode, delivering IPI to target APIC, target's refcnt is
>> incremented, and decremented when finished. At other times, using RCU to
> Why is this needed?
>
Suppose the following scene:

#define kvm_for_each_vcpu(idx, vcpup, kvm) \
        for (idx = 0; \
             idx < atomic_read(&kvm->online_vcpus) && \
             (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
             idx++)

------------------------------------------------------------------------------------------>
Here kvm_vcpu's destruction is called
              vcpup->vcpu_id ...  //oops!


Regards,
ping fan



>> protect the vcpu's reference from its destruction.
>>
>> If kvm_vcpu is not needed by guest, user space can close the
>> kvm_vcpu's file
>> descriptors, and then,if the kvm_vcpu has crossed the period of local
>> APCI mode's reference,it will be destroyed.
>>
>> Regards,
>> ping fan
>>
>> > Thanks,
>> > Jan
>> >
>> > --
>> > Siemens AG, Corporate Technology, CT T DE IT 1
>> > Corporate Competence Center Embedded Linux
>
> --
>                        Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-05  5:39             ` Liu ping fan
@ 2011-12-05  8:41               ` Gleb Natapov
  2011-12-06  6:54                 ` Liu ping fan
  0 siblings, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-12-05  8:41 UTC (permalink / raw)
  To: Liu ping fan; +Cc: Jan Kiszka, avi, kvm, linux-kernel, aliguori

On Mon, Dec 05, 2011 at 01:39:37PM +0800, Liu ping fan wrote:
> On Sun, Dec 4, 2011 at 8:10 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Sun, Dec 04, 2011 at 07:53:37PM +0800, Liu ping fan wrote:
> >> On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> >> > On 2011-12-02 07:26, Liu Ping Fan wrote:
> >> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >> >>
> >> >> Currently, vcpu can be destructed only when kvm instance destroyed.
> >> >> Change this to vcpu's destruction taken when its refcnt is zero,
> >> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> >> >
> >> > I'm lacking the big picture yet (would be good to have in the change log
> >> > - at least I'm too lazy to read the code):
> >> >
> >> > What increments the refcnt, what decrements it again? IOW, how does user
> >> > space controls the life-cycle of a vcpu after your changes?
> >> >
> >> In local APIC mode, delivering IPI to target APIC, target's refcnt is
> >> incremented, and decremented when finished. At other times, using RCU to
> > Why is this needed?
> >
> Suppose the following scene:
> 
> #define kvm_for_each_vcpu(idx, vcpup, kvm) \
>         for (idx = 0; \
>              idx < atomic_read(&kvm->online_vcpus) && \
>              (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
>              idx++)
> 
> ------------------------------------------------------------------------------------------>
> Here kvm_vcpu's destruction is called
>               vcpup->vcpu_id ...  //oops!
> 
> 
And this is exactly how your code looks. i.e you do not increment
reference count in most of the loops, you only increment it twice
(in pic_unlock() and kvm_irq_delivery_to_apic()) because you are using
vcpu outside of rcu_read_lock() protected section and I do not see why
not just extend protected section to include kvm_vcpu_kick(). As far as
I can see this function does not sleep.

What should protect vcpu from disappearing in your example above is RCU
itself if you are using it right. But since I do not see any calls to
rcu_assign_pointer()/rcu_dereference() I doubt you are using it right
actually.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-05  5:29         ` Liu ping fan
@ 2011-12-05  9:30           ` Avi Kivity
  2011-12-05  9:42             ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-05  9:30 UTC (permalink / raw)
  To: Liu ping fan; +Cc: kvm, linux-kernel, aliguori, gleb, jan.kiszka

On 12/05/2011 07:29 AM, Liu ping fan wrote:
> like this,
> #define kvm_for_each_vcpu(idx, cnt, vcpup, kvm) \
> 	for (idx = 0, cnt = 0, vcpup = kvm_get_vcpu(kvm, idx); \
> 	     cnt < atomic_read(&kvm->online_vcpus) && \
> 	     idx < KVM_MAX_VCPUS; \
> 	     idx++, (vcpup == NULL)?:cnt++, vcpup = kvm_get_vcpu(kvm, idx)) \
> 	     if (vcpup == NULL) \
> 	          continue; \
> 	     else
>
>
> A little ugly, but have not thought a better way out :-)
>

#define kvm_for_each_vcpu(vcpu, it) for (vcpu = kvm_fev_init(&it); vcpu;
vcpu = kvm_fev_next(&it, vcpu))

Though that doesn't give a good place for rcu_read_unlock().



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-05  9:30           ` Avi Kivity
@ 2011-12-05  9:42             ` Gleb Natapov
  2011-12-05  9:58               ` Avi Kivity
  0 siblings, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-12-05  9:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Liu ping fan, kvm, linux-kernel, aliguori, jan.kiszka

On Mon, Dec 05, 2011 at 11:30:51AM +0200, Avi Kivity wrote:
> On 12/05/2011 07:29 AM, Liu ping fan wrote:
> > like this,
> > #define kvm_for_each_vcpu(idx, cnt, vcpup, kvm) \
> > 	for (idx = 0, cnt = 0, vcpup = kvm_get_vcpu(kvm, idx); \
> > 	     cnt < atomic_read(&kvm->online_vcpus) && \
> > 	     idx < KVM_MAX_VCPUS; \
> > 	     idx++, (vcpup == NULL)?:cnt++, vcpup = kvm_get_vcpu(kvm, idx)) \
> > 	     if (vcpup == NULL) \
> > 	          continue; \
> > 	     else
> >
> >
> > A little ugly, but have not thought a better way out :-)
> >
> 
> #define kvm_for_each_vcpu(vcpu, it) for (vcpu = kvm_fev_init(&it); vcpu;
> vcpu = kvm_fev_next(&it, vcpu))
> 
> Though that doesn't give a good place for rcu_read_unlock().
> 
> 
Why not use rculist to store vcpus and use list_for_each_entry_rcu()?

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-05  9:42             ` Gleb Natapov
@ 2011-12-05  9:58               ` Avi Kivity
  2011-12-05 10:18                 ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-05  9:58 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Liu ping fan, kvm, linux-kernel, aliguori, jan.kiszka

On 12/05/2011 11:42 AM, Gleb Natapov wrote:
> On Mon, Dec 05, 2011 at 11:30:51AM +0200, Avi Kivity wrote:
> > On 12/05/2011 07:29 AM, Liu ping fan wrote:
> > > like this,
> > > #define kvm_for_each_vcpu(idx, cnt, vcpup, kvm) \
> > > 	for (idx = 0, cnt = 0, vcpup = kvm_get_vcpu(kvm, idx); \
> > > 	     cnt < atomic_read(&kvm->online_vcpus) && \
> > > 	     idx < KVM_MAX_VCPUS; \
> > > 	     idx++, (vcpup == NULL)?:cnt++, vcpup = kvm_get_vcpu(kvm, idx)) \
> > > 	     if (vcpup == NULL) \
> > > 	          continue; \
> > > 	     else
> > >
> > >
> > > A little ugly, but have not thought a better way out :-)
> > >
> > 
> > #define kvm_for_each_vcpu(vcpu, it) for (vcpu = kvm_fev_init(&it); vcpu;
> > vcpu = kvm_fev_next(&it, vcpu))
> > 
> > Though that doesn't give a good place for rcu_read_unlock().
> > 
> > 
> Why not use rculist to store vcpus and use list_for_each_entry_rcu()?

We can, but that's a bigger change.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-05  9:58               ` Avi Kivity
@ 2011-12-05 10:18                 ` Gleb Natapov
  2011-12-05 10:22                   ` Avi Kivity
  0 siblings, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-12-05 10:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Liu ping fan, kvm, linux-kernel, aliguori, jan.kiszka

On Mon, Dec 05, 2011 at 11:58:56AM +0200, Avi Kivity wrote:
> On 12/05/2011 11:42 AM, Gleb Natapov wrote:
> > On Mon, Dec 05, 2011 at 11:30:51AM +0200, Avi Kivity wrote:
> > > On 12/05/2011 07:29 AM, Liu ping fan wrote:
> > > > like this,
> > > > #define kvm_for_each_vcpu(idx, cnt, vcpup, kvm) \
> > > > 	for (idx = 0, cnt = 0, vcpup = kvm_get_vcpu(kvm, idx); \
> > > > 	     cnt < atomic_read(&kvm->online_vcpus) && \
> > > > 	     idx < KVM_MAX_VCPUS; \
> > > > 	     idx++, (vcpup == NULL)?:cnt++, vcpup = kvm_get_vcpu(kvm, idx)) \
> > > > 	     if (vcpup == NULL) \
> > > > 	          continue; \
> > > > 	     else
> > > >
> > > >
> > > > A little ugly, but have not thought a better way out :-)
> > > >
> > > 
> > > #define kvm_for_each_vcpu(vcpu, it) for (vcpu = kvm_fev_init(&it); vcpu;
> > > vcpu = kvm_fev_next(&it, vcpu))
> > > 
> > > Though that doesn't give a good place for rcu_read_unlock().
> > > 
> > > 
> > Why not use rculist to store vcpus and use list_for_each_entry_rcu()?
> 
> We can, but that's a bigger change.
> 
Is it? I do not see a lot of accesses to vcpu array except those loops.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-05 10:18                 ` Gleb Natapov
@ 2011-12-05 10:22                   ` Avi Kivity
  2011-12-05 10:40                     ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-05 10:22 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Liu ping fan, kvm, linux-kernel, aliguori, jan.kiszka

On 12/05/2011 12:18 PM, Gleb Natapov wrote:
> > 
> > We can, but that's a bigger change.
> > 
> Is it? I do not see a lot of accesses to vcpu array except those loops.
>

Well actually some of those loops have to go away and be replaced by a
hash lookup with apic id as key.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-05 10:22                   ` Avi Kivity
@ 2011-12-05 10:40                     ` Gleb Natapov
  0 siblings, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-12-05 10:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Liu ping fan, kvm, linux-kernel, aliguori, jan.kiszka

On Mon, Dec 05, 2011 at 12:22:53PM +0200, Avi Kivity wrote:
> On 12/05/2011 12:18 PM, Gleb Natapov wrote:
> > > 
> > > We can, but that's a bigger change.
> > > 
> > Is it? I do not see a lot of accesses to vcpu array except those loops.
> >
> 
> Well actually some of those loops have to go away and be replaced by a
> hash lookup with apic id as key.
> 
Yes, but apic ids are guest controllable, so there should be separate hash that
will hold vcpu to gust configured apic id mapping. Shouldn't prevent us
from moving to rculist now.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-05  8:41               ` Gleb Natapov
@ 2011-12-06  6:54                 ` Liu ping fan
  2011-12-06  8:14                   ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-06  6:54 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Jan Kiszka, avi, kvm, linux-kernel, aliguori

On Mon, Dec 5, 2011 at 4:41 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Mon, Dec 05, 2011 at 01:39:37PM +0800, Liu ping fan wrote:
>> On Sun, Dec 4, 2011 at 8:10 PM, Gleb Natapov <gleb@redhat.com> wrote:
>> > On Sun, Dec 04, 2011 at 07:53:37PM +0800, Liu ping fan wrote:
>> >> On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> >> > On 2011-12-02 07:26, Liu Ping Fan wrote:
>> >> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> >> >>
>> >> >> Currently, vcpu can be destructed only when kvm instance destroyed.
>> >> >> Change this to vcpu's destruction taken when its refcnt is zero,
>> >> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>> >> >
>> >> > I'm lacking the big picture yet (would be good to have in the change log
>> >> > - at least I'm too lazy to read the code):
>> >> >
>> >> > What increments the refcnt, what decrements it again? IOW, how does user
>> >> > space controls the life-cycle of a vcpu after your changes?
>> >> >
>> >> In local APIC mode, delivering IPI to target APIC, target's refcnt is
>> >> incremented, and decremented when finished. At other times, using RCU to
>> > Why is this needed?
>> >
>> Suppose the following scene:
>>
>> #define kvm_for_each_vcpu(idx, vcpup, kvm) \
>>         for (idx = 0; \
>>              idx < atomic_read(&kvm->online_vcpus) && \
>>              (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
>>              idx++)
>>
>> ------------------------------------------------------------------------------------------>
>> Here kvm_vcpu's destruction is called
>>               vcpup->vcpu_id ...  //oops!
>>
>>
> And this is exactly how your code looks. i.e you do not increment
> reference count in most of the loops, you only increment it twice
> (in pic_unlock() and kvm_irq_delivery_to_apic()) because you are using
> vcpu outside of rcu_read_lock() protected section and I do not see why
> not just extend protected section to include kvm_vcpu_kick(). As far as
> I can see this function does not sleep.
>
:-), I just want to minimize the RCU critical area, and as you say, we
can  extend protected section to include kvm_vcpu_kick()

> What should protect vcpu from disappearing in your example above is RCU
> itself if you are using it right. But since I do not see any calls to
> rcu_assign_pointer()/rcu_dereference() I doubt you are using it right
> actually.
>
Sorry, but I thought it would not be. Please help me to check my thoughts :

struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
{
	if (vcpu == NULL)
		return NULL;
	if (atomic_add_unless(&vcpu->refcount, 1, 0))
------------------------------increment
		return vcpu;
	return NULL;
}

void kvm_vcpu_put(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm;
	if (atomic_dec_and_test(&vcpu->refcount)) {
--------------------------decrement
		kvm = vcpu->kvm;
		mutex_lock(&kvm->lock);
		kvm->vcpus[vcpu->vcpu_id] = NULL;
		atomic_dec(&kvm->online_vcpus);
		mutex_unlock(&kvm->lock);
		call_rcu(&vcpu->head, kvm_vcpu_zap);
	}
}

The atomic of decrement and increment are protected by cache coherent protocol.
So once we hold a valid kvm_vcpu pointer through kvm_vcpu_get(),
we will always keep it until we release it, then, the destruction may happen.

Thanks and regards,
ping fan

> --
>                        Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] kvm: make vcpu life cycle separated from kvm instance
  2011-12-06  6:54                 ` Liu ping fan
@ 2011-12-06  8:14                   ` Gleb Natapov
  0 siblings, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-12-06  8:14 UTC (permalink / raw)
  To: Liu ping fan; +Cc: Jan Kiszka, avi, kvm, linux-kernel, aliguori

On Tue, Dec 06, 2011 at 02:54:06PM +0800, Liu ping fan wrote:
> On Mon, Dec 5, 2011 at 4:41 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Mon, Dec 05, 2011 at 01:39:37PM +0800, Liu ping fan wrote:
> >> On Sun, Dec 4, 2011 at 8:10 PM, Gleb Natapov <gleb@redhat.com> wrote:
> >> > On Sun, Dec 04, 2011 at 07:53:37PM +0800, Liu ping fan wrote:
> >> >> On Sat, Dec 3, 2011 at 2:26 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> >> >> > On 2011-12-02 07:26, Liu Ping Fan wrote:
> >> >> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >> >> >>
> >> >> >> Currently, vcpu can be destructed only when kvm instance destroyed.
> >> >> >> Change this to vcpu's destruction taken when its refcnt is zero,
> >> >> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> >> >> >
> >> >> > I'm lacking the big picture yet (would be good to have in the change log
> >> >> > - at least I'm too lazy to read the code):
> >> >> >
> >> >> > What increments the refcnt, what decrements it again? IOW, how does user
> >> >> > space controls the life-cycle of a vcpu after your changes?
> >> >> >
> >> >> In local APIC mode, delivering IPI to target APIC, target's refcnt is
> >> >> incremented, and decremented when finished. At other times, using RCU to
> >> > Why is this needed?
> >> >
> >> Suppose the following scene:
> >>
> >> #define kvm_for_each_vcpu(idx, vcpup, kvm) \
> >>         for (idx = 0; \
> >>              idx < atomic_read(&kvm->online_vcpus) && \
> >>              (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
> >>              idx++)
> >>
> >> ------------------------------------------------------------------------------------------>
> >> Here kvm_vcpu's destruction is called
> >>               vcpup->vcpu_id ...  //oops!
> >>
> >>
> > And this is exactly how your code looks. i.e you do not increment
> > reference count in most of the loops, you only increment it twice
> > (in pic_unlock() and kvm_irq_delivery_to_apic()) because you are using
> > vcpu outside of rcu_read_lock() protected section and I do not see why
> > not just extend protected section to include kvm_vcpu_kick(). As far as
> > I can see this function does not sleep.
> >
> :-), I just want to minimize the RCU critical area, and as you say, we
> can  extend protected section to include kvm_vcpu_kick()
> 
What's the point of trying to minimize it? vcpu will not be freed quicker.

> > What should protect vcpu from disappearing in your example above is RCU
> > itself if you are using it right. But since I do not see any calls to
> > rcu_assign_pointer()/rcu_dereference() I doubt you are using it right
> > actually.
> >
> Sorry, but I thought it would not be. Please help me to check my thoughts :
> 
> struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
> {
> 	if (vcpu == NULL)
> 		return NULL;
> 	if (atomic_add_unless(&vcpu->refcount, 1, 0))
> ------------------------------increment
> 		return vcpu;
> 	return NULL;
> }
> 
> void kvm_vcpu_put(struct kvm_vcpu *vcpu)
> {
> 	struct kvm *kvm;
> 	if (atomic_dec_and_test(&vcpu->refcount)) {
> --------------------------decrement
> 		kvm = vcpu->kvm;
> 		mutex_lock(&kvm->lock);
> 		kvm->vcpus[vcpu->vcpu_id] = NULL;
> 		atomic_dec(&kvm->online_vcpus);
> 		mutex_unlock(&kvm->lock);
> 		call_rcu(&vcpu->head, kvm_vcpu_zap);
> 	}
> }
> 
> The atomic of decrement and increment are protected by cache coherent protocol.
> So once we hold a valid kvm_vcpu pointer through kvm_vcpu_get(),
> we will always keep it until we release it, then, the destruction may happen.
> 
My point is you do not need those atomics at all, not that they are
incorrect. You either protect vcpus with reference counters or RCU, but
not both. The point of RCU is that you do not need any locking
on read access to data structure, so if you add locking (by means of
reference counting) just use rwlock around access to vcpus array and be
done with it.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH V2] kvm: make vcpu life cycle separated from kvm instance
  2011-12-02  6:26     ` [PATCH] " Liu Ping Fan
  2011-12-02 18:26       ` Jan Kiszka
  2011-12-04 10:23       ` Avi Kivity
@ 2011-12-09  5:23       ` Liu Ping Fan
  2011-12-09 14:23         ` Gleb Natapov
  2 siblings, 1 reply; 78+ messages in thread
From: Liu Ping Fan @ 2011-12-09  5:23 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, avi, aliguori, gleb, jan.kiszka

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Currently, vcpu can be destructed only when kvm instance destroyed.
Change this to vcpu's destruction taken when its refcnt is zero,
and then vcpu MUST and CAN be destroyed before kvm's destroy.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/x86/kvm/i8254.c     |   10 ++++--
 arch/x86/kvm/i8259.c     |   12 ++++--
 arch/x86/kvm/mmu.c       |    7 ++--
 arch/x86/kvm/x86.c       |   54 ++++++++++++++++--------------
 include/linux/kvm_host.h |   77 +++++++++++++++++++++++++++++++++++++++---
 virt/kvm/irq_comm.c      |    7 +++-
 virt/kvm/kvm_main.c      |   82 ++++++++++++++++++++++++++++++++++++++++------
 7 files changed, 196 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 76e3f1c..ac79598 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -289,7 +289,7 @@ static void pit_do_work(struct work_struct *work)
 	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
 	struct kvm *kvm = pit->kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	struct kvm_iter it;
 	struct kvm_kpit_state *ps = &pit->pit_state;
 	int inject = 0;
 
@@ -315,9 +315,13 @@ static void pit_do_work(struct work_struct *work)
 		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 		 * VCPU0, and only if its LVT0 is in EXTINT mode.
 		 */
-		if (kvm->arch.vapics_in_nmi_mode > 0)
-			kvm_for_each_vcpu(i, vcpu, kvm)
+		if (kvm->arch.vapics_in_nmi_mode > 0) {
+			rcu_read_lock();
+			kvm_for_each_vcpu(it, vcpu, kvm) {
 				kvm_apic_nmi_wd_deliver(vcpu);
+			}
+			rcu_read_unlock();
+		}
 	}
 }
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index cac4746..2186b30 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
 {
 	bool wakeup = s->wakeup_needed;
 	struct kvm_vcpu *vcpu, *found = NULL;
-	int i;
+	struct kvm *kvm = s->kvm;
+	struct kvm_iter it;
 
 	s->wakeup_needed = false;
 
 	spin_unlock(&s->lock);
 
 	if (wakeup) {
-		kvm_for_each_vcpu(i, vcpu, s->kvm) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm)
 			if (kvm_apic_accept_pic_intr(vcpu)) {
 				found = vcpu;
 				break;
 			}
-		}
 
-		if (!found)
+		if (!found) {
+			rcu_read_unlock();
 			return;
+		}
 
 		kvm_make_request(KVM_REQ_EVENT, found);
 		kvm_vcpu_kick(found);
+		rcu_read_unlock();
 	}
 }
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f1b36cf..c16887e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1833,11 +1833,12 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 
 static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
 {
-	int i;
+	struct kvm_iter it;
 	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	rcu_read_lock();
+	kvm_for_each_vcpu(it, vcpu, kvm)
 		vcpu->arch.last_pte_updated = NULL;
+	rcu_read_unlock();
 }
 
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c38efd7..a302470 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1831,10 +1831,15 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	switch (msr) {
 	case HV_X64_MSR_VP_INDEX: {
 		int r;
+		struct kvm_iter it;
 		struct kvm_vcpu *v;
-		kvm_for_each_vcpu(r, v, vcpu->kvm)
+		struct kvm *kvm =  vcpu->kvm;
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, v, kvm) {
 			if (v == vcpu)
 				data = r;
+		}
+		rcu_read_unlock();
 		break;
 	}
 	case HV_X64_MSR_EOI:
@@ -4966,7 +4971,8 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 	struct cpufreq_freqs *freq = data;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i, send_ipi = 0;
+	int send_ipi = 0;
+	struct kvm_iter it;
 
 	/*
 	 * We allow guests to temporarily run on slowing clocks,
@@ -5016,13 +5022,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 
 	raw_spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm) {
 			if (vcpu->cpu != freq->cpu)
 				continue;
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
+		rcu_read_unlock();
 	}
 	raw_spin_unlock(&kvm_lock);
 
@@ -6433,13 +6442,17 @@ int kvm_arch_hardware_enable(void *garbage)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	struct kvm_iter it;
 
 	kvm_shared_msr_cpu_online();
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm) {
 			if (vcpu->cpu == smp_processor_id())
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		}
+		rcu_read_unlock();
+	}
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
@@ -6560,27 +6573,19 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
-{
-	unsigned int i;
-	struct kvm_vcpu *vcpu;
 
-	/*
-	 * Unpin any mmu pages first.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_unload_vcpu_mmu(vcpu);
-	}
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_arch_vcpu_free(vcpu);
 
-	mutex_lock(&kvm->lock);
-	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-		kvm->vcpus[i] = NULL;
+void kvm_arch_vcpu_zap(struct work_struct *work)
+{
+	struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
+			zap_work);
+	struct kvm *kvm = vcpu->kvm;
 
-	atomic_set(&kvm->online_vcpus, 0);
-	mutex_unlock(&kvm->lock);
+	printk(KERN_INFO "%s, zap vcpu:0x%x\n", __func__, vcpu->vcpu_id);
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+	kvm_put_kvm(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
@@ -6594,7 +6599,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d526231..f16fd09 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -19,6 +19,7 @@
 #include <linux/slab.h>
 #include <linux/rcupdate.h>
 #include <linux/ratelimit.h>
+#include <linux/atomic.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -113,6 +114,9 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
+	atomic_t refcount;
+	struct rcu_head head;
+	struct work_struct zap_work;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
@@ -290,17 +294,78 @@ struct kvm {
 #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
 #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
 
+struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
+void kvm_vcpu_put(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_zap(struct work_struct *work);
+
+/*search vcpu, must be protected by rcu_read_lock*/
 static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 {
+	struct kvm_vcpu *vcpu;
 	smp_rmb();
-	return kvm->vcpus[i];
+	vcpu = rcu_dereference(kvm->vcpus[i]);
+	if (vcpu != NULL && atomic_read(&vcpu->refcount) != 0)
+		return vcpu;
+
+	return NULL;
+}
+
+/*Must be protected by RCU*/
+struct kvm_iter {
+	struct kvm *kvm;
+	int idx;
+	int cnt;
+};
+
+static inline
+struct kvm_vcpu *kvm_fev_init(struct kvm *kvm, struct kvm_iter *it)
+{
+	int idx, cnt;
+	struct kvm_vcpu *vcpup;
+	vcpup = NULL;
+	for (idx = 0, cnt = 0;
+		cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
+		idx++) {
+			vcpup = kvm_get_vcpu(kvm, idx);
+			if (unlikely(vcpup == NULL))
+				continue;
+			cnt++;
+			break;
+	}
+
+	it->kvm = kvm;
+	it->idx = idx;
+	it->cnt = cnt;
+	return vcpup;
+}
+
+static inline
+struct kvm_vcpu *kvm_fev_next(struct kvm_iter *it)
+{
+	int idx, cnt;
+	struct kvm_vcpu *vcpup;
+	struct kvm *kvm = it->kvm;
+
+	vcpup = NULL;
+	for (idx = it->idx+1, cnt = it->cnt;
+		cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
+		idx++) {
+			vcpup = kvm_get_vcpu(kvm, idx);
+			if (unlikely(vcpup == NULL))
+				continue;
+			 cnt++;
+			 break;
+	}
+
+	it->idx = idx;
+	it->cnt = cnt;
+	return vcpup;
 }
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-	for (idx = 0; \
-	     idx < atomic_read(&kvm->online_vcpus) && \
-	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
-	     idx++)
+#define kvm_for_each_vcpu(it, vcpu, kvm) \
+	for (vcpu = kvm_fev_init(kvm, &it); \
+		vcpu; \
+		vcpu = kvm_fev_next(&it))
 
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9f614b4..87eae96 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -81,14 +81,16 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq)
 {
-	int i, r = -1;
+	int r = -1;
+	struct kvm_iter it;
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 			kvm_is_dm_lowest_prio(irq))
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
 
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	rcu_read_lock();
+	kvm_for_each_vcpu(it, vcpu, kvm) {
 		if (!kvm_apic_present(vcpu))
 			continue;
 
@@ -111,6 +113,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	if (lowest)
 		r = kvm_apic_set_irq(lowest, irq);
 
+	rcu_read_unlock();
 	return r;
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..929cfce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -171,7 +171,8 @@ static void ack_flush(void *_completed)
 
 static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
-	int i, cpu, me;
+	int cpu, me;
+	struct kvm_iter it;
 	cpumask_var_t cpus;
 	bool called = true;
 	struct kvm_vcpu *vcpu;
@@ -179,7 +180,9 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
 
 	me = get_cpu();
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+
+	rcu_read_lock();
+	kvm_for_each_vcpu(it, vcpu, kvm) {
 		kvm_make_request(req, vcpu);
 		cpu = vcpu->cpu;
 
@@ -190,12 +193,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
 			cpumask_set_cpu(cpu, cpus);
 	}
+
 	if (unlikely(cpus == NULL))
 		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
 	else if (!cpumask_empty(cpus))
 		smp_call_function_many(cpus, ack_flush, NULL, 1);
 	else
 		called = false;
+	rcu_read_unlock();
+
 	put_cpu();
 	free_cpumask_var(cpus);
 	return called;
@@ -580,6 +586,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_free_vm(kvm);
 	hardware_disable_all();
 	mmdrop(mm);
+	printk(KERN_INFO "%s finished\n", __func__);
 }
 
 void kvm_get_kvm(struct kvm *kvm)
@@ -1543,6 +1550,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
 	int pass;
+	struct kvm_iter it;
 	int i;
 
 	/*
@@ -1553,9 +1561,11 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
+			i = it.idx;
 			if (!pass && i < last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
@@ -1584,6 +1594,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 			put_task_struct(task);
 		}
+		rcu_read_unlock();
 	}
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
@@ -1623,8 +1634,8 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
-
-	kvm_put_kvm(vcpu->kvm);
+	filp->private_data = NULL;
+	kvm_vcpu_put(vcpu);
 	return 0;
 }
 
@@ -1646,6 +1657,48 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
 	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
 }
 
+/*Can not block*/
+void kvm_vcpu_zap(struct rcu_head *rcu)
+{
+	struct kvm_vcpu *vcpu = container_of(rcu, struct kvm_vcpu, head);
+	schedule_work(&vcpu->zap_work);
+}
+
+/*increase refcnt*/
+struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
+{
+	if (vcpu == NULL)
+		return NULL;
+	if (atomic_add_unless(&vcpu->refcount, 1, 0))
+		return vcpu;
+	return NULL;
+}
+
+void kvm_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm;
+	if (atomic_dec_and_test(&vcpu->refcount)) {
+		kvm = vcpu->kvm;
+		mutex_lock(&kvm->lock);
+		rcu_assign_pointer(kvm->vcpus[vcpu->vcpu_id], NULL);
+		atomic_dec(&kvm->online_vcpus);
+		mutex_unlock(&kvm->lock);
+		call_rcu(&vcpu->head, kvm_vcpu_zap);
+	}
+}
+
+static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
+{
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_arch_vcpu_create(kvm, id);
+	if (IS_ERR(vcpu))
+		return vcpu;
+
+	atomic_set(&vcpu->refcount, 1);
+	INIT_WORK(&vcpu->zap_work, kvm_arch_vcpu_zap);
+	return vcpu;
+}
+
 /*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
@@ -1653,8 +1706,9 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 {
 	int r;
 	struct kvm_vcpu *vcpu, *v;
+	struct kvm_iter it;
 
-	vcpu = kvm_arch_vcpu_create(kvm, id);
+	vcpu = kvm_vcpu_create(kvm, id);
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
@@ -1670,11 +1724,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm_for_each_vcpu(r, v, kvm)
+	rcu_read_lock();
+	kvm_for_each_vcpu(it, v, kvm) {
 		if (v->vcpu_id == id) {
+			rcu_read_unlock();
 			r = -EEXIST;
 			goto unlock_vcpu_destroy;
 		}
+	}
+	rcu_read_unlock();
 
 	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
 
@@ -2593,13 +2651,17 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	struct kvm_iter it;
 
 	*val = 0;
 	raw_spin_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm) {
 			*val += *(u32 *)((void *)vcpu + offset);
+		}
+		rcu_read_unlock();
+	}
 
 	raw_spin_unlock(&kvm_lock);
 	return 0;
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH V2] kvm: make vcpu life cycle separated from kvm instance
  2011-12-09  5:23       ` [PATCH V2] " Liu Ping Fan
@ 2011-12-09 14:23         ` Gleb Natapov
  2011-12-12  2:41           ` [PATCH v3] " Liu Ping Fan
  0 siblings, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-12-09 14:23 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, linux-kernel, avi, aliguori, jan.kiszka

On Fri, Dec 09, 2011 at 01:23:18PM +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction taken when its refcnt is zero,
> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> 
Now refcount is completely unused. It's just set to 1 during vcpu
creation and reset to 0 during destruction. Just drop it.

> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/i8254.c     |   10 ++++--
>  arch/x86/kvm/i8259.c     |   12 ++++--
>  arch/x86/kvm/mmu.c       |    7 ++--
>  arch/x86/kvm/x86.c       |   54 ++++++++++++++++--------------
>  include/linux/kvm_host.h |   77 +++++++++++++++++++++++++++++++++++++++---
>  virt/kvm/irq_comm.c      |    7 +++-
>  virt/kvm/kvm_main.c      |   82 ++++++++++++++++++++++++++++++++++++++++------
>  7 files changed, 196 insertions(+), 53 deletions(-)
> 
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index 76e3f1c..ac79598 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -289,7 +289,7 @@ static void pit_do_work(struct work_struct *work)
>  	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
>  	struct kvm *kvm = pit->kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> +	struct kvm_iter it;
>  	struct kvm_kpit_state *ps = &pit->pit_state;
>  	int inject = 0;
>  
> @@ -315,9 +315,13 @@ static void pit_do_work(struct work_struct *work)
>  		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>  		 * VCPU0, and only if its LVT0 is in EXTINT mode.
>  		 */
> -		if (kvm->arch.vapics_in_nmi_mode > 0)
> -			kvm_for_each_vcpu(i, vcpu, kvm)
> +		if (kvm->arch.vapics_in_nmi_mode > 0) {
> +			rcu_read_lock();
> +			kvm_for_each_vcpu(it, vcpu, kvm) {
>  				kvm_apic_nmi_wd_deliver(vcpu);
> +			}
> +			rcu_read_unlock();
> +		}
>  	}
>  }
>  
> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index cac4746..2186b30 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
>  {
>  	bool wakeup = s->wakeup_needed;
>  	struct kvm_vcpu *vcpu, *found = NULL;
> -	int i;
> +	struct kvm *kvm = s->kvm;
> +	struct kvm_iter it;
>  
>  	s->wakeup_needed = false;
>  
>  	spin_unlock(&s->lock);
>  
>  	if (wakeup) {
> -		kvm_for_each_vcpu(i, vcpu, s->kvm) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm)
>  			if (kvm_apic_accept_pic_intr(vcpu)) {
>  				found = vcpu;
>  				break;
>  			}
> -		}
>  
> -		if (!found)
> +		if (!found) {
> +			rcu_read_unlock();
>  			return;
> +		}
>  
>  		kvm_make_request(KVM_REQ_EVENT, found);
>  		kvm_vcpu_kick(found);
> +		rcu_read_unlock();
>  	}
>  }
>  
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f1b36cf..c16887e 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1833,11 +1833,12 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
>  
>  static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
>  {
> -	int i;
> +	struct kvm_iter it;
>  	struct kvm_vcpu *vcpu;
> -
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(it, vcpu, kvm)
>  		vcpu->arch.last_pte_updated = NULL;
> +	rcu_read_unlock();
>  }
>  
>  static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c38efd7..a302470 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1831,10 +1831,15 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>  	switch (msr) {
>  	case HV_X64_MSR_VP_INDEX: {
>  		int r;
> +		struct kvm_iter it;
>  		struct kvm_vcpu *v;
> -		kvm_for_each_vcpu(r, v, vcpu->kvm)
> +		struct kvm *kvm =  vcpu->kvm;
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, v, kvm) {
>  			if (v == vcpu)
>  				data = r;
> +		}
> +		rcu_read_unlock();
>  		break;
>  	}
>  	case HV_X64_MSR_EOI:
> @@ -4966,7 +4971,8 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>  	struct cpufreq_freqs *freq = data;
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i, send_ipi = 0;
> +	int send_ipi = 0;
> +	struct kvm_iter it;
>  
>  	/*
>  	 * We allow guests to temporarily run on slowing clocks,
> @@ -5016,13 +5022,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>  
>  	raw_spin_lock(&kvm_lock);
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
> -		kvm_for_each_vcpu(i, vcpu, kvm) {
> +
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm) {
>  			if (vcpu->cpu != freq->cpu)
>  				continue;
>  			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  			if (vcpu->cpu != smp_processor_id())
>  				send_ipi = 1;
>  		}
> +		rcu_read_unlock();
>  	}
>  	raw_spin_unlock(&kvm_lock);
>  
> @@ -6433,13 +6442,17 @@ int kvm_arch_hardware_enable(void *garbage)
>  {
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> +	struct kvm_iter it;
>  
>  	kvm_shared_msr_cpu_online();
> -	list_for_each_entry(kvm, &vm_list, vm_list)
> -		kvm_for_each_vcpu(i, vcpu, kvm)
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm) {
>  			if (vcpu->cpu == smp_processor_id())
>  				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +		}
> +		rcu_read_unlock();
> +	}
>  	return kvm_x86_ops->hardware_enable(garbage);
>  }
>  
> @@ -6560,27 +6573,19 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
>  	vcpu_put(vcpu);
>  }
>  
> -static void kvm_free_vcpus(struct kvm *kvm)
> -{
> -	unsigned int i;
> -	struct kvm_vcpu *vcpu;
>  
> -	/*
> -	 * Unpin any mmu pages first.
> -	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> -		kvm_clear_async_pf_completion_queue(vcpu);
> -		kvm_unload_vcpu_mmu(vcpu);
> -	}
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> -		kvm_arch_vcpu_free(vcpu);
>  
> -	mutex_lock(&kvm->lock);
> -	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
> -		kvm->vcpus[i] = NULL;
> +void kvm_arch_vcpu_zap(struct work_struct *work)
> +{
> +	struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
> +			zap_work);
> +	struct kvm *kvm = vcpu->kvm;
>  
> -	atomic_set(&kvm->online_vcpus, 0);
> -	mutex_unlock(&kvm->lock);
> +	printk(KERN_INFO "%s, zap vcpu:0x%x\n", __func__, vcpu->vcpu_id);
> +	kvm_clear_async_pf_completion_queue(vcpu);
> +	kvm_unload_vcpu_mmu(vcpu);
> +	kvm_arch_vcpu_free(vcpu);
> +	kvm_put_kvm(kvm);
>  }
>  
>  void kvm_arch_sync_events(struct kvm *kvm)
> @@ -6594,7 +6599,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_iommu_unmap_guest(kvm);
>  	kfree(kvm->arch.vpic);
>  	kfree(kvm->arch.vioapic);
> -	kvm_free_vcpus(kvm);
>  	if (kvm->arch.apic_access_page)
>  		put_page(kvm->arch.apic_access_page);
>  	if (kvm->arch.ept_identity_pagetable)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d526231..f16fd09 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -19,6 +19,7 @@
>  #include <linux/slab.h>
>  #include <linux/rcupdate.h>
>  #include <linux/ratelimit.h>
> +#include <linux/atomic.h>
>  #include <asm/signal.h>
>  
>  #include <linux/kvm.h>
> @@ -113,6 +114,9 @@ enum {
>  
>  struct kvm_vcpu {
>  	struct kvm *kvm;
> +	atomic_t refcount;
> +	struct rcu_head head;
> +	struct work_struct zap_work;
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	struct preempt_notifier preempt_notifier;
>  #endif
> @@ -290,17 +294,78 @@ struct kvm {
>  #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
>  #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
>  
> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_put(struct kvm_vcpu *vcpu);
> +void kvm_arch_vcpu_zap(struct work_struct *work);
> +
> +/*search vcpu, must be protected by rcu_read_lock*/
>  static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
>  {
> +	struct kvm_vcpu *vcpu;
>  	smp_rmb();
> -	return kvm->vcpus[i];
> +	vcpu = rcu_dereference(kvm->vcpus[i]);
> +	if (vcpu != NULL && atomic_read(&vcpu->refcount) != 0)
> +		return vcpu;
> +
> +	return NULL;
> +}
> +
> +/*Must be protected by RCU*/
> +struct kvm_iter {
> +	struct kvm *kvm;
> +	int idx;
> +	int cnt;
> +};
> +
> +static inline
> +struct kvm_vcpu *kvm_fev_init(struct kvm *kvm, struct kvm_iter *it)
> +{
> +	int idx, cnt;
> +	struct kvm_vcpu *vcpup;
> +	vcpup = NULL;
> +	for (idx = 0, cnt = 0;
> +		cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
> +		idx++) {
> +			vcpup = kvm_get_vcpu(kvm, idx);
> +			if (unlikely(vcpup == NULL))
> +				continue;
> +			cnt++;
> +			break;
> +	}
> +
> +	it->kvm = kvm;
> +	it->idx = idx;
> +	it->cnt = cnt;
> +	return vcpup;
> +}
> +
> +static inline
> +struct kvm_vcpu *kvm_fev_next(struct kvm_iter *it)
> +{
> +	int idx, cnt;
> +	struct kvm_vcpu *vcpup;
> +	struct kvm *kvm = it->kvm;
> +
> +	vcpup = NULL;
> +	for (idx = it->idx+1, cnt = it->cnt;
> +		cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
> +		idx++) {
> +			vcpup = kvm_get_vcpu(kvm, idx);
> +			if (unlikely(vcpup == NULL))
> +				continue;
> +			 cnt++;
> +			 break;
> +	}
> +
> +	it->idx = idx;
> +	it->cnt = cnt;
> +	return vcpup;
>  }
>  
> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
> -	for (idx = 0; \
> -	     idx < atomic_read(&kvm->online_vcpus) && \
> -	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
> -	     idx++)
> +#define kvm_for_each_vcpu(it, vcpu, kvm) \
> +	for (vcpu = kvm_fev_init(kvm, &it); \
> +		vcpu; \
> +		vcpu = kvm_fev_next(&it))
>  
>  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
>  void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 9f614b4..87eae96 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -81,14 +81,16 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq)
>  {
> -	int i, r = -1;
> +	int r = -1;
> +	struct kvm_iter it;
>  	struct kvm_vcpu *vcpu, *lowest = NULL;
>  
>  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
>  			kvm_is_dm_lowest_prio(irq))
>  		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
>  
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(it, vcpu, kvm) {
>  		if (!kvm_apic_present(vcpu))
>  			continue;
>  
> @@ -111,6 +113,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  	if (lowest)
>  		r = kvm_apic_set_irq(lowest, irq);
>  
> +	rcu_read_unlock();
>  	return r;
>  }
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d9cfb78..929cfce 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -171,7 +171,8 @@ static void ack_flush(void *_completed)
>  
>  static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  {
> -	int i, cpu, me;
> +	int cpu, me;
> +	struct kvm_iter it;
>  	cpumask_var_t cpus;
>  	bool called = true;
>  	struct kvm_vcpu *vcpu;
> @@ -179,7 +180,9 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
>  
>  	me = get_cpu();
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> +
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(it, vcpu, kvm) {
>  		kvm_make_request(req, vcpu);
>  		cpu = vcpu->cpu;
>  
> @@ -190,12 +193,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
>  			cpumask_set_cpu(cpu, cpus);
>  	}
> +
>  	if (unlikely(cpus == NULL))
>  		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
>  	else if (!cpumask_empty(cpus))
>  		smp_call_function_many(cpus, ack_flush, NULL, 1);
>  	else
>  		called = false;
> +	rcu_read_unlock();
> +
>  	put_cpu();
>  	free_cpumask_var(cpus);
>  	return called;
> @@ -580,6 +586,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	kvm_arch_free_vm(kvm);
>  	hardware_disable_all();
>  	mmdrop(mm);
> +	printk(KERN_INFO "%s finished\n", __func__);
>  }
>  
>  void kvm_get_kvm(struct kvm *kvm)
> @@ -1543,6 +1550,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>  	int yielded = 0;
>  	int pass;
> +	struct kvm_iter it;
>  	int i;
>  
>  	/*
> @@ -1553,9 +1561,11 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 * We approximate round-robin by starting at the last boosted VCPU.
>  	 */
>  	for (pass = 0; pass < 2 && !yielded; pass++) {
> -		kvm_for_each_vcpu(i, vcpu, kvm) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm) {
>  			struct task_struct *task = NULL;
>  			struct pid *pid;
> +			i = it.idx;
>  			if (!pass && i < last_boosted_vcpu) {
>  				i = last_boosted_vcpu;
>  				continue;
> @@ -1584,6 +1594,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  			}
>  			put_task_struct(task);
>  		}
> +		rcu_read_unlock();
>  	}
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> @@ -1623,8 +1634,8 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>  {
>  	struct kvm_vcpu *vcpu = filp->private_data;
> -
> -	kvm_put_kvm(vcpu->kvm);
> +	filp->private_data = NULL;
> +	kvm_vcpu_put(vcpu);
>  	return 0;
>  }
>  
> @@ -1646,6 +1657,48 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
>  	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
>  }
>  
> +/*Can not block*/
> +void kvm_vcpu_zap(struct rcu_head *rcu)
> +{
> +	struct kvm_vcpu *vcpu = container_of(rcu, struct kvm_vcpu, head);
> +	schedule_work(&vcpu->zap_work);
> +}
> +
> +/*increase refcnt*/
> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
> +{
> +	if (vcpu == NULL)
> +		return NULL;
> +	if (atomic_add_unless(&vcpu->refcount, 1, 0))
> +		return vcpu;
> +	return NULL;
> +}
> +
> +void kvm_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm;
> +	if (atomic_dec_and_test(&vcpu->refcount)) {
> +		kvm = vcpu->kvm;
> +		mutex_lock(&kvm->lock);
> +		rcu_assign_pointer(kvm->vcpus[vcpu->vcpu_id], NULL);
> +		atomic_dec(&kvm->online_vcpus);
> +		mutex_unlock(&kvm->lock);
> +		call_rcu(&vcpu->head, kvm_vcpu_zap);
> +	}
> +}
> +
> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
> +{
> +	struct kvm_vcpu *vcpu;
> +	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	if (IS_ERR(vcpu))
> +		return vcpu;
> +
> +	atomic_set(&vcpu->refcount, 1);
> +	INIT_WORK(&vcpu->zap_work, kvm_arch_vcpu_zap);
> +	return vcpu;
> +}
> +
>  /*
>   * Creates some virtual cpus.  Good luck creating more than one.
>   */
> @@ -1653,8 +1706,9 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  {
>  	int r;
>  	struct kvm_vcpu *vcpu, *v;
> +	struct kvm_iter it;
>  
> -	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	vcpu = kvm_vcpu_create(kvm, id);
>  	if (IS_ERR(vcpu))
>  		return PTR_ERR(vcpu);
>  
> @@ -1670,11 +1724,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  		goto unlock_vcpu_destroy;
>  	}
>  
> -	kvm_for_each_vcpu(r, v, kvm)
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(it, v, kvm) {
>  		if (v->vcpu_id == id) {
> +			rcu_read_unlock();
>  			r = -EEXIST;
>  			goto unlock_vcpu_destroy;
>  		}
> +	}
> +	rcu_read_unlock();
>  
>  	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
>  
> @@ -2593,13 +2651,17 @@ static int vcpu_stat_get(void *_offset, u64 *val)
>  	unsigned offset = (long)_offset;
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> +	struct kvm_iter it;
>  
>  	*val = 0;
>  	raw_spin_lock(&kvm_lock);
> -	list_for_each_entry(kvm, &vm_list, vm_list)
> -		kvm_for_each_vcpu(i, vcpu, kvm)
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm) {
>  			*val += *(u32 *)((void *)vcpu + offset);
> +		}
> +		rcu_read_unlock();
> +	}
>  
>  	raw_spin_unlock(&kvm_lock);
>  	return 0;
> -- 
> 1.7.4.4

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-09 14:23         ` Gleb Natapov
@ 2011-12-12  2:41           ` Liu Ping Fan
  2011-12-12 12:54             ` Gleb Natapov
                               ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-12-12  2:41 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, avi, aliguori, gleb, jan.kiszka

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Currently, vcpu can be destructed only when kvm instance destroyed.
Change this to vcpu's destruction taken when its refcnt is zero,
and then vcpu MUST and CAN be destroyed before kvm's destroy.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/x86/kvm/i8254.c     |   10 ++++--
 arch/x86/kvm/i8259.c     |   12 +++++--
 arch/x86/kvm/mmu.c       |    7 ++--
 arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
 include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
 virt/kvm/irq_comm.c      |    7 +++-
 virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
 7 files changed, 170 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 76e3f1c..ac79598 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -289,7 +289,7 @@ static void pit_do_work(struct work_struct *work)
 	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
 	struct kvm *kvm = pit->kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	struct kvm_iter it;
 	struct kvm_kpit_state *ps = &pit->pit_state;
 	int inject = 0;
 
@@ -315,9 +315,13 @@ static void pit_do_work(struct work_struct *work)
 		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 		 * VCPU0, and only if its LVT0 is in EXTINT mode.
 		 */
-		if (kvm->arch.vapics_in_nmi_mode > 0)
-			kvm_for_each_vcpu(i, vcpu, kvm)
+		if (kvm->arch.vapics_in_nmi_mode > 0) {
+			rcu_read_lock();
+			kvm_for_each_vcpu(it, vcpu, kvm) {
 				kvm_apic_nmi_wd_deliver(vcpu);
+			}
+			rcu_read_unlock();
+		}
 	}
 }
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index cac4746..2186b30 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
 {
 	bool wakeup = s->wakeup_needed;
 	struct kvm_vcpu *vcpu, *found = NULL;
-	int i;
+	struct kvm *kvm = s->kvm;
+	struct kvm_iter it;
 
 	s->wakeup_needed = false;
 
 	spin_unlock(&s->lock);
 
 	if (wakeup) {
-		kvm_for_each_vcpu(i, vcpu, s->kvm) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm)
 			if (kvm_apic_accept_pic_intr(vcpu)) {
 				found = vcpu;
 				break;
 			}
-		}
 
-		if (!found)
+		if (!found) {
+			rcu_read_unlock();
 			return;
+		}
 
 		kvm_make_request(KVM_REQ_EVENT, found);
 		kvm_vcpu_kick(found);
+		rcu_read_unlock();
 	}
 }
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f1b36cf..c16887e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1833,11 +1833,12 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 
 static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
 {
-	int i;
+	struct kvm_iter it;
 	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	rcu_read_lock();
+	kvm_for_each_vcpu(it, vcpu, kvm)
 		vcpu->arch.last_pte_updated = NULL;
+	rcu_read_unlock();
 }
 
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c38efd7..a302470 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1831,10 +1831,15 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 	switch (msr) {
 	case HV_X64_MSR_VP_INDEX: {
 		int r;
+		struct kvm_iter it;
 		struct kvm_vcpu *v;
-		kvm_for_each_vcpu(r, v, vcpu->kvm)
+		struct kvm *kvm =  vcpu->kvm;
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, v, kvm) {
 			if (v == vcpu)
 				data = r;
+		}
+		rcu_read_unlock();
 		break;
 	}
 	case HV_X64_MSR_EOI:
@@ -4966,7 +4971,8 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 	struct cpufreq_freqs *freq = data;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i, send_ipi = 0;
+	int send_ipi = 0;
+	struct kvm_iter it;
 
 	/*
 	 * We allow guests to temporarily run on slowing clocks,
@@ -5016,13 +5022,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 
 	raw_spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm) {
 			if (vcpu->cpu != freq->cpu)
 				continue;
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
+		rcu_read_unlock();
 	}
 	raw_spin_unlock(&kvm_lock);
 
@@ -6433,13 +6442,17 @@ int kvm_arch_hardware_enable(void *garbage)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	struct kvm_iter it;
 
 	kvm_shared_msr_cpu_online();
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm) {
 			if (vcpu->cpu == smp_processor_id())
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		}
+		rcu_read_unlock();
+	}
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
@@ -6560,27 +6573,19 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
-{
-	unsigned int i;
-	struct kvm_vcpu *vcpu;
 
-	/*
-	 * Unpin any mmu pages first.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_unload_vcpu_mmu(vcpu);
-	}
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_arch_vcpu_free(vcpu);
 
-	mutex_lock(&kvm->lock);
-	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-		kvm->vcpus[i] = NULL;
+void kvm_arch_vcpu_zap(struct work_struct *work)
+{
+	struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
+			zap_work);
+	struct kvm *kvm = vcpu->kvm;
 
-	atomic_set(&kvm->online_vcpus, 0);
-	mutex_unlock(&kvm->lock);
+	printk(KERN_INFO "%s, zap vcpu:0x%x\n", __func__, vcpu->vcpu_id);
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+	kvm_put_kvm(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
@@ -6594,7 +6599,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d526231..2faafcb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -19,6 +19,7 @@
 #include <linux/slab.h>
 #include <linux/rcupdate.h>
 #include <linux/ratelimit.h>
+#include <linux/atomic.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -113,6 +114,8 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
+	struct rcu_head head;
+	struct work_struct zap_work;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
@@ -290,17 +293,73 @@ struct kvm {
 #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
 #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
 
+void kvm_arch_vcpu_zap(struct work_struct *work);
+
+/*search vcpu, must be protected by rcu_read_lock*/
 static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
 {
+	struct kvm_vcpu *vcpu;
 	smp_rmb();
-	return kvm->vcpus[i];
+	vcpu = rcu_dereference(kvm->vcpus[i]);
+	return vcpu;
+}
+
+/*Must be protected by RCU*/
+struct kvm_iter {
+	struct kvm *kvm;
+	int idx;
+	int cnt;
+};
+
+static inline
+struct kvm_vcpu *kvm_fev_init(struct kvm *kvm, struct kvm_iter *it)
+{
+	int idx, cnt;
+	struct kvm_vcpu *vcpup;
+	vcpup = NULL;
+	for (idx = 0, cnt = 0;
+		cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
+		idx++) {
+			vcpup = kvm_get_vcpu(kvm, idx);
+			if (unlikely(vcpup == NULL))
+				continue;
+			cnt++;
+			break;
+	}
+
+	it->kvm = kvm;
+	it->idx = idx;
+	it->cnt = cnt;
+	return vcpup;
+}
+
+static inline
+struct kvm_vcpu *kvm_fev_next(struct kvm_iter *it)
+{
+	int idx, cnt;
+	struct kvm_vcpu *vcpup;
+	struct kvm *kvm = it->kvm;
+
+	vcpup = NULL;
+	for (idx = it->idx+1, cnt = it->cnt;
+		cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
+		idx++) {
+			vcpup = kvm_get_vcpu(kvm, idx);
+			if (unlikely(vcpup == NULL))
+				continue;
+			 cnt++;
+			 break;
+	}
+
+	it->idx = idx;
+	it->cnt = cnt;
+	return vcpup;
 }
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-	for (idx = 0; \
-	     idx < atomic_read(&kvm->online_vcpus) && \
-	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
-	     idx++)
+#define kvm_for_each_vcpu(it, vcpu, kvm) \
+	for (vcpu = kvm_fev_init(kvm, &it); \
+		vcpu; \
+		vcpu = kvm_fev_next(&it))
 
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9f614b4..87eae96 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -81,14 +81,16 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq)
 {
-	int i, r = -1;
+	int r = -1;
+	struct kvm_iter it;
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 			kvm_is_dm_lowest_prio(irq))
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
 
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	rcu_read_lock();
+	kvm_for_each_vcpu(it, vcpu, kvm) {
 		if (!kvm_apic_present(vcpu))
 			continue;
 
@@ -111,6 +113,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	if (lowest)
 		r = kvm_apic_set_irq(lowest, irq);
 
+	rcu_read_unlock();
 	return r;
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..d28356a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -171,7 +171,8 @@ static void ack_flush(void *_completed)
 
 static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
-	int i, cpu, me;
+	int cpu, me;
+	struct kvm_iter it;
 	cpumask_var_t cpus;
 	bool called = true;
 	struct kvm_vcpu *vcpu;
@@ -179,7 +180,9 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
 
 	me = get_cpu();
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+
+	rcu_read_lock();
+	kvm_for_each_vcpu(it, vcpu, kvm) {
 		kvm_make_request(req, vcpu);
 		cpu = vcpu->cpu;
 
@@ -190,12 +193,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
 			cpumask_set_cpu(cpu, cpus);
 	}
+
 	if (unlikely(cpus == NULL))
 		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
 	else if (!cpumask_empty(cpus))
 		smp_call_function_many(cpus, ack_flush, NULL, 1);
 	else
 		called = false;
+	rcu_read_unlock();
+
 	put_cpu();
 	free_cpumask_var(cpus);
 	return called;
@@ -580,6 +586,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_free_vm(kvm);
 	hardware_disable_all();
 	mmdrop(mm);
+	printk(KERN_INFO "%s finished\n", __func__);
 }
 
 void kvm_get_kvm(struct kvm *kvm)
@@ -1543,6 +1550,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
 	int pass;
+	struct kvm_iter it;
 	int i;
 
 	/*
@@ -1553,9 +1561,11 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm) {
 			struct task_struct *task = NULL;
 			struct pid *pid;
+			i = it.idx;
 			if (!pass && i < last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
@@ -1584,6 +1594,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 			put_task_struct(task);
 		}
+		rcu_read_unlock();
 	}
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
@@ -1620,11 +1631,23 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+/*Can not block*/
+static void kvm_vcpu_zap(struct rcu_head *rcu)
+{
+	struct kvm_vcpu *vcpu = container_of(rcu, struct kvm_vcpu, head);
+	schedule_work(&vcpu->zap_work);
+}
+
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
-
-	kvm_put_kvm(vcpu->kvm);
+	struct kvm *kvm = vcpu->kvm;
+	filp->private_data = NULL;
+	mutex_lock(&kvm->lock);
+	rcu_assign_pointer(kvm->vcpus[vcpu->vcpu_id], NULL);
+	atomic_dec(&kvm->online_vcpus);
+	mutex_unlock(&kvm->lock);
+	call_rcu(&vcpu->head, kvm_vcpu_zap);
 	return 0;
 }
 
@@ -1646,6 +1669,16 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
 	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
 }
 
+static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
+{
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_arch_vcpu_create(kvm, id);
+	if (IS_ERR(vcpu))
+		return vcpu;
+	INIT_WORK(&vcpu->zap_work, kvm_arch_vcpu_zap);
+	return vcpu;
+}
+
 /*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
@@ -1653,8 +1686,9 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 {
 	int r;
 	struct kvm_vcpu *vcpu, *v;
+	struct kvm_iter it;
 
-	vcpu = kvm_arch_vcpu_create(kvm, id);
+	vcpu = kvm_vcpu_create(kvm, id);
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
@@ -1670,11 +1704,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm_for_each_vcpu(r, v, kvm)
+	rcu_read_lock();
+	kvm_for_each_vcpu(it, v, kvm) {
 		if (v->vcpu_id == id) {
+			rcu_read_unlock();
 			r = -EEXIST;
 			goto unlock_vcpu_destroy;
 		}
+	}
+	rcu_read_unlock();
 
 	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
 
@@ -2593,13 +2631,17 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	struct kvm_iter it;
 
 	*val = 0;
 	raw_spin_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(it, vcpu, kvm) {
 			*val += *(u32 *)((void *)vcpu + offset);
+		}
+		rcu_read_unlock();
+	}
 
 	raw_spin_unlock(&kvm_lock);
 	return 0;
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-12  2:41           ` [PATCH v3] " Liu Ping Fan
@ 2011-12-12 12:54             ` Gleb Natapov
  2011-12-13  9:29               ` Liu ping fan
  2011-12-13 11:36             ` Marcelo Tosatti
  2011-12-17  3:19             ` [PATCH v5] " Liu Ping Fan
  2 siblings, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-12-12 12:54 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, linux-kernel, avi, aliguori, jan.kiszka

On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction taken when its refcnt is zero,
> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> 
Please drop all printks that you add. You do not use rcu_assign_pointer()
during vcpu creation and BTW the code there is incorrect now. It assumed
that online_vcpus is never decremented so it is OK to put newly created
vcpu into kvm->vcpus[kvm->online_vcpus], but now it is not longer true.
We even have BUG_ON() to catch that which I believe you can trigger with
this patch by creating 3 vcpus, removing second one and then adding one
more. Moving to rculist would solve this of course, and will simplify
code that iterates over all vcpus too.

Also see below.

> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/i8254.c     |   10 ++++--
>  arch/x86/kvm/i8259.c     |   12 +++++--
>  arch/x86/kvm/mmu.c       |    7 ++--
>  arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
>  include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
>  virt/kvm/irq_comm.c      |    7 +++-
>  virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
>  7 files changed, 170 insertions(+), 53 deletions(-)
> 
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index 76e3f1c..ac79598 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -289,7 +289,7 @@ static void pit_do_work(struct work_struct *work)
>  	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
>  	struct kvm *kvm = pit->kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> +	struct kvm_iter it;
>  	struct kvm_kpit_state *ps = &pit->pit_state;
>  	int inject = 0;
>  
> @@ -315,9 +315,13 @@ static void pit_do_work(struct work_struct *work)
>  		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>  		 * VCPU0, and only if its LVT0 is in EXTINT mode.
>  		 */
> -		if (kvm->arch.vapics_in_nmi_mode > 0)
> -			kvm_for_each_vcpu(i, vcpu, kvm)
> +		if (kvm->arch.vapics_in_nmi_mode > 0) {
> +			rcu_read_lock();
> +			kvm_for_each_vcpu(it, vcpu, kvm) {
>  				kvm_apic_nmi_wd_deliver(vcpu);
> +			}
> +			rcu_read_unlock();
> +		}
>  	}
>  }
>  
> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index cac4746..2186b30 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
>  {
>  	bool wakeup = s->wakeup_needed;
>  	struct kvm_vcpu *vcpu, *found = NULL;
> -	int i;
> +	struct kvm *kvm = s->kvm;
> +	struct kvm_iter it;
>  
>  	s->wakeup_needed = false;
>  
>  	spin_unlock(&s->lock);
>  
>  	if (wakeup) {
> -		kvm_for_each_vcpu(i, vcpu, s->kvm) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm)
>  			if (kvm_apic_accept_pic_intr(vcpu)) {
>  				found = vcpu;
>  				break;
>  			}
> -		}
>  
> -		if (!found)
> +		if (!found) {
> +			rcu_read_unlock();
>  			return;
> +		}
>  
>  		kvm_make_request(KVM_REQ_EVENT, found);
>  		kvm_vcpu_kick(found);
> +		rcu_read_unlock();
>  	}
>  }
>  
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index f1b36cf..c16887e 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1833,11 +1833,12 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
>  
>  static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
>  {
> -	int i;
> +	struct kvm_iter it;
>  	struct kvm_vcpu *vcpu;
> -
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(it, vcpu, kvm)
>  		vcpu->arch.last_pte_updated = NULL;
> +	rcu_read_unlock();
>  }
>  
>  static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c38efd7..a302470 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1831,10 +1831,15 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>  	switch (msr) {
>  	case HV_X64_MSR_VP_INDEX: {
>  		int r;
> +		struct kvm_iter it;
>  		struct kvm_vcpu *v;
> -		kvm_for_each_vcpu(r, v, vcpu->kvm)
> +		struct kvm *kvm =  vcpu->kvm;
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, v, kvm) {
>  			if (v == vcpu)
>  				data = r;
> +		}
> +		rcu_read_unlock();
>  		break;
>  	}
>  	case HV_X64_MSR_EOI:
> @@ -4966,7 +4971,8 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>  	struct cpufreq_freqs *freq = data;
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i, send_ipi = 0;
> +	int send_ipi = 0;
> +	struct kvm_iter it;
>  
>  	/*
>  	 * We allow guests to temporarily run on slowing clocks,
> @@ -5016,13 +5022,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>  
>  	raw_spin_lock(&kvm_lock);
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
> -		kvm_for_each_vcpu(i, vcpu, kvm) {
> +
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm) {
>  			if (vcpu->cpu != freq->cpu)
>  				continue;
>  			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  			if (vcpu->cpu != smp_processor_id())
>  				send_ipi = 1;
>  		}
> +		rcu_read_unlock();
>  	}
>  	raw_spin_unlock(&kvm_lock);
>  
> @@ -6433,13 +6442,17 @@ int kvm_arch_hardware_enable(void *garbage)
>  {
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> +	struct kvm_iter it;
>  
>  	kvm_shared_msr_cpu_online();
> -	list_for_each_entry(kvm, &vm_list, vm_list)
> -		kvm_for_each_vcpu(i, vcpu, kvm)
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm) {
>  			if (vcpu->cpu == smp_processor_id())
>  				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +		}
> +		rcu_read_unlock();
> +	}
>  	return kvm_x86_ops->hardware_enable(garbage);
>  }
>  
> @@ -6560,27 +6573,19 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
>  	vcpu_put(vcpu);
>  }
>  
> -static void kvm_free_vcpus(struct kvm *kvm)
> -{
> -	unsigned int i;
> -	struct kvm_vcpu *vcpu;
>  
> -	/*
> -	 * Unpin any mmu pages first.
> -	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> -		kvm_clear_async_pf_completion_queue(vcpu);
> -		kvm_unload_vcpu_mmu(vcpu);
> -	}
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> -		kvm_arch_vcpu_free(vcpu);
>  
> -	mutex_lock(&kvm->lock);
> -	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
> -		kvm->vcpus[i] = NULL;
> +void kvm_arch_vcpu_zap(struct work_struct *work)
> +{
> +	struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
> +			zap_work);
> +	struct kvm *kvm = vcpu->kvm;
>  
> -	atomic_set(&kvm->online_vcpus, 0);
> -	mutex_unlock(&kvm->lock);
> +	printk(KERN_INFO "%s, zap vcpu:0x%x\n", __func__, vcpu->vcpu_id);
> +	kvm_clear_async_pf_completion_queue(vcpu);
> +	kvm_unload_vcpu_mmu(vcpu);
> +	kvm_arch_vcpu_free(vcpu);
> +	kvm_put_kvm(kvm);
>  }
>  
>  void kvm_arch_sync_events(struct kvm *kvm)
> @@ -6594,7 +6599,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_iommu_unmap_guest(kvm);
>  	kfree(kvm->arch.vpic);
>  	kfree(kvm->arch.vioapic);
> -	kvm_free_vcpus(kvm);
>  	if (kvm->arch.apic_access_page)
>  		put_page(kvm->arch.apic_access_page);
>  	if (kvm->arch.ept_identity_pagetable)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d526231..2faafcb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -19,6 +19,7 @@
>  #include <linux/slab.h>
>  #include <linux/rcupdate.h>
>  #include <linux/ratelimit.h>
> +#include <linux/atomic.h>
>  #include <asm/signal.h>
>  
>  #include <linux/kvm.h>
> @@ -113,6 +114,8 @@ enum {
>  
>  struct kvm_vcpu {
>  	struct kvm *kvm;
> +	struct rcu_head head;
> +	struct work_struct zap_work;
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	struct preempt_notifier preempt_notifier;
>  #endif
> @@ -290,17 +293,73 @@ struct kvm {
>  #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
>  #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
>  
> +void kvm_arch_vcpu_zap(struct work_struct *work);
> +
> +/*search vcpu, must be protected by rcu_read_lock*/
>  static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
>  {
> +	struct kvm_vcpu *vcpu;
>  	smp_rmb();
> -	return kvm->vcpus[i];
> +	vcpu = rcu_dereference(kvm->vcpus[i]);
> +	return vcpu;
> +}
> +
> +/*Must be protected by RCU*/
> +struct kvm_iter {
> +	struct kvm *kvm;
> +	int idx;
> +	int cnt;
> +};
> +
> +static inline
> +struct kvm_vcpu *kvm_fev_init(struct kvm *kvm, struct kvm_iter *it)
> +{
> +	int idx, cnt;
> +	struct kvm_vcpu *vcpup;
> +	vcpup = NULL;
> +	for (idx = 0, cnt = 0;
> +		cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
> +		idx++) {
> +			vcpup = kvm_get_vcpu(kvm, idx);
> +			if (unlikely(vcpup == NULL))
> +				continue;
> +			cnt++;
> +			break;
> +	}
> +
> +	it->kvm = kvm;
> +	it->idx = idx;
> +	it->cnt = cnt;
> +	return vcpup;
> +}
> +
> +static inline
> +struct kvm_vcpu *kvm_fev_next(struct kvm_iter *it)
> +{
> +	int idx, cnt;
> +	struct kvm_vcpu *vcpup;
> +	struct kvm *kvm = it->kvm;
> +
> +	vcpup = NULL;
> +	for (idx = it->idx+1, cnt = it->cnt;
> +		cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
> +		idx++) {
> +			vcpup = kvm_get_vcpu(kvm, idx);
> +			if (unlikely(vcpup == NULL))
> +				continue;
> +			 cnt++;
> +			 break;
> +	}
> +
> +	it->idx = idx;
> +	it->cnt = cnt;
> +	return vcpup;
>  }
>  
> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
> -	for (idx = 0; \
> -	     idx < atomic_read(&kvm->online_vcpus) && \
> -	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
> -	     idx++)
> +#define kvm_for_each_vcpu(it, vcpu, kvm) \
> +	for (vcpu = kvm_fev_init(kvm, &it); \
> +		vcpu; \
> +		vcpu = kvm_fev_next(&it))
>  
>  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
>  void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 9f614b4..87eae96 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -81,14 +81,16 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq)
>  {
> -	int i, r = -1;
> +	int r = -1;
> +	struct kvm_iter it;
>  	struct kvm_vcpu *vcpu, *lowest = NULL;
>  
>  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
>  			kvm_is_dm_lowest_prio(irq))
>  		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
>  
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(it, vcpu, kvm) {
>  		if (!kvm_apic_present(vcpu))
>  			continue;
>  
> @@ -111,6 +113,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  	if (lowest)
>  		r = kvm_apic_set_irq(lowest, irq);
>  
> +	rcu_read_unlock();
>  	return r;
>  }
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d9cfb78..d28356a 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -171,7 +171,8 @@ static void ack_flush(void *_completed)
>  
>  static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  {
> -	int i, cpu, me;
> +	int cpu, me;
> +	struct kvm_iter it;
>  	cpumask_var_t cpus;
>  	bool called = true;
>  	struct kvm_vcpu *vcpu;
> @@ -179,7 +180,9 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
>  
>  	me = get_cpu();
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> +
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(it, vcpu, kvm) {
>  		kvm_make_request(req, vcpu);
>  		cpu = vcpu->cpu;
>  
> @@ -190,12 +193,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
>  			cpumask_set_cpu(cpu, cpus);
>  	}
> +
>  	if (unlikely(cpus == NULL))
>  		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
>  	else if (!cpumask_empty(cpus))
>  		smp_call_function_many(cpus, ack_flush, NULL, 1);
>  	else
>  		called = false;
> +	rcu_read_unlock();
> +
>  	put_cpu();
>  	free_cpumask_var(cpus);
>  	return called;
> @@ -580,6 +586,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	kvm_arch_free_vm(kvm);
>  	hardware_disable_all();
>  	mmdrop(mm);
> +	printk(KERN_INFO "%s finished\n", __func__);
>  }
>  
>  void kvm_get_kvm(struct kvm *kvm)
> @@ -1543,6 +1550,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>  	int yielded = 0;
>  	int pass;
> +	struct kvm_iter it;
>  	int i;
>  
>  	/*
> @@ -1553,9 +1561,11 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 * We approximate round-robin by starting at the last boosted VCPU.
>  	 */
>  	for (pass = 0; pass < 2 && !yielded; pass++) {
> -		kvm_for_each_vcpu(i, vcpu, kvm) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm) {
>  			struct task_struct *task = NULL;
>  			struct pid *pid;
> +			i = it.idx;
>  			if (!pass && i < last_boosted_vcpu) {
>  				i = last_boosted_vcpu;
>  				continue;
> @@ -1584,6 +1594,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  			}
>  			put_task_struct(task);
>  		}
> +		rcu_read_unlock();
>  	}
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
> @@ -1620,11 +1631,23 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  	return 0;
>  }
>  
> +/*Can not block*/
> +static void kvm_vcpu_zap(struct rcu_head *rcu)
> +{
> +	struct kvm_vcpu *vcpu = container_of(rcu, struct kvm_vcpu, head);
> +	schedule_work(&vcpu->zap_work);
> +}
> +
>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>  {
>  	struct kvm_vcpu *vcpu = filp->private_data;
> -
> -	kvm_put_kvm(vcpu->kvm);
> +	struct kvm *kvm = vcpu->kvm;
> +	filp->private_data = NULL;
> +	mutex_lock(&kvm->lock);
> +	rcu_assign_pointer(kvm->vcpus[vcpu->vcpu_id], NULL);
vcpu->vcpu_id is not an index into vcpus array.

> +	atomic_dec(&kvm->online_vcpus);
> +	mutex_unlock(&kvm->lock);
> +	call_rcu(&vcpu->head, kvm_vcpu_zap);
>  	return 0;
>  }
>  
> @@ -1646,6 +1669,16 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
>  	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
>  }
>  
> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
> +{
> +	struct kvm_vcpu *vcpu;
> +	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	if (IS_ERR(vcpu))
> +		return vcpu;
> +	INIT_WORK(&vcpu->zap_work, kvm_arch_vcpu_zap);
> +	return vcpu;
> +}
> +
>  /*
>   * Creates some virtual cpus.  Good luck creating more than one.
>   */
> @@ -1653,8 +1686,9 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  {
>  	int r;
>  	struct kvm_vcpu *vcpu, *v;
> +	struct kvm_iter it;
>  
> -	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	vcpu = kvm_vcpu_create(kvm, id);
>  	if (IS_ERR(vcpu))
>  		return PTR_ERR(vcpu);
>  
> @@ -1670,11 +1704,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  		goto unlock_vcpu_destroy;
>  	}
>  
> -	kvm_for_each_vcpu(r, v, kvm)
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(it, v, kvm) {
>  		if (v->vcpu_id == id) {
> +			rcu_read_unlock();
>  			r = -EEXIST;
>  			goto unlock_vcpu_destroy;
>  		}
> +	}
> +	rcu_read_unlock();
>  
>  	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
>  
> @@ -2593,13 +2631,17 @@ static int vcpu_stat_get(void *_offset, u64 *val)
>  	unsigned offset = (long)_offset;
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> +	struct kvm_iter it;
>  
>  	*val = 0;
>  	raw_spin_lock(&kvm_lock);
> -	list_for_each_entry(kvm, &vm_list, vm_list)
> -		kvm_for_each_vcpu(i, vcpu, kvm)
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(it, vcpu, kvm) {
>  			*val += *(u32 *)((void *)vcpu + offset);
> +		}
> +		rcu_read_unlock();
> +	}
>  
>  	raw_spin_unlock(&kvm_lock);
>  	return 0;
> -- 
> 1.7.4.4

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-12 12:54             ` Gleb Natapov
@ 2011-12-13  9:29               ` Liu ping fan
  2011-12-13  9:47                 ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-13  9:29 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-kernel, avi, aliguori, jan.kiszka

On Mon, Dec 12, 2011 at 8:54 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction taken when its refcnt is zero,
>> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>>
> Please drop all printks that you add. You do not use rcu_assign_pointer()
> during vcpu creation and BTW the code there is incorrect now. It assumed
> that online_vcpus is never decremented so it is OK to put newly created
> vcpu into kvm->vcpus[kvm->online_vcpus], but now it is not longer true.
> We even have BUG_ON() to catch that which I believe you can trigger with
> this patch by creating 3 vcpus, removing second one and then adding one
> more. Moving to rculist would solve this of course, and will simplify
> code that iterates over all vcpus too.
>
OK, it seems unavoidable to use rculist now :-).  Just one more question, is it
useless for "case HV_X64_MSR_VP_INDEX" after adopting rculist?

Thanks and regards,
ping fan
> Also see below.
>
>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> ---
>>  arch/x86/kvm/i8254.c     |   10 ++++--
>>  arch/x86/kvm/i8259.c     |   12 +++++--
>>  arch/x86/kvm/mmu.c       |    7 ++--
>>  arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
>>  include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
>>  virt/kvm/irq_comm.c      |    7 +++-
>>  virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
>>  7 files changed, 170 insertions(+), 53 deletions(-)
>>
>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>> index 76e3f1c..ac79598 100644
>> --- a/arch/x86/kvm/i8254.c
>> +++ b/arch/x86/kvm/i8254.c
>> @@ -289,7 +289,7 @@ static void pit_do_work(struct work_struct *work)
>>       struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
>>       struct kvm *kvm = pit->kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i;
>> +     struct kvm_iter it;
>>       struct kvm_kpit_state *ps = &pit->pit_state;
>>       int inject = 0;
>>
>> @@ -315,9 +315,13 @@ static void pit_do_work(struct work_struct *work)
>>                * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>>                * VCPU0, and only if its LVT0 is in EXTINT mode.
>>                */
>> -             if (kvm->arch.vapics_in_nmi_mode > 0)
>> -                     kvm_for_each_vcpu(i, vcpu, kvm)
>> +             if (kvm->arch.vapics_in_nmi_mode > 0) {
>> +                     rcu_read_lock();
>> +                     kvm_for_each_vcpu(it, vcpu, kvm) {
>>                               kvm_apic_nmi_wd_deliver(vcpu);
>> +                     }
>> +                     rcu_read_unlock();
>> +             }
>>       }
>>  }
>>
>> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
>> index cac4746..2186b30 100644
>> --- a/arch/x86/kvm/i8259.c
>> +++ b/arch/x86/kvm/i8259.c
>> @@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
>>  {
>>       bool wakeup = s->wakeup_needed;
>>       struct kvm_vcpu *vcpu, *found = NULL;
>> -     int i;
>> +     struct kvm *kvm = s->kvm;
>> +     struct kvm_iter it;
>>
>>       s->wakeup_needed = false;
>>
>>       spin_unlock(&s->lock);
>>
>>       if (wakeup) {
>> -             kvm_for_each_vcpu(i, vcpu, s->kvm) {
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(it, vcpu, kvm)
>>                       if (kvm_apic_accept_pic_intr(vcpu)) {
>>                               found = vcpu;
>>                               break;
>>                       }
>> -             }
>>
>> -             if (!found)
>> +             if (!found) {
>> +                     rcu_read_unlock();
>>                       return;
>> +             }
>>
>>               kvm_make_request(KVM_REQ_EVENT, found);
>>               kvm_vcpu_kick(found);
>> +             rcu_read_unlock();
>>       }
>>  }
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index f1b36cf..c16887e 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1833,11 +1833,12 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
>>
>>  static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
>>  {
>> -     int i;
>> +     struct kvm_iter it;
>>       struct kvm_vcpu *vcpu;
>> -
>> -     kvm_for_each_vcpu(i, vcpu, kvm)
>> +     rcu_read_lock();
>> +     kvm_for_each_vcpu(it, vcpu, kvm)
>>               vcpu->arch.last_pte_updated = NULL;
>> +     rcu_read_unlock();
>>  }
>>
>>  static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c38efd7..a302470 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1831,10 +1831,15 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>>       switch (msr) {
>>       case HV_X64_MSR_VP_INDEX: {
>>               int r;
>> +             struct kvm_iter it;
>>               struct kvm_vcpu *v;
>> -             kvm_for_each_vcpu(r, v, vcpu->kvm)
>> +             struct kvm *kvm =  vcpu->kvm;
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(it, v, kvm) {
>>                       if (v == vcpu)
>>                               data = r;
>> +             }
>> +             rcu_read_unlock();
>>               break;
>>       }
>>       case HV_X64_MSR_EOI:
>> @@ -4966,7 +4971,8 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>>       struct cpufreq_freqs *freq = data;
>>       struct kvm *kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i, send_ipi = 0;
>> +     int send_ipi = 0;
>> +     struct kvm_iter it;
>>
>>       /*
>>        * We allow guests to temporarily run on slowing clocks,
>> @@ -5016,13 +5022,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>>
>>       raw_spin_lock(&kvm_lock);
>>       list_for_each_entry(kvm, &vm_list, vm_list) {
>> -             kvm_for_each_vcpu(i, vcpu, kvm) {
>> +
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(it, vcpu, kvm) {
>>                       if (vcpu->cpu != freq->cpu)
>>                               continue;
>>                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>                       if (vcpu->cpu != smp_processor_id())
>>                               send_ipi = 1;
>>               }
>> +             rcu_read_unlock();
>>       }
>>       raw_spin_unlock(&kvm_lock);
>>
>> @@ -6433,13 +6442,17 @@ int kvm_arch_hardware_enable(void *garbage)
>>  {
>>       struct kvm *kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i;
>> +     struct kvm_iter it;
>>
>>       kvm_shared_msr_cpu_online();
>> -     list_for_each_entry(kvm, &vm_list, vm_list)
>> -             kvm_for_each_vcpu(i, vcpu, kvm)
>> +     list_for_each_entry(kvm, &vm_list, vm_list) {
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(it, vcpu, kvm) {
>>                       if (vcpu->cpu == smp_processor_id())
>>                               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>> +             }
>> +             rcu_read_unlock();
>> +     }
>>       return kvm_x86_ops->hardware_enable(garbage);
>>  }
>>
>> @@ -6560,27 +6573,19 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
>>       vcpu_put(vcpu);
>>  }
>>
>> -static void kvm_free_vcpus(struct kvm *kvm)
>> -{
>> -     unsigned int i;
>> -     struct kvm_vcpu *vcpu;
>>
>> -     /*
>> -      * Unpin any mmu pages first.
>> -      */
>> -     kvm_for_each_vcpu(i, vcpu, kvm) {
>> -             kvm_clear_async_pf_completion_queue(vcpu);
>> -             kvm_unload_vcpu_mmu(vcpu);
>> -     }
>> -     kvm_for_each_vcpu(i, vcpu, kvm)
>> -             kvm_arch_vcpu_free(vcpu);
>>
>> -     mutex_lock(&kvm->lock);
>> -     for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
>> -             kvm->vcpus[i] = NULL;
>> +void kvm_arch_vcpu_zap(struct work_struct *work)
>> +{
>> +     struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
>> +                     zap_work);
>> +     struct kvm *kvm = vcpu->kvm;
>>
>> -     atomic_set(&kvm->online_vcpus, 0);
>> -     mutex_unlock(&kvm->lock);
>> +     printk(KERN_INFO "%s, zap vcpu:0x%x\n", __func__, vcpu->vcpu_id);
>> +     kvm_clear_async_pf_completion_queue(vcpu);
>> +     kvm_unload_vcpu_mmu(vcpu);
>> +     kvm_arch_vcpu_free(vcpu);
>> +     kvm_put_kvm(kvm);
>>  }
>>
>>  void kvm_arch_sync_events(struct kvm *kvm)
>> @@ -6594,7 +6599,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>>       kvm_iommu_unmap_guest(kvm);
>>       kfree(kvm->arch.vpic);
>>       kfree(kvm->arch.vioapic);
>> -     kvm_free_vcpus(kvm);
>>       if (kvm->arch.apic_access_page)
>>               put_page(kvm->arch.apic_access_page);
>>       if (kvm->arch.ept_identity_pagetable)
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index d526231..2faafcb 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -19,6 +19,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/rcupdate.h>
>>  #include <linux/ratelimit.h>
>> +#include <linux/atomic.h>
>>  #include <asm/signal.h>
>>
>>  #include <linux/kvm.h>
>> @@ -113,6 +114,8 @@ enum {
>>
>>  struct kvm_vcpu {
>>       struct kvm *kvm;
>> +     struct rcu_head head;
>> +     struct work_struct zap_work;
>>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>>       struct preempt_notifier preempt_notifier;
>>  #endif
>> @@ -290,17 +293,73 @@ struct kvm {
>>  #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
>>  #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
>>
>> +void kvm_arch_vcpu_zap(struct work_struct *work);
>> +
>> +/*search vcpu, must be protected by rcu_read_lock*/
>>  static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
>>  {
>> +     struct kvm_vcpu *vcpu;
>>       smp_rmb();
>> -     return kvm->vcpus[i];
>> +     vcpu = rcu_dereference(kvm->vcpus[i]);
>> +     return vcpu;
>> +}
>> +
>> +/*Must be protected by RCU*/
>> +struct kvm_iter {
>> +     struct kvm *kvm;
>> +     int idx;
>> +     int cnt;
>> +};
>> +
>> +static inline
>> +struct kvm_vcpu *kvm_fev_init(struct kvm *kvm, struct kvm_iter *it)
>> +{
>> +     int idx, cnt;
>> +     struct kvm_vcpu *vcpup;
>> +     vcpup = NULL;
>> +     for (idx = 0, cnt = 0;
>> +             cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
>> +             idx++) {
>> +                     vcpup = kvm_get_vcpu(kvm, idx);
>> +                     if (unlikely(vcpup == NULL))
>> +                             continue;
>> +                     cnt++;
>> +                     break;
>> +     }
>> +
>> +     it->kvm = kvm;
>> +     it->idx = idx;
>> +     it->cnt = cnt;
>> +     return vcpup;
>> +}
>> +
>> +static inline
>> +struct kvm_vcpu *kvm_fev_next(struct kvm_iter *it)
>> +{
>> +     int idx, cnt;
>> +     struct kvm_vcpu *vcpup;
>> +     struct kvm *kvm = it->kvm;
>> +
>> +     vcpup = NULL;
>> +     for (idx = it->idx+1, cnt = it->cnt;
>> +             cnt < atomic_read(&kvm->online_vcpus) && idx < KVM_MAX_VCPUS;
>> +             idx++) {
>> +                     vcpup = kvm_get_vcpu(kvm, idx);
>> +                     if (unlikely(vcpup == NULL))
>> +                             continue;
>> +                      cnt++;
>> +                      break;
>> +     }
>> +
>> +     it->idx = idx;
>> +     it->cnt = cnt;
>> +     return vcpup;
>>  }
>>
>> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
>> -     for (idx = 0; \
>> -          idx < atomic_read(&kvm->online_vcpus) && \
>> -          (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
>> -          idx++)
>> +#define kvm_for_each_vcpu(it, vcpu, kvm) \
>> +     for (vcpu = kvm_fev_init(kvm, &it); \
>> +             vcpu; \
>> +             vcpu = kvm_fev_next(&it))
>>
>>  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
>>  void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
>> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
>> index 9f614b4..87eae96 100644
>> --- a/virt/kvm/irq_comm.c
>> +++ b/virt/kvm/irq_comm.c
>> @@ -81,14 +81,16 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>>  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>>               struct kvm_lapic_irq *irq)
>>  {
>> -     int i, r = -1;
>> +     int r = -1;
>> +     struct kvm_iter it;
>>       struct kvm_vcpu *vcpu, *lowest = NULL;
>>
>>       if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
>>                       kvm_is_dm_lowest_prio(irq))
>>               printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
>>
>> -     kvm_for_each_vcpu(i, vcpu, kvm) {
>> +     rcu_read_lock();
>> +     kvm_for_each_vcpu(it, vcpu, kvm) {
>>               if (!kvm_apic_present(vcpu))
>>                       continue;
>>
>> @@ -111,6 +113,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>>       if (lowest)
>>               r = kvm_apic_set_irq(lowest, irq);
>>
>> +     rcu_read_unlock();
>>       return r;
>>  }
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index d9cfb78..d28356a 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -171,7 +171,8 @@ static void ack_flush(void *_completed)
>>
>>  static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>>  {
>> -     int i, cpu, me;
>> +     int cpu, me;
>> +     struct kvm_iter it;
>>       cpumask_var_t cpus;
>>       bool called = true;
>>       struct kvm_vcpu *vcpu;
>> @@ -179,7 +180,9 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>>       zalloc_cpumask_var(&cpus, GFP_ATOMIC);
>>
>>       me = get_cpu();
>> -     kvm_for_each_vcpu(i, vcpu, kvm) {
>> +
>> +     rcu_read_lock();
>> +     kvm_for_each_vcpu(it, vcpu, kvm) {
>>               kvm_make_request(req, vcpu);
>>               cpu = vcpu->cpu;
>>
>> @@ -190,12 +193,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>>                     kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
>>                       cpumask_set_cpu(cpu, cpus);
>>       }
>> +
>>       if (unlikely(cpus == NULL))
>>               smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
>>       else if (!cpumask_empty(cpus))
>>               smp_call_function_many(cpus, ack_flush, NULL, 1);
>>       else
>>               called = false;
>> +     rcu_read_unlock();
>> +
>>       put_cpu();
>>       free_cpumask_var(cpus);
>>       return called;
>> @@ -580,6 +586,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>>       kvm_arch_free_vm(kvm);
>>       hardware_disable_all();
>>       mmdrop(mm);
>> +     printk(KERN_INFO "%s finished\n", __func__);
>>  }
>>
>>  void kvm_get_kvm(struct kvm *kvm)
>> @@ -1543,6 +1550,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>       int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>>       int yielded = 0;
>>       int pass;
>> +     struct kvm_iter it;
>>       int i;
>>
>>       /*
>> @@ -1553,9 +1561,11 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>        * We approximate round-robin by starting at the last boosted VCPU.
>>        */
>>       for (pass = 0; pass < 2 && !yielded; pass++) {
>> -             kvm_for_each_vcpu(i, vcpu, kvm) {
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(it, vcpu, kvm) {
>>                       struct task_struct *task = NULL;
>>                       struct pid *pid;
>> +                     i = it.idx;
>>                       if (!pass && i < last_boosted_vcpu) {
>>                               i = last_boosted_vcpu;
>>                               continue;
>> @@ -1584,6 +1594,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>                       }
>>                       put_task_struct(task);
>>               }
>> +             rcu_read_unlock();
>>       }
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>> @@ -1620,11 +1631,23 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>>       return 0;
>>  }
>>
>> +/*Can not block*/
>> +static void kvm_vcpu_zap(struct rcu_head *rcu)
>> +{
>> +     struct kvm_vcpu *vcpu = container_of(rcu, struct kvm_vcpu, head);
>> +     schedule_work(&vcpu->zap_work);
>> +}
>> +
>>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>>  {
>>       struct kvm_vcpu *vcpu = filp->private_data;
>> -
>> -     kvm_put_kvm(vcpu->kvm);
>> +     struct kvm *kvm = vcpu->kvm;
>> +     filp->private_data = NULL;
>> +     mutex_lock(&kvm->lock);
>> +     rcu_assign_pointer(kvm->vcpus[vcpu->vcpu_id], NULL);
> vcpu->vcpu_id is not an index into vcpus array.
>
>> +     atomic_dec(&kvm->online_vcpus);
>> +     mutex_unlock(&kvm->lock);
>> +     call_rcu(&vcpu->head, kvm_vcpu_zap);
>>       return 0;
>>  }
>>
>> @@ -1646,6 +1669,16 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
>>       return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
>>  }
>>
>> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
>> +{
>> +     struct kvm_vcpu *vcpu;
>> +     vcpu = kvm_arch_vcpu_create(kvm, id);
>> +     if (IS_ERR(vcpu))
>> +             return vcpu;
>> +     INIT_WORK(&vcpu->zap_work, kvm_arch_vcpu_zap);
>> +     return vcpu;
>> +}
>> +
>>  /*
>>   * Creates some virtual cpus.  Good luck creating more than one.
>>   */
>> @@ -1653,8 +1686,9 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>>  {
>>       int r;
>>       struct kvm_vcpu *vcpu, *v;
>> +     struct kvm_iter it;
>>
>> -     vcpu = kvm_arch_vcpu_create(kvm, id);
>> +     vcpu = kvm_vcpu_create(kvm, id);
>>       if (IS_ERR(vcpu))
>>               return PTR_ERR(vcpu);
>>
>> @@ -1670,11 +1704,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>>               goto unlock_vcpu_destroy;
>>       }
>>
>> -     kvm_for_each_vcpu(r, v, kvm)
>> +     rcu_read_lock();
>> +     kvm_for_each_vcpu(it, v, kvm) {
>>               if (v->vcpu_id == id) {
>> +                     rcu_read_unlock();
>>                       r = -EEXIST;
>>                       goto unlock_vcpu_destroy;
>>               }
>> +     }
>> +     rcu_read_unlock();
>>
>>       BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
>>
>> @@ -2593,13 +2631,17 @@ static int vcpu_stat_get(void *_offset, u64 *val)
>>       unsigned offset = (long)_offset;
>>       struct kvm *kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i;
>> +     struct kvm_iter it;
>>
>>       *val = 0;
>>       raw_spin_lock(&kvm_lock);
>> -     list_for_each_entry(kvm, &vm_list, vm_list)
>> -             kvm_for_each_vcpu(i, vcpu, kvm)
>> +     list_for_each_entry(kvm, &vm_list, vm_list) {
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(it, vcpu, kvm) {
>>                       *val += *(u32 *)((void *)vcpu + offset);
>> +             }
>> +             rcu_read_unlock();
>> +     }
>>
>>       raw_spin_unlock(&kvm_lock);
>>       return 0;
>> --
>> 1.7.4.4
>
> --
>                        Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-13  9:29               ` Liu ping fan
@ 2011-12-13  9:47                 ` Gleb Natapov
  0 siblings, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-12-13  9:47 UTC (permalink / raw)
  To: Liu ping fan; +Cc: kvm, linux-kernel, avi, aliguori, jan.kiszka

On Tue, Dec 13, 2011 at 05:29:50PM +0800, Liu ping fan wrote:
> On Mon, Dec 12, 2011 at 8:54 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >>
> >> Currently, vcpu can be destructed only when kvm instance destroyed.
> >> Change this to vcpu's destruction taken when its refcnt is zero,
> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> >>
> > Please drop all printks that you add. You do not use rcu_assign_pointer()
> > during vcpu creation and BTW the code there is incorrect now. It assumed
> > that online_vcpus is never decremented so it is OK to put newly created
> > vcpu into kvm->vcpus[kvm->online_vcpus], but now it is not longer true.
> > We even have BUG_ON() to catch that which I believe you can trigger with
> > this patch by creating 3 vcpus, removing second one and then adding one
> > more. Moving to rculist would solve this of course, and will simplify
> > code that iterates over all vcpus too.
> >
> OK, it seems unavoidable to use rculist now :-).  Just one more question, is it
> useless for "case HV_X64_MSR_VP_INDEX" after adopting rculist?
> 
Windows does not support cpu hot-unplug IIRC. Just return the index of
the vcpu in the vcpus list there.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-12  2:41           ` [PATCH v3] " Liu Ping Fan
  2011-12-12 12:54             ` Gleb Natapov
@ 2011-12-13 11:36             ` Marcelo Tosatti
  2011-12-13 11:54               ` Gleb Natapov
  2011-12-15  3:21               ` Liu ping fan
  2011-12-17  3:19             ` [PATCH v5] " Liu Ping Fan
  2 siblings, 2 replies; 78+ messages in thread
From: Marcelo Tosatti @ 2011-12-13 11:36 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, linux-kernel, avi, aliguori, gleb, jan.kiszka

On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction taken when its refcnt is zero,
> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> 
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/i8254.c     |   10 ++++--
>  arch/x86/kvm/i8259.c     |   12 +++++--
>  arch/x86/kvm/mmu.c       |    7 ++--
>  arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
>  include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
>  virt/kvm/irq_comm.c      |    7 +++-
>  virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
>  7 files changed, 170 insertions(+), 53 deletions(-)

This needs a full audit of paths that access vcpus. See for one example
bsp_vcpu pointer.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-13 11:36             ` Marcelo Tosatti
@ 2011-12-13 11:54               ` Gleb Natapov
  2011-12-15  3:21               ` Liu ping fan
  1 sibling, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-12-13 11:54 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Liu Ping Fan, kvm, linux-kernel, avi, aliguori, jan.kiszka

On Tue, Dec 13, 2011 at 09:36:28AM -0200, Marcelo Tosatti wrote:
> On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
> > From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> > 
> > Currently, vcpu can be destructed only when kvm instance destroyed.
> > Change this to vcpu's destruction taken when its refcnt is zero,
> > and then vcpu MUST and CAN be destroyed before kvm's destroy.
> > 
> > Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> > ---
> >  arch/x86/kvm/i8254.c     |   10 ++++--
> >  arch/x86/kvm/i8259.c     |   12 +++++--
> >  arch/x86/kvm/mmu.c       |    7 ++--
> >  arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
> >  include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
> >  virt/kvm/irq_comm.c      |    7 +++-
> >  virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
> >  7 files changed, 170 insertions(+), 53 deletions(-)
> 
> This needs a full audit of paths that access vcpus. See for one example
> bsp_vcpu pointer.
> 
Yes. For now we should probably disallow removal of bsp, but we need to
get rid of bsp_vcpu at all. It is used only in two places. IOAPIC can be
changed to use kvm->bsp_vcpu_id and pic's use looks incorrect anyway.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-13 11:36             ` Marcelo Tosatti
  2011-12-13 11:54               ` Gleb Natapov
@ 2011-12-15  3:21               ` Liu ping fan
  2011-12-15  4:28                 ` [PATCH v4] " Liu Ping Fan
  2011-12-15  8:33                 ` [PATCH v3] " Gleb Natapov
  1 sibling, 2 replies; 78+ messages in thread
From: Liu ping fan @ 2011-12-15  3:21 UTC (permalink / raw)
  To: Marcelo Tosatti, gleb; +Cc: kvm, linux-kernel, avi, aliguori, jan.kiszka

On Tue, Dec 13, 2011 at 7:36 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction taken when its refcnt is zero,
>> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>>
>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> ---
>>  arch/x86/kvm/i8254.c     |   10 ++++--
>>  arch/x86/kvm/i8259.c     |   12 +++++--
>>  arch/x86/kvm/mmu.c       |    7 ++--
>>  arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
>>  include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
>>  virt/kvm/irq_comm.c      |    7 +++-
>>  virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
>>  7 files changed, 170 insertions(+), 53 deletions(-)
>
> This needs a full audit of paths that access vcpus. See for one example
> bsp_vcpu pointer.
>
Yes, I had missed it and just paid attention to the access path to
vcpu in kvm_lapic and the path used in async_pf. I will correct it
later.
BTW, I want to make it sure that because kvm_lapic will be destroyed
before vcpu, so  it is safe to bypass the access path there, and the
situation is the same in async_pf for we have called
kvm_clear_async_pf_completion_queue before zapping vcpu.  Am I right?

As to the scene like bsp_vcpu, I think that introducing refcount like
in V2 can handle it easier. Please help to review these changes in V4
which I will send a little later.

Thanks and regards
ping fan
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  3:21               ` Liu ping fan
@ 2011-12-15  4:28                 ` Liu Ping Fan
  2011-12-15  5:33                   ` Xiao Guangrong
                                     ` (2 more replies)
  2011-12-15  8:33                 ` [PATCH v3] " Gleb Natapov
  1 sibling, 3 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-12-15  4:28 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, avi, aliguori, gleb, mtosatti, jan.kiszka

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Currently, vcpu can be destructed only when kvm instance destroyed.
Change this to vcpu's destruction before kvm instance, so vcpu MUST
and CAN be destroyed before kvm's destroy.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/x86/kvm/i8254.c     |    8 ++-
 arch/x86/kvm/i8259.c     |   11 +++--
 arch/x86/kvm/mmu.c       |    5 +-
 arch/x86/kvm/x86.c       |   50 ++++++++---------
 include/linux/kvm_host.h |   27 +++++----
 virt/kvm/irq_comm.c      |    6 ++-
 virt/kvm/kvm_main.c      |  131 ++++++++++++++++++++++++++++++++++++----------
 7 files changed, 161 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 76e3f1c..b8990ca 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -289,7 +289,6 @@ static void pit_do_work(struct work_struct *work)
 	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
 	struct kvm *kvm = pit->kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
 	struct kvm_kpit_state *ps = &pit->pit_state;
 	int inject = 0;
 
@@ -315,9 +314,12 @@ static void pit_do_work(struct work_struct *work)
 		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 		 * VCPU0, and only if its LVT0 is in EXTINT mode.
 		 */
-		if (kvm->arch.vapics_in_nmi_mode > 0)
-			kvm_for_each_vcpu(i, vcpu, kvm)
+		if (kvm->arch.vapics_in_nmi_mode > 0) {
+			rcu_read_lock();
+			kvm_for_each_vcpu(vcpu, kvm)
 				kvm_apic_nmi_wd_deliver(vcpu);
+			rcu_read_unlock();
+		}
 	}
 }
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index cac4746..f275b8c 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -50,25 +50,28 @@ static void pic_unlock(struct kvm_pic *s)
 {
 	bool wakeup = s->wakeup_needed;
 	struct kvm_vcpu *vcpu, *found = NULL;
-	int i;
+	struct kvm *kvm = s->kvm;
 
 	s->wakeup_needed = false;
 
 	spin_unlock(&s->lock);
 
 	if (wakeup) {
-		kvm_for_each_vcpu(i, vcpu, s->kvm) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(vcpu, kvm)
 			if (kvm_apic_accept_pic_intr(vcpu)) {
 				found = vcpu;
 				break;
 			}
-		}
 
-		if (!found)
+		if (!found) {
+			rcu_read_unlock();
 			return;
+		}
 
 		kvm_make_request(KVM_REQ_EVENT, found);
 		kvm_vcpu_kick(found);
+		rcu_read_unlock();
 	}
 }
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f1b36cf..ba082cd 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1833,11 +1833,12 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
 
 static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
 {
-	int i;
 	struct kvm_vcpu *vcpu;
 
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	rcu_read_lock();
+	kvm_for_each_vcpu(vcpu, kvm)
 		vcpu->arch.last_pte_updated = NULL;
+	rcu_read_unlock();
 }
 
 static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c38efd7..acaa154 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1830,11 +1830,13 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 
 	switch (msr) {
 	case HV_X64_MSR_VP_INDEX: {
-		int r;
+		int r = 0;
 		struct kvm_vcpu *v;
-		kvm_for_each_vcpu(r, v, vcpu->kvm)
+		kvm_for_each_vcpu(v, vcpu->kvm) {
 			if (v == vcpu)
 				data = r;
+			r++;
+		}
 		break;
 	}
 	case HV_X64_MSR_EOI:
@@ -4966,7 +4968,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 	struct cpufreq_freqs *freq = data;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i, send_ipi = 0;
+	int send_ipi = 0;
 
 	/*
 	 * We allow guests to temporarily run on slowing clocks,
@@ -5016,13 +5018,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 
 	raw_spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(vcpu, kvm) {
 			if (vcpu->cpu != freq->cpu)
 				continue;
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
+		rcu_read_unlock();
+
 	}
 	raw_spin_unlock(&kvm_lock);
 
@@ -6433,13 +6438,16 @@ int kvm_arch_hardware_enable(void *garbage)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
 
 	kvm_shared_msr_cpu_online();
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(vcpu, kvm) {
 			if (vcpu->cpu == smp_processor_id())
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		}
+		rcu_read_unlock();
+	}
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
@@ -6560,27 +6568,18 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
-{
-	unsigned int i;
-	struct kvm_vcpu *vcpu;
 
-	/*
-	 * Unpin any mmu pages first.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_unload_vcpu_mmu(vcpu);
-	}
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_arch_vcpu_free(vcpu);
 
-	mutex_lock(&kvm->lock);
-	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-		kvm->vcpus[i] = NULL;
+void kvm_arch_vcpu_zap(struct work_struct *work)
+{
+	struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
+			zap_work);
+	struct kvm *kvm = vcpu->kvm;
 
-	atomic_set(&kvm->online_vcpus, 0);
-	mutex_unlock(&kvm->lock);
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+	kvm_put_kvm(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
@@ -6594,7 +6593,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d526231..733de1c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -19,6 +19,7 @@
 #include <linux/slab.h>
 #include <linux/rcupdate.h>
 #include <linux/ratelimit.h>
+#include <linux/atomic.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -113,6 +114,10 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
+	atomic_t refcount;
+	struct list_head list;
+	struct rcu_head head;
+	struct work_struct zap_work;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
@@ -241,9 +246,9 @@ struct kvm {
 	u32 bsp_vcpu_id;
 	struct kvm_vcpu *bsp_vcpu;
 #endif
-	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+	struct list_head vcpus;
 	atomic_t online_vcpus;
-	int last_boosted_vcpu;
+	struct kvm_vcpu *last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
@@ -290,17 +295,15 @@ struct kvm {
 #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
 #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
 
-static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
-{
-	smp_rmb();
-	return kvm->vcpus[i];
-}
+struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
+void kvm_vcpu_put(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_zap(struct work_struct *work);
+
+#define kvm_for_each_vcpu(vcpu, kvm) \
+	list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-	for (idx = 0; \
-	     idx < atomic_read(&kvm->online_vcpus) && \
-	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
-	     idx++)
+#define kvm_for_each_vcpu_continue(vcpu, kvm) \
+	list_for_each_entry_continue_rcu(vcpu, &kvm->vcpus, list)
 
 int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id);
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9f614b4..1d0c3ab 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -81,14 +81,15 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq)
 {
-	int i, r = -1;
+	int r = -1;
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 			kvm_is_dm_lowest_prio(irq))
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
 
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	rcu_read_lock();
+	kvm_for_each_vcpu(vcpu, kvm) {
 		if (!kvm_apic_present(vcpu))
 			continue;
 
@@ -111,6 +112,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	if (lowest)
 		r = kvm_apic_set_irq(lowest, irq);
 
+	rcu_read_unlock();
 	return r;
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d9cfb78..71dda47 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -141,6 +141,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
+	kvm_vcpu_get(vcpu);
 	mutex_lock(&vcpu->mutex);
 	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
 		/* The thread running this VCPU changed. */
@@ -163,6 +164,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 	preempt_notifier_unregister(&vcpu->preempt_notifier);
 	preempt_enable();
 	mutex_unlock(&vcpu->mutex);
+	kvm_vcpu_put(vcpu);
 }
 
 static void ack_flush(void *_completed)
@@ -171,7 +173,7 @@ static void ack_flush(void *_completed)
 
 static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
-	int i, cpu, me;
+	int cpu, me;
 	cpumask_var_t cpus;
 	bool called = true;
 	struct kvm_vcpu *vcpu;
@@ -179,7 +181,8 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
 
 	me = get_cpu();
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	rcu_read_lock();
+	kvm_for_each_vcpu(vcpu, kvm) {
 		kvm_make_request(req, vcpu);
 		cpu = vcpu->cpu;
 
@@ -190,12 +193,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
 			cpumask_set_cpu(cpu, cpus);
 	}
+	rcu_read_unlock();
+
 	if (unlikely(cpus == NULL))
 		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
 	else if (!cpumask_empty(cpus))
 		smp_call_function_many(cpus, ack_flush, NULL, 1);
 	else
 		called = false;
+
 	put_cpu();
 	free_cpumask_var(cpus);
 	return called;
@@ -490,6 +496,7 @@ static struct kvm *kvm_create_vm(void)
 	raw_spin_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	raw_spin_unlock(&kvm_lock);
+	INIT_LIST_HEAD(&kvm->vcpus);
 
 	return kvm;
 
@@ -600,6 +607,7 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 {
 	struct kvm *kvm = filp->private_data;
 
+	kvm_vcpu_put(kvm->bsp_vcpu);
 	kvm_irqfd_release(kvm);
 
 	kvm_put_kvm(kvm);
@@ -1539,12 +1547,10 @@ EXPORT_SYMBOL_GPL(kvm_resched);
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
-	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
-	int yielded = 0;
-	int pass;
-	int i;
-
+	struct kvm_vcpu *vcpu, *v;
+	struct task_struct *task = NULL;
+	struct pid *pid;
+	int pass, firststart, lastone, yielded;
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
@@ -1552,15 +1558,22 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			struct task_struct *task = NULL;
-			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
-				i = last_boosted_vcpu;
+	for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
+
+		rcu_read_lock();
+		kvm_for_each_vcpu(vcpu, kvm) {
+			if (!pass && !firststart &&
+			    vcpu != kvm->last_boosted_vcpu &&
+			    kvm->last_boosted_vcpu != NULL) {
+				vcpu = kvm->last_boosted_vcpu;
+				firststart = 1;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass && !lastone) {
+				if (vcpu == kvm->last_boosted_vcpu)
+					lastone = 1;
+			} else if (pass && lastone)
 				break;
+
 			if (vcpu == me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
@@ -1576,15 +1589,29 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				put_task_struct(task);
 				continue;
 			}
+			v = kvm_vcpu_get(vcpu);
+			if (v == NULL)
+				continue;
+
+			rcu_read_unlock();
 			if (yield_to(task, 1)) {
 				put_task_struct(task);
-				kvm->last_boosted_vcpu = i;
+				mutex_lock(&kvm->lock);
+				/*Remeber to release it.*/
+				if (kvm->last_boosted_vcpu != NULL)
+					kvm_vcpu_put(kvm->last_boosted_vcpu);
+				kvm->last_boosted_vcpu = vcpu;
+				mutex_unlock(&kvm->lock);
 				yielded = 1;
 				break;
 			}
+			kvm_vcpu_put(vcpu);
 			put_task_struct(task);
+			rcu_read_lock();
 		}
+		rcu_read_unlock();
 	}
+
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
@@ -1620,11 +1647,18 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+/*Can not block*/
+static void kvm_vcpu_zap(struct rcu_head *rcu)
+{
+	struct kvm_vcpu *vcpu = container_of(rcu, struct kvm_vcpu, head);
+	schedule_work(&vcpu->zap_work);
+}
+
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
-
-	kvm_put_kvm(vcpu->kvm);
+	filp->private_data = NULL;
+	kvm_vcpu_put(vcpu);
 	return 0;
 }
 
@@ -1646,6 +1680,43 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
 	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
 }
 
+struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
+{
+	if (vcpu == NULL)
+		return NULL;
+	if (atomic_add_unless(&vcpu->refcount, 1, 0))
+		return vcpu;
+	return NULL;
+}
+
+void kvm_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm;
+	if (atomic_dec_and_test(&vcpu->refcount)) {
+		kvm = vcpu->kvm;
+		mutex_lock(&kvm->lock);
+		list_del_rcu(&vcpu->list);
+		atomic_dec(&kvm->online_vcpus);
+		if (kvm->last_boosted_vcpu == vcpu)
+			kvm->last_boosted_vcpu = NULL;
+		mutex_unlock(&kvm->lock);
+
+		call_rcu(&vcpu->head, kvm_vcpu_zap);
+	}
+}
+
+static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
+{
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_arch_vcpu_create(kvm, id);
+	if (IS_ERR(vcpu))
+		return vcpu;
+	atomic_set(&vcpu->refcount, 1);
+	INIT_LIST_HEAD(&vcpu->list);
+	INIT_WORK(&vcpu->zap_work, kvm_arch_vcpu_zap);
+	return vcpu;
+}
+
 /*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
@@ -1654,7 +1725,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	int r;
 	struct kvm_vcpu *vcpu, *v;
 
-	vcpu = kvm_arch_vcpu_create(kvm, id);
+	vcpu = kvm_vcpu_create(kvm, id);
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
@@ -1670,13 +1741,14 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm_for_each_vcpu(r, v, kvm)
+	rcu_read_lock();
+	kvm_for_each_vcpu(v, kvm) {
 		if (v->vcpu_id == id) {
 			r = -EEXIST;
 			goto unlock_vcpu_destroy;
 		}
-
-	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
+	}
+	rcu_read_unlock();
 
 	/* Now it's all set up, let userspace reach it */
 	kvm_get_kvm(kvm);
@@ -1686,13 +1758,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
+	/*Protected by kvm->lock*/
+	list_add_rcu(&vcpu->list, &kvm->vcpus);
+
 	smp_wmb();
 	atomic_inc(&kvm->online_vcpus);
 
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
 	if (kvm->bsp_vcpu_id == id)
-		kvm->bsp_vcpu = vcpu;
+		kvm->bsp_vcpu = kvm_vcpu_get(vcpu);
 #endif
 	mutex_unlock(&kvm->lock);
 	return r;
@@ -2593,13 +2667,15 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
 
 	*val = 0;
 	raw_spin_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		rcu_read_lock();
+		kvm_for_each_vcpu(vcpu, kvm)
 			*val += *(u32 *)((void *)vcpu + offset);
+		rcu_read_unlock();
+	}
 
 	raw_spin_unlock(&kvm_lock);
 	return 0;
@@ -2765,7 +2841,6 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	kvm_preempt_ops.sched_out = kvm_sched_out;
 
 	kvm_init_debug();
-
 	return 0;
 
 out_unreg:
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  4:28                 ` [PATCH v4] " Liu Ping Fan
@ 2011-12-15  5:33                   ` Xiao Guangrong
  2011-12-15  6:53                     ` Liu ping fan
  2011-12-15  6:48                   ` Takuya Yoshikawa
  2011-12-15  9:10                   ` Gleb Natapov
  2 siblings, 1 reply; 78+ messages in thread
From: Xiao Guangrong @ 2011-12-15  5:33 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti, jan.kiszka

On 12/15/2011 12:28 PM, Liu Ping Fan wrote:


> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1833,11 +1833,12 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
> 
>  static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
>  {
> -	int i;
>  	struct kvm_vcpu *vcpu;
> 
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> +	rcu_read_lock();
> +	kvm_for_each_vcpu(vcpu, kvm)
>  		vcpu->arch.last_pte_updated = NULL;
> +	rcu_read_unlock();
>  }
> 


I am sure that you should rebase it on the current kvm tree.

>  static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c38efd7..acaa154 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1830,11 +1830,13 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> 
>  	switch (msr) {
>  	case HV_X64_MSR_VP_INDEX: {
> -		int r;
> +		int r = 0;
>  		struct kvm_vcpu *v;
> -		kvm_for_each_vcpu(r, v, vcpu->kvm)
> +		kvm_for_each_vcpu(v, vcpu->kvm) {
>  			if (v == vcpu)
>  				data = r;
> +			r++;
> +		}


Do not need rcu_lock?

> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_put(struct kvm_vcpu *vcpu);
> +void kvm_arch_vcpu_zap(struct work_struct *work);
> +
> +#define kvm_for_each_vcpu(vcpu, kvm) \
> +	list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
> 
> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
> -	for (idx = 0; \
> -	     idx < atomic_read(&kvm->online_vcpus) && \
> -	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
> -	     idx++)
> +#define kvm_for_each_vcpu_continue(vcpu, kvm) \
> +	list_for_each_entry_continue_rcu(vcpu, &kvm->vcpus, list)
> 


Where is it used?

> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
> +{
> +	if (vcpu == NULL)
> +		return NULL;
> +	if (atomic_add_unless(&vcpu->refcount, 1, 0))


Why do not use atomic_inc()?
Also, i think a memory barrier is needed after increasing refcount.

> -	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
> +	/*Protected by kvm->lock*/
> +	list_add_rcu(&vcpu->list, &kvm->vcpus);
> +
>  	smp_wmb();


This barrier can also be removed.

>  	atomic_inc(&kvm->online_vcpus);
> 
>  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>  	if (kvm->bsp_vcpu_id == id)
> -		kvm->bsp_vcpu = vcpu;
> +		kvm->bsp_vcpu = kvm_vcpu_get(vcpu);
>  #endif
>  	mutex_unlock(&kvm->lock);
>  	return r;
> @@ -2593,13 +2667,15 @@ static int vcpu_stat_get(void *_offset, u64 *val)
>  	unsigned offset = (long)_offset;
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> 
>  	*val = 0;
>  	raw_spin_lock(&kvm_lock);
> -	list_for_each_entry(kvm, &vm_list, vm_list)
> -		kvm_for_each_vcpu(i, vcpu, kvm)
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(vcpu, kvm)
>  			*val += *(u32 *)((void *)vcpu + offset);
> +		rcu_read_unlock();
> +	}
> 
>  	raw_spin_unlock(&kvm_lock);
>  	return 0;
> @@ -2765,7 +2841,6 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
>  	kvm_preempt_ops.sched_out = kvm_sched_out;
> 
>  	kvm_init_debug();
> -


You don not change anything, please do not touch this line.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  4:28                 ` [PATCH v4] " Liu Ping Fan
  2011-12-15  5:33                   ` Xiao Guangrong
@ 2011-12-15  6:48                   ` Takuya Yoshikawa
  2011-12-16  9:38                     ` Marcelo Tosatti
  2011-12-17  3:57                     ` Liu ping fan
  2011-12-15  9:10                   ` Gleb Natapov
  2 siblings, 2 replies; 78+ messages in thread
From: Takuya Yoshikawa @ 2011-12-15  6:48 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti, jan.kiszka

(2011/12/15 13:28), Liu Ping Fan wrote:
> From: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
> 
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction before kvm instance, so vcpu MUST
> and CAN be destroyed before kvm's destroy.

Could you explain why this change is needed here?
Would be helpful for those, including me, who will read the commit later.

> 
> Signed-off-by: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
> ---

...

> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index cac4746..f275b8c 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -50,25 +50,28 @@ static void pic_unlock(struct kvm_pic *s)
>   {
>   	bool wakeup = s->wakeup_needed;
>   	struct kvm_vcpu *vcpu, *found = NULL;
> -	int i;
> +	struct kvm *kvm = s->kvm;
> 
>   	s->wakeup_needed = false;
> 
>   	spin_unlock(&s->lock);
> 
>   	if (wakeup) {
> -		kvm_for_each_vcpu(i, vcpu, s->kvm) {
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(vcpu, kvm)
>   			if (kvm_apic_accept_pic_intr(vcpu)) {
>   				found = vcpu;
>   				break;
>   			}
> -		}
> 
> -		if (!found)
> +		if (!found) {
> +			rcu_read_unlock();
>   			return;
> +		}
> 
>   		kvm_make_request(KVM_REQ_EVENT, found);
>   		kvm_vcpu_kick(found);
> +		rcu_read_unlock();
>   	}
>   }

How about this? (just about stylistic issues)

	if (!wakeup)
		return;

	rcu_read_lock();
	kvm_for_each_vcpu(vcpu, kvm)
		if (kvm_apic_accept_pic_intr(vcpu)) {
			found = vcpu;
			break;
		}

	if (!found)
		goto out;

	kvm_make_request(KVM_REQ_EVENT, found);
	kvm_vcpu_kick(found);
out:
	rcu_read_unlock();

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c

...

> +void kvm_arch_vcpu_zap(struct work_struct *work)
> +{
> +	struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
> +			zap_work);
> +	struct kvm *kvm = vcpu->kvm;
> 
> -	atomic_set(&kvm->online_vcpus, 0);
> -	mutex_unlock(&kvm->lock);
> +	kvm_clear_async_pf_completion_queue(vcpu);
> +	kvm_unload_vcpu_mmu(vcpu);
> +	kvm_arch_vcpu_free(vcpu);
> +	kvm_put_kvm(kvm);
>   }

zap is really a good name for this?

...

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d526231..733de1c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -19,6 +19,7 @@
>   #include<linux/slab.h>
>   #include<linux/rcupdate.h>
>   #include<linux/ratelimit.h>
> +#include<linux/atomic.h>
>   #include<asm/signal.h>
> 
>   #include<linux/kvm.h>
> @@ -113,6 +114,10 @@ enum {
> 
>   struct kvm_vcpu {
>   	struct kvm *kvm;
> +	atomic_t refcount;
> +	struct list_head list;
> +	struct rcu_head head;
> +	struct work_struct zap_work;

How about adding some comments?
zap_work is not at all self explanatory, IMO.


>   #ifdef CONFIG_PREEMPT_NOTIFIERS
>   	struct preempt_notifier preempt_notifier;
>   #endif
> @@ -241,9 +246,9 @@ struct kvm {
>   	u32 bsp_vcpu_id;
>   	struct kvm_vcpu *bsp_vcpu;
>   #endif
> -	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> +	struct list_head vcpus;
>   	atomic_t online_vcpus;
> -	int last_boosted_vcpu;
> +	struct kvm_vcpu *last_boosted_vcpu;
>   	struct list_head vm_list;
>   	struct mutex lock;
>   	struct kvm_io_bus *buses[KVM_NR_BUSES];
> @@ -290,17 +295,15 @@ struct kvm {
>   #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
>   #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
> 
> -static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
> -{
> -	smp_rmb();
> -	return kvm->vcpus[i];
> -}
> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_put(struct kvm_vcpu *vcpu);
> +void kvm_arch_vcpu_zap(struct work_struct *work);
> +
> +#define kvm_for_each_vcpu(vcpu, kvm) \
> +	list_for_each_entry_rcu(vcpu,&kvm->vcpus, list)

Is this macro really worth it?
_rcu shows readers important information, I think.

> 
> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
> -	for (idx = 0; \
> -	     idx<  atomic_read(&kvm->online_vcpus)&&  \
> -	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
> -	     idx++)
> +#define kvm_for_each_vcpu_continue(vcpu, kvm) \
> +	list_for_each_entry_continue_rcu(vcpu,&kvm->vcpus, list)

Same here.
Why do you want to hide _rcu from readers?


	Takuya

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  5:33                   ` Xiao Guangrong
@ 2011-12-15  6:53                     ` Liu ping fan
  2011-12-15  8:25                       ` Xiao Guangrong
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-15  6:53 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti, jan.kiszka

On Thu, Dec 15, 2011 at 1:33 PM, Xiao Guangrong
<xiaoguangrong@linux.vnet.ibm.com> wrote:
> On 12/15/2011 12:28 PM, Liu Ping Fan wrote:
>
>
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1833,11 +1833,12 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
>>
>>  static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
>>  {
>> -     int i;
>>       struct kvm_vcpu *vcpu;
>>
>> -     kvm_for_each_vcpu(i, vcpu, kvm)
>> +     rcu_read_lock();
>> +     kvm_for_each_vcpu(vcpu, kvm)
>>               vcpu->arch.last_pte_updated = NULL;
>> +     rcu_read_unlock();
>>  }
>>
>
>
> I am sure that you should rebase it on the current kvm tree.
>
OK, I will rebase it in next patch

>>  static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index c38efd7..acaa154 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1830,11 +1830,13 @@ static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>>
>>       switch (msr) {
>>       case HV_X64_MSR_VP_INDEX: {
>> -             int r;
>> +             int r = 0;
>>               struct kvm_vcpu *v;
>> -             kvm_for_each_vcpu(r, v, vcpu->kvm)
>> +             kvm_for_each_vcpu(v, vcpu->kvm) {
>>                       if (v == vcpu)
>>                               data = r;
>> +                     r++;
>> +             }
>
>
> Do not need rcu_lock?
>
Need! Sorry, forget.

>> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
>> +void kvm_vcpu_put(struct kvm_vcpu *vcpu);
>> +void kvm_arch_vcpu_zap(struct work_struct *work);
>> +
>> +#define kvm_for_each_vcpu(vcpu, kvm) \
>> +     list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
>>
>> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
>> -     for (idx = 0; \
>> -          idx < atomic_read(&kvm->online_vcpus) && \
>> -          (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
>> -          idx++)
>> +#define kvm_for_each_vcpu_continue(vcpu, kvm) \
>> +     list_for_each_entry_continue_rcu(vcpu, &kvm->vcpus, list)
>>
>
>
> Where is it used?
>
Once I used it in kvm_vcpu_on_spin, but now, it is useless. I will remove it.

>> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
>> +{
>> +     if (vcpu == NULL)
>> +             return NULL;
>> +     if (atomic_add_unless(&vcpu->refcount, 1, 0))
>
>
> Why do not use atomic_inc()?
> Also, i think a memory barrier is needed after increasing refcount.
>
Because when refcout==0, we prepare to destroy vcpu, and do not to
disturb it by increasing the refcount.
And sorry but I can not figure out the scene why memory barrier needed
here.  Seems no risks on SMP.

>> -     kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
>> +     /*Protected by kvm->lock*/
>> +     list_add_rcu(&vcpu->list, &kvm->vcpus);
>> +
>>       smp_wmb();
>
>
> This barrier can also be removed.
>
Yes, I think you are right.

Thanks and regards,
ping fan


>>       atomic_inc(&kvm->online_vcpus);
>>
>>  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>>       if (kvm->bsp_vcpu_id == id)
>> -             kvm->bsp_vcpu = vcpu;
>> +             kvm->bsp_vcpu = kvm_vcpu_get(vcpu);
>>  #endif
>>       mutex_unlock(&kvm->lock);
>>       return r;
>> @@ -2593,13 +2667,15 @@ static int vcpu_stat_get(void *_offset, u64 *val)
>>       unsigned offset = (long)_offset;
>>       struct kvm *kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i;
>>
>>       *val = 0;
>>       raw_spin_lock(&kvm_lock);
>> -     list_for_each_entry(kvm, &vm_list, vm_list)
>> -             kvm_for_each_vcpu(i, vcpu, kvm)
>> +     list_for_each_entry(kvm, &vm_list, vm_list) {
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(vcpu, kvm)
>>                       *val += *(u32 *)((void *)vcpu + offset);
>> +             rcu_read_unlock();
>> +     }
>>
>>       raw_spin_unlock(&kvm_lock);
>>       return 0;
>> @@ -2765,7 +2841,6 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
>>       kvm_preempt_ops.sched_out = kvm_sched_out;
>>
>>       kvm_init_debug();
>> -
>
>
> You don not change anything, please do not touch this line.
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  6:53                     ` Liu ping fan
@ 2011-12-15  8:25                       ` Xiao Guangrong
  2011-12-15  8:57                         ` Xiao Guangrong
  0 siblings, 1 reply; 78+ messages in thread
From: Xiao Guangrong @ 2011-12-15  8:25 UTC (permalink / raw)
  To: Liu ping fan; +Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti, jan.kiszka

On 12/15/2011 02:53 PM, Liu ping fan wrote:


> 
>>> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
>>> +{
>>> +     if (vcpu == NULL)
>>> +             return NULL;
>>> +     if (atomic_add_unless(&vcpu->refcount, 1, 0))
>>
>>
>> Why do not use atomic_inc()?
>> Also, i think a memory barrier is needed after increasing refcount.
>>
> Because when refcout==0, we prepare to destroy vcpu, and do not to
> disturb it by increasing the refcount.


Oh, get it.

> And sorry but I can not figure out the scene why memory barrier needed
> here.  Seems no risks on SMP.
> 


If atomic_add_unless is necessary, memory barrier is not needed here.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  3:21               ` Liu ping fan
  2011-12-15  4:28                 ` [PATCH v4] " Liu Ping Fan
@ 2011-12-15  8:33                 ` Gleb Natapov
  2011-12-15  9:06                   ` Liu ping fan
  1 sibling, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-12-15  8:33 UTC (permalink / raw)
  To: Liu ping fan
  Cc: Marcelo Tosatti, kvm, linux-kernel, avi, aliguori, jan.kiszka

On Thu, Dec 15, 2011 at 11:21:37AM +0800, Liu ping fan wrote:
> On Tue, Dec 13, 2011 at 7:36 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >>
> >> Currently, vcpu can be destructed only when kvm instance destroyed.
> >> Change this to vcpu's destruction taken when its refcnt is zero,
> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> >>
> >> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >> ---
> >>  arch/x86/kvm/i8254.c     |   10 ++++--
> >>  arch/x86/kvm/i8259.c     |   12 +++++--
> >>  arch/x86/kvm/mmu.c       |    7 ++--
> >>  arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
> >>  include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
> >>  virt/kvm/irq_comm.c      |    7 +++-
> >>  virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
> >>  7 files changed, 170 insertions(+), 53 deletions(-)
> >
> > This needs a full audit of paths that access vcpus. See for one example
> > bsp_vcpu pointer.
> >
> Yes, I had missed it and just paid attention to the access path to
> vcpu in kvm_lapic and the path used in async_pf. I will correct it
> later.
> BTW, I want to make it sure that because kvm_lapic will be destroyed
> before vcpu, so  it is safe to bypass the access path there, and the
> situation is the same in async_pf for we have called
> kvm_clear_async_pf_completion_queue before zapping vcpu.  Am I right?
> 
> As to the scene like bsp_vcpu, I think that introducing refcount like
> in V2 can handle it easier. Please help to review these changes in V4
> which I will send a little later.
> 
Since bsp_vcpu pointer will never be released or re-assigned introducing
reference count to keep the pointer valid is not necessary. The counter
will never reach 0 and bsp vcpu will never be freed. Just disallow
removal of bsp_vcpu. Or better get rid of bsp_vcpu at all since its only
use is invalid anyway.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  8:25                       ` Xiao Guangrong
@ 2011-12-15  8:57                         ` Xiao Guangrong
  0 siblings, 0 replies; 78+ messages in thread
From: Xiao Guangrong @ 2011-12-15  8:57 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Liu ping fan, kvm, linux-kernel, avi, aliguori, gleb, mtosatti,
	jan.kiszka

On 12/15/2011 04:25 PM, Xiao Guangrong wrote:

> On 12/15/2011 02:53 PM, Liu ping fan wrote:
> 
> 
>>
>>>> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +     if (vcpu == NULL)
>>>> +             return NULL;
>>>> +     if (atomic_add_unless(&vcpu->refcount, 1, 0))
>>>
>>>
>>> Why do not use atomic_inc()?
>>> Also, i think a memory barrier is needed after increasing refcount.
>>>
>> Because when refcout==0, we prepare to destroy vcpu, and do not to
>> disturb it by increasing the refcount.
> 
> 
> Oh, get it.
> 


But i think we can do it like this:

On the vcpu free path:

hold kvm->lock
delete vcpu from the kvm->vcpus
release kvm->lock

synchronize_rcu()
kvm_vcpu_put()

then, we can avoid get invalid instance and it can make the code simple?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  8:33                 ` [PATCH v3] " Gleb Natapov
@ 2011-12-15  9:06                   ` Liu ping fan
  2011-12-15  9:08                     ` Gleb Natapov
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-15  9:06 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Marcelo Tosatti, kvm, linux-kernel, avi, aliguori, jan.kiszka

2011/12/15 Gleb Natapov <gleb@redhat.com>:
> On Thu, Dec 15, 2011 at 11:21:37AM +0800, Liu ping fan wrote:
>> On Tue, Dec 13, 2011 at 7:36 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
>> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> >>
>> >> Currently, vcpu can be destructed only when kvm instance destroyed.
>> >> Change this to vcpu's destruction taken when its refcnt is zero,
>> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
>> >>
>> >> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> >> ---
>> >>  arch/x86/kvm/i8254.c     |   10 ++++--
>> >>  arch/x86/kvm/i8259.c     |   12 +++++--
>> >>  arch/x86/kvm/mmu.c       |    7 ++--
>> >>  arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
>> >>  include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
>> >>  virt/kvm/irq_comm.c      |    7 +++-
>> >>  virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
>> >>  7 files changed, 170 insertions(+), 53 deletions(-)
>> >
>> > This needs a full audit of paths that access vcpus. See for one example
>> > bsp_vcpu pointer.
>> >
>> Yes, I had missed it and just paid attention to the access path to
>> vcpu in kvm_lapic and the path used in async_pf. I will correct it
>> later.
>> BTW, I want to make it sure that because kvm_lapic will be destroyed
>> before vcpu, so  it is safe to bypass the access path there, and the
>> situation is the same in async_pf for we have called
>> kvm_clear_async_pf_completion_queue before zapping vcpu.  Am I right?
>>
>> As to the scene like bsp_vcpu, I think that introducing refcount like
>> in V2 can handle it easier. Please help to review these changes in V4
>> which I will send a little later.
>>
> Since bsp_vcpu pointer will never be released or re-assigned introducing
> reference count to keep the pointer valid is not necessary. The counter
> will never reach 0 and bsp vcpu will never be freed. Just disallow

OK. And I have a question -- who will play the role to guard bsp_vcpu?
kernel or qemu?  Must I add something in kernel to protect the
bsp_vcpu

> removal of bsp_vcpu. Or better get rid of bsp_vcpu at all since its only
> use is invalid anyway.
>
I will dig into it and see how to handle it.

Thanks and regards,
ping fan
> --
>                        Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v3] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  9:06                   ` Liu ping fan
@ 2011-12-15  9:08                     ` Gleb Natapov
  0 siblings, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-12-15  9:08 UTC (permalink / raw)
  To: Liu ping fan
  Cc: Marcelo Tosatti, kvm, linux-kernel, avi, aliguori, jan.kiszka

On Thu, Dec 15, 2011 at 05:06:09PM +0800, Liu ping fan wrote:
> 2011/12/15 Gleb Natapov <gleb@redhat.com>:
> > On Thu, Dec 15, 2011 at 11:21:37AM +0800, Liu ping fan wrote:
> >> On Tue, Dec 13, 2011 at 7:36 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> > On Mon, Dec 12, 2011 at 10:41:23AM +0800, Liu Ping Fan wrote:
> >> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >> >>
> >> >> Currently, vcpu can be destructed only when kvm instance destroyed.
> >> >> Change this to vcpu's destruction taken when its refcnt is zero,
> >> >> and then vcpu MUST and CAN be destroyed before kvm's destroy.
> >> >>
> >> >> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >> >> ---
> >> >>  arch/x86/kvm/i8254.c     |   10 ++++--
> >> >>  arch/x86/kvm/i8259.c     |   12 +++++--
> >> >>  arch/x86/kvm/mmu.c       |    7 ++--
> >> >>  arch/x86/kvm/x86.c       |   54 +++++++++++++++++++----------------
> >> >>  include/linux/kvm_host.h |   71 ++++++++++++++++++++++++++++++++++++++++++----
> >> >>  virt/kvm/irq_comm.c      |    7 +++-
> >> >>  virt/kvm/kvm_main.c      |   62 +++++++++++++++++++++++++++++++++------
> >> >>  7 files changed, 170 insertions(+), 53 deletions(-)
> >> >
> >> > This needs a full audit of paths that access vcpus. See for one example
> >> > bsp_vcpu pointer.
> >> >
> >> Yes, I had missed it and just paid attention to the access path to
> >> vcpu in kvm_lapic and the path used in async_pf. I will correct it
> >> later.
> >> BTW, I want to make it sure that because kvm_lapic will be destroyed
> >> before vcpu, so  it is safe to bypass the access path there, and the
> >> situation is the same in async_pf for we have called
> >> kvm_clear_async_pf_completion_queue before zapping vcpu.  Am I right?
> >>
> >> As to the scene like bsp_vcpu, I think that introducing refcount like
> >> in V2 can handle it easier. Please help to review these changes in V4
> >> which I will send a little later.
> >>
> > Since bsp_vcpu pointer will never be released or re-assigned introducing
> > reference count to keep the pointer valid is not necessary. The counter
> > will never reach 0 and bsp vcpu will never be freed. Just disallow
> 
> OK. And I have a question -- who will play the role to guard bsp_vcpu?
> kernel or qemu?  Must I add something in kernel to protect the
> bsp_vcpu
> 
Kernel of course. But I prefer just to rid of bsp_vcpu. I'll try to send
patch today.

> > removal of bsp_vcpu. Or better get rid of bsp_vcpu at all since its only
> > use is invalid anyway.
> >
> I will dig into it and see how to handle it.
> 
> Thanks and regards,
> ping fan
> > --
> >                        Gleb.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  4:28                 ` [PATCH v4] " Liu Ping Fan
  2011-12-15  5:33                   ` Xiao Guangrong
  2011-12-15  6:48                   ` Takuya Yoshikawa
@ 2011-12-15  9:10                   ` Gleb Natapov
  2011-12-16  7:50                     ` Liu ping fan
  2 siblings, 1 reply; 78+ messages in thread
From: Gleb Natapov @ 2011-12-15  9:10 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, linux-kernel, avi, aliguori, mtosatti, jan.kiszka

On Thu, Dec 15, 2011 at 12:28:48PM +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction before kvm instance, so vcpu MUST
> and CAN be destroyed before kvm's destroy.
> 
I see reference counting is back.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d9cfb78..71dda47 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -141,6 +141,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
>  {
>  	int cpu;
>  
> +	kvm_vcpu_get(vcpu);
>  	mutex_lock(&vcpu->mutex);
>  	if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
>  		/* The thread running this VCPU changed. */
> @@ -163,6 +164,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
>  	preempt_notifier_unregister(&vcpu->preempt_notifier);
>  	preempt_enable();
>  	mutex_unlock(&vcpu->mutex);
> +	kvm_vcpu_put(vcpu);
>  }
>  
Why is kvm_vcpu_get/kvm_vcpu_put is needed in vcpu_load/vcpu_put? 
As far as I see load/put are only called in vcpu ioctl,
kvm_arch_vcpu_setup(), kvm_arch_vcpu_destroy() and kvm_arch_destroy_vm().

kvm_arch_vcpu_setup() and kvm_arch_vcpu_destroy() are called before vcpu is
added to vcpus list, so it can't be accessed by other thread at this
point. kvm_arch_destroy_vm() is  called on KVM destruction path when all
vcpus should be destroyed already. So the only interesting place is vcpu
ioctl and I think we are protected by fd refcount there. vcpu fd can't
be closed while ioctl is executing for that vcpu. Otherwise we would
have problem now too.

> @@ -1539,12 +1547,10 @@ EXPORT_SYMBOL_GPL(kvm_resched);
>  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  {
>  	struct kvm *kvm = me->kvm;
> -	struct kvm_vcpu *vcpu;
> -	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> -	int yielded = 0;
> -	int pass;
> -	int i;
> -
> +	struct kvm_vcpu *vcpu, *v;
> +	struct task_struct *task = NULL;
> +	struct pid *pid;
> +	int pass, firststart, lastone, yielded;
>  	/*
>  	 * We boost the priority of a VCPU that is runnable but not
>  	 * currently running, because it got preempted by something
> @@ -1552,15 +1558,22 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 * VCPU is holding the lock that we need and will release it.
>  	 * We approximate round-robin by starting at the last boosted VCPU.
>  	 */
> -	for (pass = 0; pass < 2 && !yielded; pass++) {
> -		kvm_for_each_vcpu(i, vcpu, kvm) {
> -			struct task_struct *task = NULL;
> -			struct pid *pid;
> -			if (!pass && i < last_boosted_vcpu) {
> -				i = last_boosted_vcpu;
> +	for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
> +
> +		rcu_read_lock();
> +		kvm_for_each_vcpu(vcpu, kvm) {
> +			if (!pass && !firststart &&
> +			    vcpu != kvm->last_boosted_vcpu &&
> +			    kvm->last_boosted_vcpu != NULL) {
> +				vcpu = kvm->last_boosted_vcpu;
> +				firststart = 1;
>  				continue;
> -			} else if (pass && i > last_boosted_vcpu)
> +			} else if (pass && !lastone) {
> +				if (vcpu == kvm->last_boosted_vcpu)
> +					lastone = 1;
> +			} else if (pass && lastone)
>  				break;
> +
>  			if (vcpu == me)
>  				continue;
>  			if (waitqueue_active(&vcpu->wq))
> @@ -1576,15 +1589,29 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  				put_task_struct(task);
>  				continue;
>  			}
> +			v = kvm_vcpu_get(vcpu);
> +			if (v == NULL)
> +				continue;
> +
> +			rcu_read_unlock();
>  			if (yield_to(task, 1)) {
>  				put_task_struct(task);
> -				kvm->last_boosted_vcpu = i;
> +				mutex_lock(&kvm->lock);
> +				/*Remeber to release it.*/
> +				if (kvm->last_boosted_vcpu != NULL)
> +					kvm_vcpu_put(kvm->last_boosted_vcpu);
> +				kvm->last_boosted_vcpu = vcpu;
> +				mutex_unlock(&kvm->lock);
>  				yielded = 1;
I think we can be smart and protect kvm->last_boosted_vcpu with the same
rcu as vcpus, but yeild_to() can sleep anyway. Hmm may be we should use
srcu in the first place :( Or rewrite the logic of the functions
somehow to call yield_to() outside of the loop. This is heuristics anyway.

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  9:10                   ` Gleb Natapov
@ 2011-12-16  7:50                     ` Liu ping fan
  0 siblings, 0 replies; 78+ messages in thread
From: Liu ping fan @ 2011-12-16  7:50 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, linux-kernel, avi, aliguori, mtosatti, jan.kiszka

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 6221 bytes --]

On Thu, Dec 15, 2011 at 5:10 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Thu, Dec 15, 2011 at 12:28:48PM +0800, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction before kvm instance, so vcpu MUST
>> and CAN be destroyed before kvm's destroy.
>>
> I see reference counting is back.
>
Resort to SRCU in next version, I will remove refcnt.

>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index d9cfb78..71dda47 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -141,6 +141,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
>>  {
>>       int cpu;
>>
>> +     kvm_vcpu_get(vcpu);
>>       mutex_lock(&vcpu->mutex);
>>       if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
>>               /* The thread running this VCPU changed. */
>> @@ -163,6 +164,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
>>       preempt_notifier_unregister(&vcpu->preempt_notifier);
>>       preempt_enable();
>>       mutex_unlock(&vcpu->mutex);
>> +     kvm_vcpu_put(vcpu);
>>  }
>>
> Why is kvm_vcpu_get/kvm_vcpu_put is needed in vcpu_load/vcpu_put?
> As far as I see load/put are only called in vcpu ioctl,
> kvm_arch_vcpu_setup(), kvm_arch_vcpu_destroy() and kvm_arch_destroy_vm().
>
> kvm_arch_vcpu_setup() and kvm_arch_vcpu_destroy() are called before vcpu is
> added to vcpus list, so it can't be accessed by other thread at this
> point. kvm_arch_destroy_vm() is  called on KVM destruction path when all
> vcpus should be destroyed already. So the only interesting place is vcpu
> ioctl and I think we are protected by fd refcount there. vcpu fd can't
> be closed while ioctl is executing for that vcpu. Otherwise we would
> have problem now too.
>
Yeah, ioctl is protected by fd refcount. That is what I had aimed to,
but as you pointed out, it is unnecessary at all.

>> @@ -1539,12 +1547,10 @@ EXPORT_SYMBOL_GPL(kvm_resched);
>>  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>  {
>>       struct kvm *kvm = me->kvm;
>> -     struct kvm_vcpu *vcpu;
>> -     int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>> -     int yielded = 0;
>> -     int pass;
>> -     int i;
>> -
>> +     struct kvm_vcpu *vcpu, *v;
>> +     struct task_struct *task = NULL;
>> +     struct pid *pid;
>> +     int pass, firststart, lastone, yielded;
>>       /*
>>        * We boost the priority of a VCPU that is runnable but not
>>        * currently running, because it got preempted by something
>> @@ -1552,15 +1558,22 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>        * VCPU is holding the lock that we need and will release it.
>>        * We approximate round-robin by starting at the last boosted VCPU.
>>        */
>> -     for (pass = 0; pass < 2 && !yielded; pass++) {
>> -             kvm_for_each_vcpu(i, vcpu, kvm) {
>> -                     struct task_struct *task = NULL;
>> -                     struct pid *pid;
>> -                     if (!pass && i < last_boosted_vcpu) {
>> -                             i = last_boosted_vcpu;
>> +     for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
>> +
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(vcpu, kvm) {
>> +                     if (!pass && !firststart &&
>> +                         vcpu != kvm->last_boosted_vcpu &&
>> +                         kvm->last_boosted_vcpu != NULL) {
>> +                             vcpu = kvm->last_boosted_vcpu;
>> +                             firststart = 1;
>>                               continue;
>> -                     } else if (pass && i > last_boosted_vcpu)
>> +                     } else if (pass && !lastone) {
>> +                             if (vcpu == kvm->last_boosted_vcpu)
>> +                                     lastone = 1;
>> +                     } else if (pass && lastone)
>>                               break;
>> +
>>                       if (vcpu == me)
>>                               continue;
>>                       if (waitqueue_active(&vcpu->wq))
>> @@ -1576,15 +1589,29 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>                               put_task_struct(task);
>>                               continue;
>>                       }
>> +                     v = kvm_vcpu_get(vcpu);
>> +                     if (v == NULL)
>> +                             continue;
>> +
>> +                     rcu_read_unlock();
>>                       if (yield_to(task, 1)) {
>>                               put_task_struct(task);
>> -                             kvm->last_boosted_vcpu = i;
>> +                             mutex_lock(&kvm->lock);
>> +                             /*Remeber to release it.*/
>> +                             if (kvm->last_boosted_vcpu != NULL)
>> +                                     kvm_vcpu_put(kvm->last_boosted_vcpu);
>> +                             kvm->last_boosted_vcpu = vcpu;
>> +                             mutex_unlock(&kvm->lock);
>>                               yielded = 1;
> I think we can be smart and protect kvm->last_boosted_vcpu with the same
> rcu as vcpus, but yeild_to() can sleep anyway. Hmm may be we should use
> srcu in the first place :( Or rewrite the logic of the functions
> somehow to call yield_to() outside of the loop. This is heuristics anyway.
>
And I think changing to srcu will be easier :-), and have started to do it.

Thanks and regards
ping fan
> --
>                        Gleb.
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  6:48                   ` Takuya Yoshikawa
@ 2011-12-16  9:38                     ` Marcelo Tosatti
  2011-12-17  3:57                     ` Liu ping fan
  1 sibling, 0 replies; 78+ messages in thread
From: Marcelo Tosatti @ 2011-12-16  9:38 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Liu Ping Fan, kvm, linux-kernel, avi, aliguori, gleb, jan.kiszka

On Thu, Dec 15, 2011 at 03:48:38PM +0900, Takuya Yoshikawa wrote:
> (2011/12/15 13:28), Liu Ping Fan wrote:
> > From: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
> > 
> > Currently, vcpu can be destructed only when kvm instance destroyed.
> > Change this to vcpu's destruction before kvm instance, so vcpu MUST
> > and CAN be destroyed before kvm's destroy.
> 
> Could you explain why this change is needed here?
> Would be helpful for those, including me, who will read the commit later.

I fail to see the motivation for this change also.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v5] kvm: make vcpu life cycle separated from kvm instance
  2011-12-12  2:41           ` [PATCH v3] " Liu Ping Fan
  2011-12-12 12:54             ` Gleb Natapov
  2011-12-13 11:36             ` Marcelo Tosatti
@ 2011-12-17  3:19             ` Liu Ping Fan
  2011-12-26 11:09               ` Gleb Natapov
                                 ` (2 more replies)
  2 siblings, 3 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-12-17  3:19 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, avi, aliguori, gleb, mtosatti, jan.kiszka

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Currently, vcpu can be destructed only when kvm instance destroyed.
Change this to vcpu's destruction before kvm instance, so vcpu MUST
and CAN be destroyed before kvm's destroy.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/x86/kvm/i8254.c     |   10 +++--
 arch/x86/kvm/i8259.c     |   12 ++++--
 arch/x86/kvm/x86.c       |   53 +++++++++++------------
 include/linux/kvm_host.h |   20 ++++-----
 virt/kvm/irq_comm.c      |    6 ++-
 virt/kvm/kvm_main.c      |  106 ++++++++++++++++++++++++++++++++++-----------
 6 files changed, 132 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 76e3f1c..a3a5506 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -289,9 +289,8 @@ static void pit_do_work(struct work_struct *work)
 	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
 	struct kvm *kvm = pit->kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
 	struct kvm_kpit_state *ps = &pit->pit_state;
-	int inject = 0;
+	int idx, inject = 0;
 
 	/* Try to inject pending interrupts when
 	 * last one has been acked.
@@ -315,9 +314,12 @@ static void pit_do_work(struct work_struct *work)
 		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 		 * VCPU0, and only if its LVT0 is in EXTINT mode.
 		 */
-		if (kvm->arch.vapics_in_nmi_mode > 0)
-			kvm_for_each_vcpu(i, vcpu, kvm)
+		if (kvm->arch.vapics_in_nmi_mode > 0) {
+			idx = srcu_read_lock(&kvm->srcu_vcpus);
+			kvm_for_each_vcpu(vcpu, kvm)
 				kvm_apic_nmi_wd_deliver(vcpu);
+			srcu_read_unlock(&kvm->srcu_vcpus, idx);
+		}
 	}
 }
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index cac4746..5ef5c05 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
 {
 	bool wakeup = s->wakeup_needed;
 	struct kvm_vcpu *vcpu, *found = NULL;
-	int i;
+	struct kvm *kvm = s->kvm;
+	int idx;
 
 	s->wakeup_needed = false;
 
 	spin_unlock(&s->lock);
 
 	if (wakeup) {
-		kvm_for_each_vcpu(i, vcpu, s->kvm) {
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm)
 			if (kvm_apic_accept_pic_intr(vcpu)) {
 				found = vcpu;
 				break;
 			}
-		}
 
-		if (!found)
+		if (!found) {
+			srcu_read_unlock(&kvm->srcu_vcpus, idx);
 			return;
+		}
 
 		kvm_make_request(KVM_REQ_EVENT, found);
 		kvm_vcpu_kick(found);
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
 	}
 }
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23c93fe..b79739d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1774,14 +1774,20 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 {
 	u64 data = 0;
+	int idx;
 
 	switch (msr) {
 	case HV_X64_MSR_VP_INDEX: {
-		int r;
+		int r = 0;
 		struct kvm_vcpu *v;
-		kvm_for_each_vcpu(r, v, vcpu->kvm)
+		struct kvm *kvm = vcpu->kvm;
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(v, vcpu->kvm) {
 			if (v == vcpu)
 				data = r;
+			r++;
+		}
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
 		break;
 	}
 	case HV_X64_MSR_EOI:
@@ -4529,7 +4535,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 	struct cpufreq_freqs *freq = data;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i, send_ipi = 0;
+	int idx, send_ipi = 0;
 
 	/*
 	 * We allow guests to temporarily run on slowing clocks,
@@ -4579,13 +4585,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 
 	raw_spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm) {
 			if (vcpu->cpu != freq->cpu)
 				continue;
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
+
 	}
 	raw_spin_unlock(&kvm_lock);
 
@@ -5866,13 +5875,17 @@ int kvm_arch_hardware_enable(void *garbage)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int idx;
 
 	kvm_shared_msr_cpu_online();
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm) {
 			if (vcpu->cpu == smp_processor_id())
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		}
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
+	}
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
@@ -5989,27 +6002,14 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
+void kvm_arch_vcpu_zap(struct kvm_vcpu *vcpu)
 {
-	unsigned int i;
-	struct kvm_vcpu *vcpu;
-
-	/*
-	 * Unpin any mmu pages first.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_unload_vcpu_mmu(vcpu);
-	}
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_arch_vcpu_free(vcpu);
-
-	mutex_lock(&kvm->lock);
-	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-		kvm->vcpus[i] = NULL;
+	struct kvm *kvm = vcpu->kvm;
 
-	atomic_set(&kvm->online_vcpus, 0);
-	mutex_unlock(&kvm->lock);
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+	kvm_put_kvm(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
@@ -6023,7 +6023,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8c5c303..ab22828 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -115,6 +115,7 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
+	struct list_head list;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
@@ -249,13 +250,15 @@ struct kvm {
 	struct mm_struct *mm; /* userspace tied to this vm */
 	struct kvm_memslots *memslots;
 	struct srcu_struct srcu;
+	struct srcu_struct srcu_vcpus;
+
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
 	u32 bsp_vcpu_id;
 	struct kvm_vcpu *bsp_vcpu;
 #endif
-	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+	struct list_head vcpus;
 	atomic_t online_vcpus;
-	int last_boosted_vcpu;
+	struct kvm_vcpu *last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
@@ -302,17 +305,10 @@ struct kvm {
 #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
 #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
 
-static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
-{
-	smp_rmb();
-	return kvm->vcpus[i];
-}
+void kvm_arch_vcpu_zap(struct kvm_vcpu *vcpu);
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-	for (idx = 0; \
-	     idx < atomic_read(&kvm->online_vcpus) && \
-	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
-	     idx++)
+#define kvm_for_each_vcpu(vcpu, kvm) \
+	list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
 
 #define kvm_for_each_memslot(memslot, slots)	\
 	for (memslot = &slots->memslots[0];	\
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9f614b4..78dc97c 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -81,14 +81,15 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq)
 {
-	int i, r = -1;
+	int idx, r = -1;
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 			kvm_is_dm_lowest_prio(irq))
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
 
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	idx = srcu_read_lock(&kvm->srcu_vcpus);
+	kvm_for_each_vcpu(vcpu, kvm) {
 		if (!kvm_apic_present(vcpu))
 			continue;
 
@@ -111,6 +112,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	if (lowest)
 		r = kvm_apic_set_irq(lowest, irq);
 
+	srcu_read_unlock(&kvm->srcu_vcpus, idx);
 	return r;
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e289486..ec0c920 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -171,7 +171,7 @@ static void ack_flush(void *_completed)
 
 static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
-	int i, cpu, me;
+	int cpu, me, idx;
 	cpumask_var_t cpus;
 	bool called = true;
 	struct kvm_vcpu *vcpu;
@@ -179,7 +179,8 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
 
 	me = get_cpu();
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	idx = srcu_read_lock(&kvm->srcu_vcpus);
+	kvm_for_each_vcpu(vcpu, kvm) {
 		kvm_make_request(req, vcpu);
 		cpu = vcpu->cpu;
 
@@ -190,12 +191,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
 			cpumask_set_cpu(cpu, cpus);
 	}
+	srcu_read_unlock(&kvm->srcu_vcpus, idx);
+
 	if (unlikely(cpus == NULL))
 		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
 	else if (!cpumask_empty(cpus))
 		smp_call_function_many(cpus, ack_flush, NULL, 1);
 	else
 		called = false;
+
 	put_cpu();
 	free_cpumask_var(cpus);
 	return called;
@@ -477,6 +481,8 @@ static struct kvm *kvm_create_vm(void)
 	kvm_init_memslots_id(kvm);
 	if (init_srcu_struct(&kvm->srcu))
 		goto out_err_nosrcu;
+	if (init_srcu_struct(&kvm->srcu_vcpus))
+		goto out_err_nosrcu_vcpus;
 	for (i = 0; i < KVM_NR_BUSES; i++) {
 		kvm->buses[i] = kzalloc(sizeof(struct kvm_io_bus),
 					GFP_KERNEL);
@@ -500,10 +506,13 @@ static struct kvm *kvm_create_vm(void)
 	raw_spin_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	raw_spin_unlock(&kvm_lock);
+	INIT_LIST_HEAD(&kvm->vcpus);
 
 	return kvm;
 
 out_err:
+	cleanup_srcu_struct(&kvm->srcu_vcpus);
+out_err_nosrcu_vcpus:
 	cleanup_srcu_struct(&kvm->srcu);
 out_err_nosrcu:
 	hardware_disable_all();
@@ -587,6 +596,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_destroy_vm(kvm);
 	kvm_free_physmem(kvm);
 	cleanup_srcu_struct(&kvm->srcu);
+	cleanup_srcu_struct(&kvm->srcu_vcpus);
 	kvm_arch_free_vm(kvm);
 	hardware_disable_all();
 	mmdrop(mm);
@@ -1593,11 +1603,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
-	int yielded = 0;
-	int pass;
-	int i;
-
+	struct task_struct *task = NULL;
+	struct pid *pid;
+	int pass, firststart, lastone, yielded, idx;
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
@@ -1605,15 +1613,22 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			struct task_struct *task = NULL;
-			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
-				i = last_boosted_vcpu;
+	for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
+
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm) {
+			if (!pass && !firststart &&
+			    vcpu != kvm->last_boosted_vcpu &&
+			    kvm->last_boosted_vcpu != NULL) {
+				vcpu = kvm->last_boosted_vcpu;
+				firststart = 1;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass && !lastone) {
+				if (vcpu == kvm->last_boosted_vcpu)
+					lastone = 1;
+			} else if (pass && lastone)
 				break;
+
 			if (vcpu == me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
@@ -1629,15 +1644,20 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				put_task_struct(task);
 				continue;
 			}
+
 			if (yield_to(task, 1)) {
 				put_task_struct(task);
-				kvm->last_boosted_vcpu = i;
+				mutex_lock(&kvm->lock);
+				kvm->last_boosted_vcpu = vcpu;
+				mutex_unlock(&kvm->lock);
 				yielded = 1;
 				break;
 			}
 			put_task_struct(task);
 		}
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
 	}
+
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
@@ -1673,11 +1693,30 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static void kvm_vcpu_zap(struct kvm_vcpu *vcpu)
+{
+	kvm_arch_vcpu_zap(vcpu);
+}
+
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
+	struct kvm *kvm = vcpu->kvm;
+	filp->private_data = NULL;
+
+	mutex_lock(&kvm->lock);
+	list_del_rcu(&vcpu->list);
+	atomic_dec(&kvm->online_vcpus);
+	mutex_unlock(&kvm->lock);
+	synchronize_srcu_expedited(&kvm->srcu_vcpus);
+
+	mutex_lock(&kvm->lock);
+	if (kvm->last_boosted_vcpu == vcpu)
+		kvm->last_boosted_vcpu = NULL;
+	mutex_unlock(&kvm->lock);
 
-	kvm_put_kvm(vcpu->kvm);
+	/*vcpu is out of list,drop it safely*/
+	kvm_vcpu_zap(vcpu);
 	return 0;
 }
 
@@ -1699,15 +1738,25 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
 	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
 }
 
+static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
+{
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_arch_vcpu_create(kvm, id);
+	if (IS_ERR(vcpu))
+		return vcpu;
+	INIT_LIST_HEAD(&vcpu->list);
+	return vcpu;
+}
+
 /*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
 static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 {
-	int r;
+	int r, idx;
 	struct kvm_vcpu *vcpu, *v;
 
-	vcpu = kvm_arch_vcpu_create(kvm, id);
+	vcpu = kvm_vcpu_create(kvm, id);
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
@@ -1723,13 +1772,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm_for_each_vcpu(r, v, kvm)
+	idx = srcu_read_lock(&kvm->srcu_vcpus);
+	kvm_for_each_vcpu(v, kvm) {
 		if (v->vcpu_id == id) {
 			r = -EEXIST;
+			srcu_read_unlock(&kvm->srcu_vcpus, idx);
 			goto unlock_vcpu_destroy;
 		}
-
-	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
+	}
+	srcu_read_unlock(&kvm->srcu_vcpus, idx);
 
 	/* Now it's all set up, let userspace reach it */
 	kvm_get_kvm(kvm);
@@ -1739,8 +1790,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
-	smp_wmb();
+	/*Protected by kvm->lock*/
+	list_add_rcu(&vcpu->list, &kvm->vcpus);
 	atomic_inc(&kvm->online_vcpus);
 
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
@@ -2645,13 +2696,16 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int idx;
 
 	*val = 0;
 	raw_spin_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm)
 			*val += *(u32 *)((void *)vcpu + offset);
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
+	}
 
 	raw_spin_unlock(&kvm_lock);
 	return 0;
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-15  6:48                   ` Takuya Yoshikawa
  2011-12-16  9:38                     ` Marcelo Tosatti
@ 2011-12-17  3:57                     ` Liu ping fan
  2011-12-19  1:16                       ` Takuya Yoshikawa
  1 sibling, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-17  3:57 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti, jan.kiszka

2011/12/15 Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>:
> (2011/12/15 13:28), Liu Ping Fan wrote:
>> From: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction before kvm instance, so vcpu MUST
>> and CAN be destroyed before kvm's destroy.
>
> Could you explain why this change is needed here?
> Would be helpful for those, including me, who will read the commit later.
>
Suppose the following scene,
Firstly, creating 10 kvm_vcpu for guest to take the advantage of
multi-core. Now, reclaiming some of the kvm_vcpu, so we can limit the
guest's usage of cpu. Then what about the kvm_vcpu unused? Currently
they are just idle in kernel, but with this patch, we can remove them.

>>
>> Signed-off-by: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
>> ---
>
> ...
>
>> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
>> index cac4746..f275b8c 100644
>> --- a/arch/x86/kvm/i8259.c
>> +++ b/arch/x86/kvm/i8259.c
>> @@ -50,25 +50,28 @@ static void pic_unlock(struct kvm_pic *s)
>>   {
>>       bool wakeup = s->wakeup_needed;
>>       struct kvm_vcpu *vcpu, *found = NULL;
>> -     int i;
>> +     struct kvm *kvm = s->kvm;
>>
>>       s->wakeup_needed = false;
>>
>>       spin_unlock(&s->lock);
>>
>>       if (wakeup) {
>> -             kvm_for_each_vcpu(i, vcpu, s->kvm) {
>> +             rcu_read_lock();
>> +             kvm_for_each_vcpu(vcpu, kvm)
>>                       if (kvm_apic_accept_pic_intr(vcpu)) {
>>                               found = vcpu;
>>                               break;
>>                       }
>> -             }
>>
>> -             if (!found)
>> +             if (!found) {
>> +                     rcu_read_unlock();
>>                       return;
>> +             }
>>
>>               kvm_make_request(KVM_REQ_EVENT, found);
>>               kvm_vcpu_kick(found);
>> +             rcu_read_unlock();
>>       }
>>   }
>
> How about this? (just about stylistic issues)
>
:-), I just want to change based on old code. But your style is OK too.

>        if (!wakeup)
>                return;
>
>        rcu_read_lock();
>        kvm_for_each_vcpu(vcpu, kvm)
>                if (kvm_apic_accept_pic_intr(vcpu)) {
>                        found = vcpu;
>                        break;
>                }
>
>        if (!found)
>                goto out;
>
>        kvm_make_request(KVM_REQ_EVENT, found);
>        kvm_vcpu_kick(found);
> out:
>        rcu_read_unlock();
>
> ...
>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>
> ...
>
>> +void kvm_arch_vcpu_zap(struct work_struct *work)
>> +{
>> +     struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
>> +                     zap_work);
>> +     struct kvm *kvm = vcpu->kvm;
>>
>> -     atomic_set(&kvm->online_vcpus, 0);
>> -     mutex_unlock(&kvm->lock);
>> +     kvm_clear_async_pf_completion_queue(vcpu);
>> +     kvm_unload_vcpu_mmu(vcpu);
>> +     kvm_arch_vcpu_free(vcpu);
>> +     kvm_put_kvm(kvm);
>>   }
>
> zap is really a good name for this?
>
zap = destroy, so I think it is OK.
> ...
>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index d526231..733de1c 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -19,6 +19,7 @@
>>   #include<linux/slab.h>
>>   #include<linux/rcupdate.h>
>>   #include<linux/ratelimit.h>
>> +#include<linux/atomic.h>
>>   #include<asm/signal.h>
>>
>>   #include<linux/kvm.h>
>> @@ -113,6 +114,10 @@ enum {
>>
>>   struct kvm_vcpu {
>>       struct kvm *kvm;
>> +     atomic_t refcount;
>> +     struct list_head list;
>> +     struct rcu_head head;
>> +     struct work_struct zap_work;
>
> How about adding some comments?
> zap_work is not at all self explanatory, IMO.
>
>
>>   #ifdef CONFIG_PREEMPT_NOTIFIERS
>>       struct preempt_notifier preempt_notifier;
>>   #endif
>> @@ -241,9 +246,9 @@ struct kvm {
>>       u32 bsp_vcpu_id;
>>       struct kvm_vcpu *bsp_vcpu;
>>   #endif
>> -     struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>> +     struct list_head vcpus;
>>       atomic_t online_vcpus;
>> -     int last_boosted_vcpu;
>> +     struct kvm_vcpu *last_boosted_vcpu;
>>       struct list_head vm_list;
>>       struct mutex lock;
>>       struct kvm_io_bus *buses[KVM_NR_BUSES];
>> @@ -290,17 +295,15 @@ struct kvm {
>>   #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
>>   #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
>>
>> -static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
>> -{
>> -     smp_rmb();
>> -     return kvm->vcpus[i];
>> -}
>> +struct kvm_vcpu *kvm_vcpu_get(struct kvm_vcpu *vcpu);
>> +void kvm_vcpu_put(struct kvm_vcpu *vcpu);
>> +void kvm_arch_vcpu_zap(struct work_struct *work);
>> +
>> +#define kvm_for_each_vcpu(vcpu, kvm) \
>> +     list_for_each_entry_rcu(vcpu,&kvm->vcpus, list)
>
> Is this macro really worth it?
> _rcu shows readers important information, I think.
>
I guest kvm_for_each_vcpu is designed for hiding the details of
internal implement, and currently it is implemented by array, and my
patch will change it to linked-list,
so IMO, we can still hide the details.

Regards,
ping fan

>>
>> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
>> -     for (idx = 0; \
>> -          idx<  atomic_read(&kvm->online_vcpus)&&  \
>> -          (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
>> -          idx++)
>> +#define kvm_for_each_vcpu_continue(vcpu, kvm) \
>> +     list_for_each_entry_continue_rcu(vcpu,&kvm->vcpus, list)
>
> Same here.
> Why do you want to hide _rcu from readers?
>
>
>        Takuya

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v4] kvm: make vcpu life cycle separated from kvm instance
  2011-12-17  3:57                     ` Liu ping fan
@ 2011-12-19  1:16                       ` Takuya Yoshikawa
  0 siblings, 0 replies; 78+ messages in thread
From: Takuya Yoshikawa @ 2011-12-19  1:16 UTC (permalink / raw)
  To: Liu ping fan; +Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti, jan.kiszka

Liu ping fan wrote:
> Suppose the following scene,
> Firstly, creating 10 kvm_vcpu for guest to take the advantage of
> multi-core. Now, reclaiming some of the kvm_vcpu, so we can limit the
> guest's usage of cpu. Then what about the kvm_vcpu unused? Currently
> they are just idle in kernel, but with this patch, we can remove them.

Then why not write it in the changelog?

>>> +void kvm_arch_vcpu_zap(struct work_struct *work)
>>> +{
>>> +     struct kvm_vcpu *vcpu = container_of(work, struct kvm_vcpu,
>>> +                     zap_work);
>>> +     struct kvm *kvm = vcpu->kvm;
>>>
>>> -     atomic_set(&kvm->online_vcpus, 0);
>>> -     mutex_unlock(&kvm->lock);
>>> +     kvm_clear_async_pf_completion_queue(vcpu);
>>> +     kvm_unload_vcpu_mmu(vcpu);
>>> +     kvm_arch_vcpu_free(vcpu);
>>> +     kvm_put_kvm(kvm);
>>>    }
>>
>> zap is really a good name for this?
>>
> zap = destroy, so I think it is OK.

Stronger than that.
My dictionary says "to destroy sth suddenly and with force."

In the case of shadow pages, I see what the author wanted to mean by "zap".

In your case, the host really destroy a VCPU suddenly?
The guest have to unplug it before, I guess.

If you just mean "destroy", why not use it?

>>> +#define kvm_for_each_vcpu(vcpu, kvm) \
>>> +     list_for_each_entry_rcu(vcpu,&kvm->vcpus, list)
>>
>> Is this macro really worth it?
>> _rcu shows readers important information, I think.
>>
> I guest kvm_for_each_vcpu is designed for hiding the details of
> internal implement, and currently it is implemented by array, and my
> patch will change it to linked-list,
> so IMO, we can still hide the details.

Then why are you doing
	list_add_rcu(&vcpu->list, &kvm->vcpus);
without introducing kvm_add_vcpu()?

You are just hiding part of the interface.
I believe this kind of incomplete abstraction should not be added.

The original code was complex enough to introduce a macro, but
list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
is simple and shows clear meaning by itself.

	Takuya

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5] kvm: make vcpu life cycle separated from kvm instance
  2011-12-17  3:19             ` [PATCH v5] " Liu Ping Fan
@ 2011-12-26 11:09               ` Gleb Natapov
  2011-12-26 11:17                 ` Avi Kivity
  2011-12-27  7:53                 ` Liu ping fan
  2011-12-27  8:38               ` [PATCH v6] " Liu Ping Fan
  2012-01-07  2:55               ` [PATCH v7] " Liu Ping Fan
  2 siblings, 2 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-12-26 11:09 UTC (permalink / raw)
  To: Liu Ping Fan; +Cc: kvm, linux-kernel, avi, aliguori, mtosatti, jan.kiszka

On Sat, Dec 17, 2011 at 11:19:35AM +0800, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> 
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction before kvm instance, so vcpu MUST
> and CAN be destroyed before kvm's destroy.
> 
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/i8254.c     |   10 +++--
>  arch/x86/kvm/i8259.c     |   12 ++++--
>  arch/x86/kvm/x86.c       |   53 +++++++++++------------
>  include/linux/kvm_host.h |   20 ++++-----
>  virt/kvm/irq_comm.c      |    6 ++-
>  virt/kvm/kvm_main.c      |  106 ++++++++++++++++++++++++++++++++++-----------
>  6 files changed, 132 insertions(+), 75 deletions(-)
> 
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index 76e3f1c..a3a5506 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -289,9 +289,8 @@ static void pit_do_work(struct work_struct *work)
>  	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
>  	struct kvm *kvm = pit->kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
>  	struct kvm_kpit_state *ps = &pit->pit_state;
> -	int inject = 0;
> +	int idx, inject = 0;
>  
>  	/* Try to inject pending interrupts when
>  	 * last one has been acked.
> @@ -315,9 +314,12 @@ static void pit_do_work(struct work_struct *work)
>  		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>  		 * VCPU0, and only if its LVT0 is in EXTINT mode.
>  		 */
> -		if (kvm->arch.vapics_in_nmi_mode > 0)
> -			kvm_for_each_vcpu(i, vcpu, kvm)
> +		if (kvm->arch.vapics_in_nmi_mode > 0) {
> +			idx = srcu_read_lock(&kvm->srcu_vcpus);
> +			kvm_for_each_vcpu(vcpu, kvm)
>  				kvm_apic_nmi_wd_deliver(vcpu);
> +			srcu_read_unlock(&kvm->srcu_vcpus, idx);
> +		}
>  	}
>  }
>  
> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
> index cac4746..5ef5c05 100644
> --- a/arch/x86/kvm/i8259.c
> +++ b/arch/x86/kvm/i8259.c
> @@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
>  {
>  	bool wakeup = s->wakeup_needed;
>  	struct kvm_vcpu *vcpu, *found = NULL;
> -	int i;
> +	struct kvm *kvm = s->kvm;
> +	int idx;
>  
>  	s->wakeup_needed = false;
>  
>  	spin_unlock(&s->lock);
>  
>  	if (wakeup) {
> -		kvm_for_each_vcpu(i, vcpu, s->kvm) {
> +		idx = srcu_read_lock(&kvm->srcu_vcpus);
> +		kvm_for_each_vcpu(vcpu, kvm)
>  			if (kvm_apic_accept_pic_intr(vcpu)) {
>  				found = vcpu;
>  				break;
>  			}
> -		}
>  
> -		if (!found)
> +		if (!found) {
> +			srcu_read_unlock(&kvm->srcu_vcpus, idx);
>  			return;
> +		}
>  
>  		kvm_make_request(KVM_REQ_EVENT, found);
>  		kvm_vcpu_kick(found);
> +		srcu_read_unlock(&kvm->srcu_vcpus, idx);
>  	}
>  }
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 23c93fe..b79739d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1774,14 +1774,20 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>  static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>  {
>  	u64 data = 0;
> +	int idx;
>  
>  	switch (msr) {
>  	case HV_X64_MSR_VP_INDEX: {
> -		int r;
> +		int r = 0;
>  		struct kvm_vcpu *v;
> -		kvm_for_each_vcpu(r, v, vcpu->kvm)
> +		struct kvm *kvm = vcpu->kvm;
> +		idx = srcu_read_lock(&kvm->srcu_vcpus);
> +		kvm_for_each_vcpu(v, vcpu->kvm) {
>  			if (v == vcpu)
>  				data = r;
> +			r++;
> +		}
> +		srcu_read_unlock(&kvm->srcu_vcpus, idx);
>  		break;
>  	}
>  	case HV_X64_MSR_EOI:
> @@ -4529,7 +4535,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>  	struct cpufreq_freqs *freq = data;
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i, send_ipi = 0;
> +	int idx, send_ipi = 0;
>  
>  	/*
>  	 * We allow guests to temporarily run on slowing clocks,
> @@ -4579,13 +4585,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>  
>  	raw_spin_lock(&kvm_lock);
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
> -		kvm_for_each_vcpu(i, vcpu, kvm) {
> +		idx = srcu_read_lock(&kvm->srcu_vcpus);
> +		kvm_for_each_vcpu(vcpu, kvm) {
>  			if (vcpu->cpu != freq->cpu)
>  				continue;
>  			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  			if (vcpu->cpu != smp_processor_id())
>  				send_ipi = 1;
>  		}
> +		srcu_read_unlock(&kvm->srcu_vcpus, idx);
> +
>  	}
>  	raw_spin_unlock(&kvm_lock);
>  
> @@ -5866,13 +5875,17 @@ int kvm_arch_hardware_enable(void *garbage)
>  {
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> +	int idx;
>  
>  	kvm_shared_msr_cpu_online();
> -	list_for_each_entry(kvm, &vm_list, vm_list)
> -		kvm_for_each_vcpu(i, vcpu, kvm)
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		idx = srcu_read_lock(&kvm->srcu_vcpus);
> +		kvm_for_each_vcpu(vcpu, kvm) {
>  			if (vcpu->cpu == smp_processor_id())
>  				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +		}
> +		srcu_read_unlock(&kvm->srcu_vcpus, idx);
> +	}
>  	return kvm_x86_ops->hardware_enable(garbage);
>  }
>  
> @@ -5989,27 +6002,14 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
>  	vcpu_put(vcpu);
>  }
>  
> -static void kvm_free_vcpus(struct kvm *kvm)
> +void kvm_arch_vcpu_zap(struct kvm_vcpu *vcpu)
>  {
> -	unsigned int i;
> -	struct kvm_vcpu *vcpu;
> -
> -	/*
> -	 * Unpin any mmu pages first.
> -	 */
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> -		kvm_clear_async_pf_completion_queue(vcpu);
> -		kvm_unload_vcpu_mmu(vcpu);
> -	}
> -	kvm_for_each_vcpu(i, vcpu, kvm)
> -		kvm_arch_vcpu_free(vcpu);
> -
> -	mutex_lock(&kvm->lock);
> -	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
> -		kvm->vcpus[i] = NULL;
> +	struct kvm *kvm = vcpu->kvm;
>  
> -	atomic_set(&kvm->online_vcpus, 0);
> -	mutex_unlock(&kvm->lock);
> +	kvm_clear_async_pf_completion_queue(vcpu);
> +	kvm_unload_vcpu_mmu(vcpu);
> +	kvm_arch_vcpu_free(vcpu);
> +	kvm_put_kvm(kvm);
>  }
>  
>  void kvm_arch_sync_events(struct kvm *kvm)
> @@ -6023,7 +6023,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_iommu_unmap_guest(kvm);
>  	kfree(kvm->arch.vpic);
>  	kfree(kvm->arch.vioapic);
> -	kvm_free_vcpus(kvm);
>  	if (kvm->arch.apic_access_page)
>  		put_page(kvm->arch.apic_access_page);
>  	if (kvm->arch.ept_identity_pagetable)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 8c5c303..ab22828 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -115,6 +115,7 @@ enum {
>  
>  struct kvm_vcpu {
>  	struct kvm *kvm;
> +	struct list_head list;
>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>  	struct preempt_notifier preempt_notifier;
>  #endif
> @@ -249,13 +250,15 @@ struct kvm {
>  	struct mm_struct *mm; /* userspace tied to this vm */
>  	struct kvm_memslots *memslots;
>  	struct srcu_struct srcu;
> +	struct srcu_struct srcu_vcpus;
> +
>  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>  	u32 bsp_vcpu_id;
>  	struct kvm_vcpu *bsp_vcpu;
Rebase to latest kvm.git.

>  #endif
> -	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> +	struct list_head vcpus;
>  	atomic_t online_vcpus;
> -	int last_boosted_vcpu;
> +	struct kvm_vcpu *last_boosted_vcpu;
>  	struct list_head vm_list;
>  	struct mutex lock;
>  	struct kvm_io_bus *buses[KVM_NR_BUSES];
> @@ -302,17 +305,10 @@ struct kvm {
>  #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
>  #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
>  
> -static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
> -{
> -	smp_rmb();
> -	return kvm->vcpus[i];
> -}
> +void kvm_arch_vcpu_zap(struct kvm_vcpu *vcpu);
>  
> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
> -	for (idx = 0; \
> -	     idx < atomic_read(&kvm->online_vcpus) && \
> -	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
> -	     idx++)
> +#define kvm_for_each_vcpu(vcpu, kvm) \
> +	list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
>  
>  #define kvm_for_each_memslot(memslot, slots)	\
>  	for (memslot = &slots->memslots[0];	\
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index 9f614b4..78dc97c 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -81,14 +81,15 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  		struct kvm_lapic_irq *irq)
>  {
> -	int i, r = -1;
> +	int idx, r = -1;
>  	struct kvm_vcpu *vcpu, *lowest = NULL;
>  
>  	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
>  			kvm_is_dm_lowest_prio(irq))
>  		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
>  
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> +	idx = srcu_read_lock(&kvm->srcu_vcpus);
> +	kvm_for_each_vcpu(vcpu, kvm) {
>  		if (!kvm_apic_present(vcpu))
>  			continue;
>  
> @@ -111,6 +112,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>  	if (lowest)
>  		r = kvm_apic_set_irq(lowest, irq);
>  
> +	srcu_read_unlock(&kvm->srcu_vcpus, idx);
>  	return r;
>  }
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index e289486..ec0c920 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -171,7 +171,7 @@ static void ack_flush(void *_completed)
>  
>  static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  {
> -	int i, cpu, me;
> +	int cpu, me, idx;
>  	cpumask_var_t cpus;
>  	bool called = true;
>  	struct kvm_vcpu *vcpu;
> @@ -179,7 +179,8 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
>  
>  	me = get_cpu();
> -	kvm_for_each_vcpu(i, vcpu, kvm) {
> +	idx = srcu_read_lock(&kvm->srcu_vcpus);
> +	kvm_for_each_vcpu(vcpu, kvm) {
>  		kvm_make_request(req, vcpu);
>  		cpu = vcpu->cpu;
>  
> @@ -190,12 +191,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>  		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
>  			cpumask_set_cpu(cpu, cpus);
>  	}
> +	srcu_read_unlock(&kvm->srcu_vcpus, idx);
> +
>  	if (unlikely(cpus == NULL))
>  		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
>  	else if (!cpumask_empty(cpus))
>  		smp_call_function_many(cpus, ack_flush, NULL, 1);
>  	else
>  		called = false;
> +
>  	put_cpu();
>  	free_cpumask_var(cpus);
>  	return called;
> @@ -477,6 +481,8 @@ static struct kvm *kvm_create_vm(void)
>  	kvm_init_memslots_id(kvm);
>  	if (init_srcu_struct(&kvm->srcu))
>  		goto out_err_nosrcu;
> +	if (init_srcu_struct(&kvm->srcu_vcpus))
> +		goto out_err_nosrcu_vcpus;
>  	for (i = 0; i < KVM_NR_BUSES; i++) {
>  		kvm->buses[i] = kzalloc(sizeof(struct kvm_io_bus),
>  					GFP_KERNEL);
> @@ -500,10 +506,13 @@ static struct kvm *kvm_create_vm(void)
>  	raw_spin_lock(&kvm_lock);
>  	list_add(&kvm->vm_list, &vm_list);
>  	raw_spin_unlock(&kvm_lock);
> +	INIT_LIST_HEAD(&kvm->vcpus);
>  
>  	return kvm;
>  
>  out_err:
> +	cleanup_srcu_struct(&kvm->srcu_vcpus);
> +out_err_nosrcu_vcpus:
>  	cleanup_srcu_struct(&kvm->srcu);
>  out_err_nosrcu:
>  	hardware_disable_all();
> @@ -587,6 +596,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  	kvm_arch_destroy_vm(kvm);
>  	kvm_free_physmem(kvm);
>  	cleanup_srcu_struct(&kvm->srcu);
> +	cleanup_srcu_struct(&kvm->srcu_vcpus);
>  	kvm_arch_free_vm(kvm);
>  	hardware_disable_all();
>  	mmdrop(mm);
> @@ -1593,11 +1603,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  {
>  	struct kvm *kvm = me->kvm;
>  	struct kvm_vcpu *vcpu;
> -	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> -	int yielded = 0;
> -	int pass;
> -	int i;
> -
> +	struct task_struct *task = NULL;
> +	struct pid *pid;
> +	int pass, firststart, lastone, yielded, idx;
>  	/*
>  	 * We boost the priority of a VCPU that is runnable but not
>  	 * currently running, because it got preempted by something
> @@ -1605,15 +1613,22 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 * VCPU is holding the lock that we need and will release it.
>  	 * We approximate round-robin by starting at the last boosted VCPU.
>  	 */
> -	for (pass = 0; pass < 2 && !yielded; pass++) {
> -		kvm_for_each_vcpu(i, vcpu, kvm) {
> -			struct task_struct *task = NULL;
> -			struct pid *pid;
> -			if (!pass && i < last_boosted_vcpu) {
> -				i = last_boosted_vcpu;
> +	for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
> +
> +		idx = srcu_read_lock(&kvm->srcu_vcpus);
> +		kvm_for_each_vcpu(vcpu, kvm) {
> +			if (!pass && !firststart &&
> +			    vcpu != kvm->last_boosted_vcpu &&
> +			    kvm->last_boosted_vcpu != NULL) {
> +				vcpu = kvm->last_boosted_vcpu;
You access last_boosted_vcpu as if it is protected by srcu, but it
isn't. kvm_vcpu_release() changes it after synchronize_srcu_expedited()
call.

I do not like this last_boosted_vcpu pointer much. May be we can rid of
it by remembering last apic_id and searching for it each time we enter
the function. I do not think this function is to performance sensitive.
We enter here when vcpu is spinning anyway.

> +				firststart = 1;
>  				continue;
> -			} else if (pass && i > last_boosted_vcpu)
> +			} else if (pass && !lastone) {
> +				if (vcpu == kvm->last_boosted_vcpu)
> +					lastone = 1;
> +			} else if (pass && lastone)
>  				break;
> +
>  			if (vcpu == me)
>  				continue;
>  			if (waitqueue_active(&vcpu->wq))
> @@ -1629,15 +1644,20 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  				put_task_struct(task);
>  				continue;
>  			}
> +
>  			if (yield_to(task, 1)) {
>  				put_task_struct(task);
> -				kvm->last_boosted_vcpu = i;
> +				mutex_lock(&kvm->lock);
> +				kvm->last_boosted_vcpu = vcpu;
> +				mutex_unlock(&kvm->lock);
>  				yielded = 1;
>  				break;
>  			}
>  			put_task_struct(task);
>  		}
> +		srcu_read_unlock(&kvm->srcu_vcpus, idx);
>  	}
> +
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>  
> @@ -1673,11 +1693,30 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  	return 0;
>  }
>  
> +static void kvm_vcpu_zap(struct kvm_vcpu *vcpu)
> +{
> +	kvm_arch_vcpu_zap(vcpu);
> +}
> +
>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>  {
>  	struct kvm_vcpu *vcpu = filp->private_data;
> +	struct kvm *kvm = vcpu->kvm;
> +	filp->private_data = NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	list_del_rcu(&vcpu->list);
> +	atomic_dec(&kvm->online_vcpus);
> +	mutex_unlock(&kvm->lock);
> +	synchronize_srcu_expedited(&kvm->srcu_vcpus);
> +
> +	mutex_lock(&kvm->lock);
> +	if (kvm->last_boosted_vcpu == vcpu)
> +		kvm->last_boosted_vcpu = NULL;
> +	mutex_unlock(&kvm->lock);
>  
> -	kvm_put_kvm(vcpu->kvm);
> +	/*vcpu is out of list,drop it safely*/
> +	kvm_vcpu_zap(vcpu);
>  	return 0;
>  }
>  
> @@ -1699,15 +1738,25 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
>  	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
>  }
>  
> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
> +{
> +	struct kvm_vcpu *vcpu;
> +	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	if (IS_ERR(vcpu))
> +		return vcpu;
> +	INIT_LIST_HEAD(&vcpu->list);
> +	return vcpu;
> +}
> +
>  /*
>   * Creates some virtual cpus.  Good luck creating more than one.
>   */
>  static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  {
> -	int r;
> +	int r, idx;
>  	struct kvm_vcpu *vcpu, *v;
>  
> -	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	vcpu = kvm_vcpu_create(kvm, id);
>  	if (IS_ERR(vcpu))
>  		return PTR_ERR(vcpu);
>  
> @@ -1723,13 +1772,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  		goto unlock_vcpu_destroy;
>  	}
>  
> -	kvm_for_each_vcpu(r, v, kvm)
> +	idx = srcu_read_lock(&kvm->srcu_vcpus);
> +	kvm_for_each_vcpu(v, kvm) {
>  		if (v->vcpu_id == id) {
>  			r = -EEXIST;
> +			srcu_read_unlock(&kvm->srcu_vcpus, idx);
>  			goto unlock_vcpu_destroy;
>  		}
> -
> -	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
> +	}
> +	srcu_read_unlock(&kvm->srcu_vcpus, idx);
>  
>  	/* Now it's all set up, let userspace reach it */
>  	kvm_get_kvm(kvm);
> @@ -1739,8 +1790,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  		goto unlock_vcpu_destroy;
>  	}
>  
> -	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
> -	smp_wmb();
> +	/*Protected by kvm->lock*/
> +	list_add_rcu(&vcpu->list, &kvm->vcpus);
>  	atomic_inc(&kvm->online_vcpus);
>  
>  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> @@ -2645,13 +2696,16 @@ static int vcpu_stat_get(void *_offset, u64 *val)
>  	unsigned offset = (long)_offset;
>  	struct kvm *kvm;
>  	struct kvm_vcpu *vcpu;
> -	int i;
> +	int idx;
>  
>  	*val = 0;
>  	raw_spin_lock(&kvm_lock);
> -	list_for_each_entry(kvm, &vm_list, vm_list)
> -		kvm_for_each_vcpu(i, vcpu, kvm)
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		idx = srcu_read_lock(&kvm->srcu_vcpus);
> +		kvm_for_each_vcpu(vcpu, kvm)
>  			*val += *(u32 *)((void *)vcpu + offset);
> +		srcu_read_unlock(&kvm->srcu_vcpus, idx);
> +	}
>  
>  	raw_spin_unlock(&kvm_lock);
>  	return 0;
> -- 
> 1.7.4.4

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5] kvm: make vcpu life cycle separated from kvm instance
  2011-12-26 11:09               ` Gleb Natapov
@ 2011-12-26 11:17                 ` Avi Kivity
  2011-12-26 11:21                   ` Gleb Natapov
  2011-12-27  7:53                 ` Liu ping fan
  1 sibling, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-26 11:17 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: Liu Ping Fan, kvm, linux-kernel, aliguori, mtosatti, jan.kiszka

On 12/26/2011 01:09 PM, Gleb Natapov wrote:
> > +
> > +		idx = srcu_read_lock(&kvm->srcu_vcpus);
> > +		kvm_for_each_vcpu(vcpu, kvm) {
> > +			if (!pass && !firststart &&
> > +			    vcpu != kvm->last_boosted_vcpu &&
> > +			    kvm->last_boosted_vcpu != NULL) {
> > +				vcpu = kvm->last_boosted_vcpu;
> You access last_boosted_vcpu as if it is protected by srcu, but it
> isn't. kvm_vcpu_release() changes it after synchronize_srcu_expedited()
> call.
>
> I do not like this last_boosted_vcpu pointer much. May be we can rid of
> it by remembering last apic_id and searching for it each time we enter
> the function. I do not think this function is to performance sensitive.
> We enter here when vcpu is spinning anyway.

We aren't guaranteed to have an apic_id, so it has to be done using rcu,
or maybe vcpu_id.  I prefer using srcu, we can't run away from vcpu
pointers.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5] kvm: make vcpu life cycle separated from kvm instance
  2011-12-26 11:17                 ` Avi Kivity
@ 2011-12-26 11:21                   ` Gleb Natapov
  0 siblings, 0 replies; 78+ messages in thread
From: Gleb Natapov @ 2011-12-26 11:21 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Liu Ping Fan, kvm, linux-kernel, aliguori, mtosatti, jan.kiszka

On Mon, Dec 26, 2011 at 01:17:39PM +0200, Avi Kivity wrote:
> On 12/26/2011 01:09 PM, Gleb Natapov wrote:
> > > +
> > > +		idx = srcu_read_lock(&kvm->srcu_vcpus);
> > > +		kvm_for_each_vcpu(vcpu, kvm) {
> > > +			if (!pass && !firststart &&
> > > +			    vcpu != kvm->last_boosted_vcpu &&
> > > +			    kvm->last_boosted_vcpu != NULL) {
> > > +				vcpu = kvm->last_boosted_vcpu;
> > You access last_boosted_vcpu as if it is protected by srcu, but it
> > isn't. kvm_vcpu_release() changes it after synchronize_srcu_expedited()
> > call.
> >
> > I do not like this last_boosted_vcpu pointer much. May be we can rid of
> > it by remembering last apic_id and searching for it each time we enter
> > the function. I do not think this function is to performance sensitive.
> > We enter here when vcpu is spinning anyway.
> 
> We aren't guaranteed to have an apic_id, so it has to be done using rcu,
> or maybe vcpu_id.  I prefer using srcu, we can't run away from vcpu
> pointers.
> 
Yeah, I meant vcpu_id (it is used as initial apic_id for x86, but this
code is not x86 specific).

--
			Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v5] kvm: make vcpu life cycle separated from kvm instance
  2011-12-26 11:09               ` Gleb Natapov
  2011-12-26 11:17                 ` Avi Kivity
@ 2011-12-27  7:53                 ` Liu ping fan
  1 sibling, 0 replies; 78+ messages in thread
From: Liu ping fan @ 2011-12-27  7:53 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: kvm, linux-kernel, avi, aliguori, mtosatti, jan.kiszka, Xiao Guangrong

On Mon, Dec 26, 2011 at 7:09 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Sat, Dec 17, 2011 at 11:19:35AM +0800, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction before kvm instance, so vcpu MUST
>> and CAN be destroyed before kvm's destroy.
>>
>> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>> ---
>>  arch/x86/kvm/i8254.c     |   10 +++--
>>  arch/x86/kvm/i8259.c     |   12 ++++--
>>  arch/x86/kvm/x86.c       |   53 +++++++++++------------
>>  include/linux/kvm_host.h |   20 ++++-----
>>  virt/kvm/irq_comm.c      |    6 ++-
>>  virt/kvm/kvm_main.c      |  106 ++++++++++++++++++++++++++++++++++-----------
>>  6 files changed, 132 insertions(+), 75 deletions(-)
>>
>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>> index 76e3f1c..a3a5506 100644
>> --- a/arch/x86/kvm/i8254.c
>> +++ b/arch/x86/kvm/i8254.c
>> @@ -289,9 +289,8 @@ static void pit_do_work(struct work_struct *work)
>>       struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
>>       struct kvm *kvm = pit->kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i;
>>       struct kvm_kpit_state *ps = &pit->pit_state;
>> -     int inject = 0;
>> +     int idx, inject = 0;
>>
>>       /* Try to inject pending interrupts when
>>        * last one has been acked.
>> @@ -315,9 +314,12 @@ static void pit_do_work(struct work_struct *work)
>>                * LVT0 to NMI delivery. Other PIC interrupts are just sent to
>>                * VCPU0, and only if its LVT0 is in EXTINT mode.
>>                */
>> -             if (kvm->arch.vapics_in_nmi_mode > 0)
>> -                     kvm_for_each_vcpu(i, vcpu, kvm)
>> +             if (kvm->arch.vapics_in_nmi_mode > 0) {
>> +                     idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +                     kvm_for_each_vcpu(vcpu, kvm)
>>                               kvm_apic_nmi_wd_deliver(vcpu);
>> +                     srcu_read_unlock(&kvm->srcu_vcpus, idx);
>> +             }
>>       }
>>  }
>>
>> diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
>> index cac4746..5ef5c05 100644
>> --- a/arch/x86/kvm/i8259.c
>> +++ b/arch/x86/kvm/i8259.c
>> @@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
>>  {
>>       bool wakeup = s->wakeup_needed;
>>       struct kvm_vcpu *vcpu, *found = NULL;
>> -     int i;
>> +     struct kvm *kvm = s->kvm;
>> +     int idx;
>>
>>       s->wakeup_needed = false;
>>
>>       spin_unlock(&s->lock);
>>
>>       if (wakeup) {
>> -             kvm_for_each_vcpu(i, vcpu, s->kvm) {
>> +             idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +             kvm_for_each_vcpu(vcpu, kvm)
>>                       if (kvm_apic_accept_pic_intr(vcpu)) {
>>                               found = vcpu;
>>                               break;
>>                       }
>> -             }
>>
>> -             if (!found)
>> +             if (!found) {
>> +                     srcu_read_unlock(&kvm->srcu_vcpus, idx);
>>                       return;
>> +             }
>>
>>               kvm_make_request(KVM_REQ_EVENT, found);
>>               kvm_vcpu_kick(found);
>> +             srcu_read_unlock(&kvm->srcu_vcpus, idx);
>>       }
>>  }
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 23c93fe..b79739d 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1774,14 +1774,20 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>>  static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>>  {
>>       u64 data = 0;
>> +     int idx;
>>
>>       switch (msr) {
>>       case HV_X64_MSR_VP_INDEX: {
>> -             int r;
>> +             int r = 0;
>>               struct kvm_vcpu *v;
>> -             kvm_for_each_vcpu(r, v, vcpu->kvm)
>> +             struct kvm *kvm = vcpu->kvm;
>> +             idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +             kvm_for_each_vcpu(v, vcpu->kvm) {
>>                       if (v == vcpu)
>>                               data = r;
>> +                     r++;
>> +             }
>> +             srcu_read_unlock(&kvm->srcu_vcpus, idx);
>>               break;
>>       }
>>       case HV_X64_MSR_EOI:
>> @@ -4529,7 +4535,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>>       struct cpufreq_freqs *freq = data;
>>       struct kvm *kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i, send_ipi = 0;
>> +     int idx, send_ipi = 0;
>>
>>       /*
>>        * We allow guests to temporarily run on slowing clocks,
>> @@ -4579,13 +4585,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
>>
>>       raw_spin_lock(&kvm_lock);
>>       list_for_each_entry(kvm, &vm_list, vm_list) {
>> -             kvm_for_each_vcpu(i, vcpu, kvm) {
>> +             idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +             kvm_for_each_vcpu(vcpu, kvm) {
>>                       if (vcpu->cpu != freq->cpu)
>>                               continue;
>>                       kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>                       if (vcpu->cpu != smp_processor_id())
>>                               send_ipi = 1;
>>               }
>> +             srcu_read_unlock(&kvm->srcu_vcpus, idx);
>> +
>>       }
>>       raw_spin_unlock(&kvm_lock);
>>
>> @@ -5866,13 +5875,17 @@ int kvm_arch_hardware_enable(void *garbage)
>>  {
>>       struct kvm *kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i;
>> +     int idx;
>>
>>       kvm_shared_msr_cpu_online();
>> -     list_for_each_entry(kvm, &vm_list, vm_list)
>> -             kvm_for_each_vcpu(i, vcpu, kvm)
>> +     list_for_each_entry(kvm, &vm_list, vm_list) {
>> +             idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +             kvm_for_each_vcpu(vcpu, kvm) {
>>                       if (vcpu->cpu == smp_processor_id())
>>                               kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>> +             }
>> +             srcu_read_unlock(&kvm->srcu_vcpus, idx);
>> +     }
>>       return kvm_x86_ops->hardware_enable(garbage);
>>  }
>>
>> @@ -5989,27 +6002,14 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
>>       vcpu_put(vcpu);
>>  }
>>
>> -static void kvm_free_vcpus(struct kvm *kvm)
>> +void kvm_arch_vcpu_zap(struct kvm_vcpu *vcpu)
>>  {
>> -     unsigned int i;
>> -     struct kvm_vcpu *vcpu;
>> -
>> -     /*
>> -      * Unpin any mmu pages first.
>> -      */
>> -     kvm_for_each_vcpu(i, vcpu, kvm) {
>> -             kvm_clear_async_pf_completion_queue(vcpu);
>> -             kvm_unload_vcpu_mmu(vcpu);
>> -     }
>> -     kvm_for_each_vcpu(i, vcpu, kvm)
>> -             kvm_arch_vcpu_free(vcpu);
>> -
>> -     mutex_lock(&kvm->lock);
>> -     for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
>> -             kvm->vcpus[i] = NULL;
>> +     struct kvm *kvm = vcpu->kvm;
>>
>> -     atomic_set(&kvm->online_vcpus, 0);
>> -     mutex_unlock(&kvm->lock);
>> +     kvm_clear_async_pf_completion_queue(vcpu);
>> +     kvm_unload_vcpu_mmu(vcpu);
>> +     kvm_arch_vcpu_free(vcpu);
>> +     kvm_put_kvm(kvm);
>>  }
>>
>>  void kvm_arch_sync_events(struct kvm *kvm)
>> @@ -6023,7 +6023,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>>       kvm_iommu_unmap_guest(kvm);
>>       kfree(kvm->arch.vpic);
>>       kfree(kvm->arch.vioapic);
>> -     kvm_free_vcpus(kvm);
>>       if (kvm->arch.apic_access_page)
>>               put_page(kvm->arch.apic_access_page);
>>       if (kvm->arch.ept_identity_pagetable)
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 8c5c303..ab22828 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -115,6 +115,7 @@ enum {
>>
>>  struct kvm_vcpu {
>>       struct kvm *kvm;
>> +     struct list_head list;
>>  #ifdef CONFIG_PREEMPT_NOTIFIERS
>>       struct preempt_notifier preempt_notifier;
>>  #endif
>> @@ -249,13 +250,15 @@ struct kvm {
>>       struct mm_struct *mm; /* userspace tied to this vm */
>>       struct kvm_memslots *memslots;
>>       struct srcu_struct srcu;
>> +     struct srcu_struct srcu_vcpus;
>> +
>>  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>>       u32 bsp_vcpu_id;
>>       struct kvm_vcpu *bsp_vcpu;
> Rebase to latest kvm.git.
>
>>  #endif
>> -     struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>> +     struct list_head vcpus;
>>       atomic_t online_vcpus;
>> -     int last_boosted_vcpu;
>> +     struct kvm_vcpu *last_boosted_vcpu;
>>       struct list_head vm_list;
>>       struct mutex lock;
>>       struct kvm_io_bus *buses[KVM_NR_BUSES];
>> @@ -302,17 +305,10 @@ struct kvm {
>>  #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
>>  #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
>>
>> -static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
>> -{
>> -     smp_rmb();
>> -     return kvm->vcpus[i];
>> -}
>> +void kvm_arch_vcpu_zap(struct kvm_vcpu *vcpu);
>>
>> -#define kvm_for_each_vcpu(idx, vcpup, kvm) \
>> -     for (idx = 0; \
>> -          idx < atomic_read(&kvm->online_vcpus) && \
>> -          (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
>> -          idx++)
>> +#define kvm_for_each_vcpu(vcpu, kvm) \
>> +     list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
>>
>>  #define kvm_for_each_memslot(memslot, slots) \
>>       for (memslot = &slots->memslots[0];     \
>> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
>> index 9f614b4..78dc97c 100644
>> --- a/virt/kvm/irq_comm.c
>> +++ b/virt/kvm/irq_comm.c
>> @@ -81,14 +81,15 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
>>  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>>               struct kvm_lapic_irq *irq)
>>  {
>> -     int i, r = -1;
>> +     int idx, r = -1;
>>       struct kvm_vcpu *vcpu, *lowest = NULL;
>>
>>       if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
>>                       kvm_is_dm_lowest_prio(irq))
>>               printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
>>
>> -     kvm_for_each_vcpu(i, vcpu, kvm) {
>> +     idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +     kvm_for_each_vcpu(vcpu, kvm) {
>>               if (!kvm_apic_present(vcpu))
>>                       continue;
>>
>> @@ -111,6 +112,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
>>       if (lowest)
>>               r = kvm_apic_set_irq(lowest, irq);
>>
>> +     srcu_read_unlock(&kvm->srcu_vcpus, idx);
>>       return r;
>>  }
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index e289486..ec0c920 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -171,7 +171,7 @@ static void ack_flush(void *_completed)
>>
>>  static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>>  {
>> -     int i, cpu, me;
>> +     int cpu, me, idx;
>>       cpumask_var_t cpus;
>>       bool called = true;
>>       struct kvm_vcpu *vcpu;
>> @@ -179,7 +179,8 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>>       zalloc_cpumask_var(&cpus, GFP_ATOMIC);
>>
>>       me = get_cpu();
>> -     kvm_for_each_vcpu(i, vcpu, kvm) {
>> +     idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +     kvm_for_each_vcpu(vcpu, kvm) {
>>               kvm_make_request(req, vcpu);
>>               cpu = vcpu->cpu;
>>
>> @@ -190,12 +191,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
>>                     kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
>>                       cpumask_set_cpu(cpu, cpus);
>>       }
>> +     srcu_read_unlock(&kvm->srcu_vcpus, idx);
>> +
>>       if (unlikely(cpus == NULL))
>>               smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
>>       else if (!cpumask_empty(cpus))
>>               smp_call_function_many(cpus, ack_flush, NULL, 1);
>>       else
>>               called = false;
>> +
>>       put_cpu();
>>       free_cpumask_var(cpus);
>>       return called;
>> @@ -477,6 +481,8 @@ static struct kvm *kvm_create_vm(void)
>>       kvm_init_memslots_id(kvm);
>>       if (init_srcu_struct(&kvm->srcu))
>>               goto out_err_nosrcu;
>> +     if (init_srcu_struct(&kvm->srcu_vcpus))
>> +             goto out_err_nosrcu_vcpus;
>>       for (i = 0; i < KVM_NR_BUSES; i++) {
>>               kvm->buses[i] = kzalloc(sizeof(struct kvm_io_bus),
>>                                       GFP_KERNEL);
>> @@ -500,10 +506,13 @@ static struct kvm *kvm_create_vm(void)
>>       raw_spin_lock(&kvm_lock);
>>       list_add(&kvm->vm_list, &vm_list);
>>       raw_spin_unlock(&kvm_lock);
>> +     INIT_LIST_HEAD(&kvm->vcpus);
>>
>>       return kvm;
>>
>>  out_err:
>> +     cleanup_srcu_struct(&kvm->srcu_vcpus);
>> +out_err_nosrcu_vcpus:
>>       cleanup_srcu_struct(&kvm->srcu);
>>  out_err_nosrcu:
>>       hardware_disable_all();
>> @@ -587,6 +596,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>>       kvm_arch_destroy_vm(kvm);
>>       kvm_free_physmem(kvm);
>>       cleanup_srcu_struct(&kvm->srcu);
>> +     cleanup_srcu_struct(&kvm->srcu_vcpus);
>>       kvm_arch_free_vm(kvm);
>>       hardware_disable_all();
>>       mmdrop(mm);
>> @@ -1593,11 +1603,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>  {
>>       struct kvm *kvm = me->kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>> -     int yielded = 0;
>> -     int pass;
>> -     int i;
>> -
>> +     struct task_struct *task = NULL;
>> +     struct pid *pid;
>> +     int pass, firststart, lastone, yielded, idx;
>>       /*
>>        * We boost the priority of a VCPU that is runnable but not
>>        * currently running, because it got preempted by something
>> @@ -1605,15 +1613,22 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>        * VCPU is holding the lock that we need and will release it.
>>        * We approximate round-robin by starting at the last boosted VCPU.
>>        */
>> -     for (pass = 0; pass < 2 && !yielded; pass++) {
>> -             kvm_for_each_vcpu(i, vcpu, kvm) {
>> -                     struct task_struct *task = NULL;
>> -                     struct pid *pid;
>> -                     if (!pass && i < last_boosted_vcpu) {
>> -                             i = last_boosted_vcpu;
>> +     for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
>> +
>> +             idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +             kvm_for_each_vcpu(vcpu, kvm) {
>> +                     if (!pass && !firststart &&
>> +                         vcpu != kvm->last_boosted_vcpu &&
>> +                         kvm->last_boosted_vcpu != NULL) {
>> +                             vcpu = kvm->last_boosted_vcpu;
> You access last_boosted_vcpu as if it is protected by srcu, but it
> isn't. kvm_vcpu_release() changes it after synchronize_srcu_expedited()
> call.
>
Oh, get it. It opens a gap to make the access to the reclaimed vcpu possible.

> I do not like this last_boosted_vcpu pointer much. May be we can rid of
> it by remembering last apic_id and searching for it each time we enter
> the function. I do not think this function is to performance sensitive.
> We enter here when vcpu is spinning anyway.
>
Fine, I find it is very hard to protect both the rcu_list and this
pointer at the same time. And vcpu_id give me a way out.

Thanks and regards,
ping fan
>> +                             firststart = 1;
>>                               continue;
>> -                     } else if (pass && i > last_boosted_vcpu)
>> +                     } else if (pass && !lastone) {
>> +                             if (vcpu == kvm->last_boosted_vcpu)
>> +                                     lastone = 1;
>> +                     } else if (pass && lastone)
>>                               break;
>> +
>>                       if (vcpu == me)
>>                               continue;
>>                       if (waitqueue_active(&vcpu->wq))
>> @@ -1629,15 +1644,20 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>                               put_task_struct(task);
>>                               continue;
>>                       }
>> +
>>                       if (yield_to(task, 1)) {
>>                               put_task_struct(task);
>> -                             kvm->last_boosted_vcpu = i;
>> +                             mutex_lock(&kvm->lock);
>> +                             kvm->last_boosted_vcpu = vcpu;
>> +                             mutex_unlock(&kvm->lock);
>>                               yielded = 1;
>>                               break;
>>                       }
>>                       put_task_struct(task);
>>               }
>> +             srcu_read_unlock(&kvm->srcu_vcpus, idx);
>>       }
>> +
>>  }
>>  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
>>
>> @@ -1673,11 +1693,30 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>>       return 0;
>>  }
>>
>> +static void kvm_vcpu_zap(struct kvm_vcpu *vcpu)
>> +{
>> +     kvm_arch_vcpu_zap(vcpu);
>> +}
>> +
>>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>>  {
>>       struct kvm_vcpu *vcpu = filp->private_data;
>> +     struct kvm *kvm = vcpu->kvm;
>> +     filp->private_data = NULL;
>> +
>> +     mutex_lock(&kvm->lock);
>> +     list_del_rcu(&vcpu->list);
>> +     atomic_dec(&kvm->online_vcpus);
>> +     mutex_unlock(&kvm->lock);
>> +     synchronize_srcu_expedited(&kvm->srcu_vcpus);
>> +
>> +     mutex_lock(&kvm->lock);
>> +     if (kvm->last_boosted_vcpu == vcpu)
>> +             kvm->last_boosted_vcpu = NULL;
>> +     mutex_unlock(&kvm->lock);
>>
>> -     kvm_put_kvm(vcpu->kvm);
>> +     /*vcpu is out of list,drop it safely*/
>> +     kvm_vcpu_zap(vcpu);
>>       return 0;
>>  }
>>
>> @@ -1699,15 +1738,25 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
>>       return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
>>  }
>>
>> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
>> +{
>> +     struct kvm_vcpu *vcpu;
>> +     vcpu = kvm_arch_vcpu_create(kvm, id);
>> +     if (IS_ERR(vcpu))
>> +             return vcpu;
>> +     INIT_LIST_HEAD(&vcpu->list);
>> +     return vcpu;
>> +}
>> +
>>  /*
>>   * Creates some virtual cpus.  Good luck creating more than one.
>>   */
>>  static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>>  {
>> -     int r;
>> +     int r, idx;
>>       struct kvm_vcpu *vcpu, *v;
>>
>> -     vcpu = kvm_arch_vcpu_create(kvm, id);
>> +     vcpu = kvm_vcpu_create(kvm, id);
>>       if (IS_ERR(vcpu))
>>               return PTR_ERR(vcpu);
>>
>> @@ -1723,13 +1772,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>>               goto unlock_vcpu_destroy;
>>       }
>>
>> -     kvm_for_each_vcpu(r, v, kvm)
>> +     idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +     kvm_for_each_vcpu(v, kvm) {
>>               if (v->vcpu_id == id) {
>>                       r = -EEXIST;
>> +                     srcu_read_unlock(&kvm->srcu_vcpus, idx);
>>                       goto unlock_vcpu_destroy;
>>               }
>> -
>> -     BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
>> +     }
>> +     srcu_read_unlock(&kvm->srcu_vcpus, idx);
>>
>>       /* Now it's all set up, let userspace reach it */
>>       kvm_get_kvm(kvm);
>> @@ -1739,8 +1790,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>>               goto unlock_vcpu_destroy;
>>       }
>>
>> -     kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
>> -     smp_wmb();
>> +     /*Protected by kvm->lock*/
>> +     list_add_rcu(&vcpu->list, &kvm->vcpus);
>>       atomic_inc(&kvm->online_vcpus);
>>
>>  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>> @@ -2645,13 +2696,16 @@ static int vcpu_stat_get(void *_offset, u64 *val)
>>       unsigned offset = (long)_offset;
>>       struct kvm *kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int i;
>> +     int idx;
>>
>>       *val = 0;
>>       raw_spin_lock(&kvm_lock);
>> -     list_for_each_entry(kvm, &vm_list, vm_list)
>> -             kvm_for_each_vcpu(i, vcpu, kvm)
>> +     list_for_each_entry(kvm, &vm_list, vm_list) {
>> +             idx = srcu_read_lock(&kvm->srcu_vcpus);
>> +             kvm_for_each_vcpu(vcpu, kvm)
>>                       *val += *(u32 *)((void *)vcpu + offset);
>> +             srcu_read_unlock(&kvm->srcu_vcpus, idx);
>> +     }
>>
>>       raw_spin_unlock(&kvm_lock);
>>       return 0;
>> --
>> 1.7.4.4
>
> --
>                        Gleb.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-17  3:19             ` [PATCH v5] " Liu Ping Fan
  2011-12-26 11:09               ` Gleb Natapov
@ 2011-12-27  8:38               ` Liu Ping Fan
  2011-12-27 11:22                 ` Takuya Yoshikawa
  2011-12-28  9:53                 ` Avi Kivity
  2012-01-07  2:55               ` [PATCH v7] " Liu Ping Fan
  2 siblings, 2 replies; 78+ messages in thread
From: Liu Ping Fan @ 2011-12-27  8:38 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, avi, aliguori, gleb, mtosatti, xiaoguangrong.eric,
	jan.kiszka

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Currently, vcpu can be destructed only when kvm instance destroyed.
Change this to vcpu's destruction before kvm instance, so vcpu MUST
and CAN be destroyed before kvm's destroy.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
 arch/x86/kvm/i8254.c     |   10 +++--
 arch/x86/kvm/i8259.c     |   17 +++++--
 arch/x86/kvm/x86.c       |   53 +++++++++++-----------
 include/linux/kvm_host.h |   20 +++-----
 virt/kvm/irq_comm.c      |    6 ++-
 virt/kvm/kvm_main.c      |  110 +++++++++++++++++++++++++++++++++++-----------
 6 files changed, 140 insertions(+), 76 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index d68f99d..c190a55 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -289,9 +289,8 @@ static void pit_do_work(struct work_struct *work)
 	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
 	struct kvm *kvm = pit->kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
 	struct kvm_kpit_state *ps = &pit->pit_state;
-	int inject = 0;
+	int idx, inject = 0;
 
 	/* Try to inject pending interrupts when
 	 * last one has been acked.
@@ -315,9 +314,12 @@ static void pit_do_work(struct work_struct *work)
 		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 		 * VCPU0, and only if its LVT0 is in EXTINT mode.
 		 */
-		if (kvm->arch.vapics_in_nmi_mode > 0)
-			kvm_for_each_vcpu(i, vcpu, kvm)
+		if (kvm->arch.vapics_in_nmi_mode > 0) {
+			idx = srcu_read_lock(&kvm->srcu_vcpus);
+			kvm_for_each_vcpu(vcpu, kvm)
 				kvm_apic_nmi_wd_deliver(vcpu);
+			srcu_read_unlock(&kvm->srcu_vcpus, idx);
+		}
 	}
 }
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index b6a7353..029c0a8 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
 {
 	bool wakeup = s->wakeup_needed;
 	struct kvm_vcpu *vcpu, *found = NULL;
-	int i;
+	struct kvm *kvm = s->kvm;
+	int idx;
 
 	s->wakeup_needed = false;
 
 	spin_unlock(&s->lock);
 
 	if (wakeup) {
-		kvm_for_each_vcpu(i, vcpu, s->kvm) {
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm)
 			if (kvm_apic_accept_pic_intr(vcpu)) {
 				found = vcpu;
 				break;
 			}
-		}
 
-		if (!found)
+		if (!found) {
+			srcu_read_unlock(&kvm->srcu_vcpus, idx);
 			return;
+		}
 
 		kvm_make_request(KVM_REQ_EVENT, found);
 		kvm_vcpu_kick(found);
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
 	}
 }
 
@@ -265,6 +269,7 @@ void kvm_pic_reset(struct kvm_kpic_state *s)
 	int irq, i;
 	struct kvm_vcpu *vcpu;
 	u8 irr = s->irr, isr = s->imr;
+	struct kvm *kvm = s->pics_state->kvm;
 	bool found = false;
 
 	s->last_irr = 0;
@@ -282,11 +287,13 @@ void kvm_pic_reset(struct kvm_kpic_state *s)
 	s->special_fully_nested_mode = 0;
 	s->init4 = 0;
 
-	kvm_for_each_vcpu(i, vcpu, s->pics_state->kvm)
+	i = srcu_read_lock(&kvm->srcu_vcpus);
+	kvm_for_each_vcpu(vcpu, kvm)
 		if (kvm_apic_accept_pic_intr(vcpu)) {
 			found = true;
 			break;
 		}
+	srcu_read_unlock(&kvm->srcu_vcpus, i);
 
 
 	if (!found)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1171def..ff6adf8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1786,14 +1786,20 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 {
 	u64 data = 0;
+	int idx;
 
 	switch (msr) {
 	case HV_X64_MSR_VP_INDEX: {
-		int r;
+		int r = 0;
 		struct kvm_vcpu *v;
-		kvm_for_each_vcpu(r, v, vcpu->kvm)
+		struct kvm *kvm = vcpu->kvm;
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(v, vcpu->kvm) {
 			if (v == vcpu)
 				data = r;
+			r++;
+		}
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
 		break;
 	}
 	case HV_X64_MSR_EOI:
@@ -4538,7 +4544,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 	struct cpufreq_freqs *freq = data;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i, send_ipi = 0;
+	int idx, send_ipi = 0;
 
 	/*
 	 * We allow guests to temporarily run on slowing clocks,
@@ -4588,13 +4594,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 
 	raw_spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm) {
 			if (vcpu->cpu != freq->cpu)
 				continue;
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
+
 	}
 	raw_spin_unlock(&kvm_lock);
 
@@ -5881,13 +5890,17 @@ int kvm_arch_hardware_enable(void *garbage)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int idx;
 
 	kvm_shared_msr_cpu_online();
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm) {
 			if (vcpu->cpu == smp_processor_id())
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		}
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
+	}
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
@@ -6006,27 +6019,14 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
+void kvm_arch_vcpu_zap(struct kvm_vcpu *vcpu)
 {
-	unsigned int i;
-	struct kvm_vcpu *vcpu;
-
-	/*
-	 * Unpin any mmu pages first.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_unload_vcpu_mmu(vcpu);
-	}
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_arch_vcpu_free(vcpu);
-
-	mutex_lock(&kvm->lock);
-	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-		kvm->vcpus[i] = NULL;
+	struct kvm *kvm = vcpu->kvm;
 
-	atomic_set(&kvm->online_vcpus, 0);
-	mutex_unlock(&kvm->lock);
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+	kvm_put_kvm(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
@@ -6040,7 +6040,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 900c763..b88d418d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -117,6 +117,7 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
+	struct list_head list;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
@@ -251,12 +252,14 @@ struct kvm {
 	struct mm_struct *mm; /* userspace tied to this vm */
 	struct kvm_memslots *memslots;
 	struct srcu_struct srcu;
+	struct srcu_struct srcu_vcpus;
+
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
 	u32 bsp_vcpu_id;
 #endif
-	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+	struct list_head vcpus;
 	atomic_t online_vcpus;
-	int last_boosted_vcpu;
+	int last_boosted_vcpu_id;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
@@ -303,17 +306,10 @@ struct kvm {
 #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
 #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
 
-static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
-{
-	smp_rmb();
-	return kvm->vcpus[i];
-}
+void kvm_arch_vcpu_zap(struct kvm_vcpu *vcpu);
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-	for (idx = 0; \
-	     idx < atomic_read(&kvm->online_vcpus) && \
-	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
-	     idx++)
+#define kvm_for_each_vcpu(vcpu, kvm) \
+	list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
 
 #define kvm_for_each_memslot(memslot, slots)	\
 	for (memslot = &slots->memslots[0];	\
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9f614b4..78dc97c 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -81,14 +81,15 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq)
 {
-	int i, r = -1;
+	int idx, r = -1;
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 			kvm_is_dm_lowest_prio(irq))
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
 
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	idx = srcu_read_lock(&kvm->srcu_vcpus);
+	kvm_for_each_vcpu(vcpu, kvm) {
 		if (!kvm_apic_present(vcpu))
 			continue;
 
@@ -111,6 +112,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	if (lowest)
 		r = kvm_apic_set_irq(lowest, irq);
 
+	srcu_read_unlock(&kvm->srcu_vcpus, idx);
 	return r;
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7287bf5..84b413d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -171,7 +171,7 @@ static void ack_flush(void *_completed)
 
 static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
-	int i, cpu, me;
+	int cpu, me, idx;
 	cpumask_var_t cpus;
 	bool called = true;
 	struct kvm_vcpu *vcpu;
@@ -179,7 +179,8 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
 
 	me = get_cpu();
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	idx = srcu_read_lock(&kvm->srcu_vcpus);
+	kvm_for_each_vcpu(vcpu, kvm) {
 		kvm_make_request(req, vcpu);
 		cpu = vcpu->cpu;
 
@@ -190,12 +191,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
 			cpumask_set_cpu(cpu, cpus);
 	}
+	srcu_read_unlock(&kvm->srcu_vcpus, idx);
+
 	if (unlikely(cpus == NULL))
 		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
 	else if (!cpumask_empty(cpus))
 		smp_call_function_many(cpus, ack_flush, NULL, 1);
 	else
 		called = false;
+
 	put_cpu();
 	free_cpumask_var(cpus);
 	return called;
@@ -477,6 +481,8 @@ static struct kvm *kvm_create_vm(void)
 	kvm_init_memslots_id(kvm);
 	if (init_srcu_struct(&kvm->srcu))
 		goto out_err_nosrcu;
+	if (init_srcu_struct(&kvm->srcu_vcpus))
+		goto out_err_nosrcu_vcpus;
 	for (i = 0; i < KVM_NR_BUSES; i++) {
 		kvm->buses[i] = kzalloc(sizeof(struct kvm_io_bus),
 					GFP_KERNEL);
@@ -500,10 +506,13 @@ static struct kvm *kvm_create_vm(void)
 	raw_spin_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	raw_spin_unlock(&kvm_lock);
+	INIT_LIST_HEAD(&kvm->vcpus);
 
 	return kvm;
 
 out_err:
+	cleanup_srcu_struct(&kvm->srcu_vcpus);
+out_err_nosrcu_vcpus:
 	cleanup_srcu_struct(&kvm->srcu);
 out_err_nosrcu:
 	hardware_disable_all();
@@ -587,6 +596,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	kvm_arch_destroy_vm(kvm);
 	kvm_free_physmem(kvm);
 	cleanup_srcu_struct(&kvm->srcu);
+	cleanup_srcu_struct(&kvm->srcu_vcpus);
 	kvm_arch_free_vm(kvm);
 	hardware_disable_all();
 	mmdrop(mm);
@@ -1593,11 +1603,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
-	int yielded = 0;
-	int pass;
-	int i;
-
+	struct task_struct *task = NULL;
+	struct pid *pid;
+	int pass, firststart, lastone, yielded, idx;
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
@@ -1605,15 +1613,26 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			struct task_struct *task = NULL;
-			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
-				i = last_boosted_vcpu;
+	for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
+
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm) {
+			if (kvm->last_boosted_vcpu_id < 0 && !pass) {
+				pass = 1;
+				break;
+			}
+			if (!pass && !firststart &&
+			    vcpu->vcpu_id != kvm->last_boosted_vcpu_id) {
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (!pass && !firststart) {
+				firststart = 1;
+				continue;
+			} else if (pass && !lastone) {
+				if (vcpu->vcpu_id == kvm->last_boosted_vcpu_id)
+					lastone = 1;
+			} else if (pass && lastone)
 				break;
+
 			if (vcpu == me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
@@ -1629,15 +1648,20 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				put_task_struct(task);
 				continue;
 			}
+
 			if (yield_to(task, 1)) {
 				put_task_struct(task);
-				kvm->last_boosted_vcpu = i;
+				mutex_lock(&kvm->lock);
+				kvm->last_boosted_vcpu_id = vcpu->vcpu_id;
+				mutex_unlock(&kvm->lock);
 				yielded = 1;
 				break;
 			}
 			put_task_struct(task);
 		}
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
 	}
+
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
@@ -1673,11 +1697,30 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static void kvm_vcpu_zap(struct kvm_vcpu *vcpu)
+{
+	kvm_arch_vcpu_zap(vcpu);
+}
+
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
+	struct kvm *kvm = vcpu->kvm;
+	filp->private_data = NULL;
+
+	mutex_lock(&kvm->lock);
+	list_del_rcu(&vcpu->list);
+	atomic_dec(&kvm->online_vcpus);
+	mutex_unlock(&kvm->lock);
+	synchronize_srcu_expedited(&kvm->srcu_vcpus);
 
-	kvm_put_kvm(vcpu->kvm);
+	mutex_lock(&kvm->lock);
+	if (kvm->last_boosted_vcpu_id == vcpu->vcpu_id)
+		kvm->last_boosted_vcpu_id = -1;
+	mutex_unlock(&kvm->lock);
+
+	/*vcpu is out of list,drop it safely*/
+	kvm_vcpu_zap(vcpu);
 	return 0;
 }
 
@@ -1699,15 +1742,25 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
 	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
 }
 
+static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
+{
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_arch_vcpu_create(kvm, id);
+	if (IS_ERR(vcpu))
+		return vcpu;
+	INIT_LIST_HEAD(&vcpu->list);
+	return vcpu;
+}
+
 /*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
 static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 {
-	int r;
+	int r, idx;
 	struct kvm_vcpu *vcpu, *v;
 
-	vcpu = kvm_arch_vcpu_create(kvm, id);
+	vcpu = kvm_vcpu_create(kvm, id);
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
@@ -1723,13 +1776,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm_for_each_vcpu(r, v, kvm)
+	idx = srcu_read_lock(&kvm->srcu_vcpus);
+	kvm_for_each_vcpu(v, kvm) {
 		if (v->vcpu_id == id) {
 			r = -EEXIST;
+			srcu_read_unlock(&kvm->srcu_vcpus, idx);
 			goto unlock_vcpu_destroy;
 		}
-
-	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
+	}
+	srcu_read_unlock(&kvm->srcu_vcpus, idx);
 
 	/* Now it's all set up, let userspace reach it */
 	kvm_get_kvm(kvm);
@@ -1739,8 +1794,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
-	smp_wmb();
+	/*Protected by kvm->lock*/
+	list_add_rcu(&vcpu->list, &kvm->vcpus);
 	atomic_inc(&kvm->online_vcpus);
 
 	mutex_unlock(&kvm->lock);
@@ -2635,13 +2690,16 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int idx;
 
 	*val = 0;
 	raw_spin_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		idx = srcu_read_lock(&kvm->srcu_vcpus);
+		kvm_for_each_vcpu(vcpu, kvm)
 			*val += *(u32 *)((void *)vcpu + offset);
+		srcu_read_unlock(&kvm->srcu_vcpus, idx);
+	}
 
 	raw_spin_unlock(&kvm_lock);
 	return 0;
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-27  8:38               ` [PATCH v6] " Liu Ping Fan
@ 2011-12-27 11:22                 ` Takuya Yoshikawa
  2011-12-28  6:54                   ` Liu ping fan
  2011-12-28  9:53                 ` Avi Kivity
  1 sibling, 1 reply; 78+ messages in thread
From: Takuya Yoshikawa @ 2011-12-27 11:22 UTC (permalink / raw)
  To: Liu Ping Fan
  Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti,
	xiaoguangrong.eric, jan.kiszka, Takuya Yoshikawa

(2011/12/27 17:38), Liu Ping Fan wrote:
> From: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
> 
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction before kvm instance, so vcpu MUST
> and CAN be destroyed before kvm's destroy.

I really don't understand why this big change can be justified by only
3 lines.

> 
> Signed-off-by: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
> ---
>   arch/x86/kvm/i8254.c     |   10 +++--
>   arch/x86/kvm/i8259.c     |   17 +++++--
>   arch/x86/kvm/x86.c       |   53 +++++++++++-----------
>   include/linux/kvm_host.h |   20 +++-----
>   virt/kvm/irq_comm.c      |    6 ++-
>   virt/kvm/kvm_main.c      |  110 +++++++++++++++++++++++++++++++++++-----------
>   6 files changed, 140 insertions(+), 76 deletions(-)
> 

You are introducing kvm_arch_vcpu_zap().

Then, apart from the "zap" naming issue I mentioned last time,
what about other architectures than x86?


> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 900c763..b88d418d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -117,6 +117,7 @@ enum {
> 
>   struct kvm_vcpu {
>   	struct kvm *kvm;
> +	struct list_head list;
>   #ifdef CONFIG_PREEMPT_NOTIFIERS
>   	struct preempt_notifier preempt_notifier;
>   #endif
> @@ -251,12 +252,14 @@ struct kvm {
>   	struct mm_struct *mm; /* userspace tied to this vm */
>   	struct kvm_memslots *memslots;
>   	struct srcu_struct srcu;
> +	struct srcu_struct srcu_vcpus;
> +

Another srcu.  This alone is worth explaining in the changelog IMO.

	Takuya

>   #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>   	u32 bsp_vcpu_id;
>   #endif
> -	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> +	struct list_head vcpus;
>   	atomic_t online_vcpus;
> -	int last_boosted_vcpu;
> +	int last_boosted_vcpu_id;
>   	struct list_head vm_list;
>   	struct mutex lock;
>   	struct kvm_io_bus *buses[KVM_NR_BUSES];

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-27 11:22                 ` Takuya Yoshikawa
@ 2011-12-28  6:54                   ` Liu ping fan
  2011-12-28  9:53                     ` Avi Kivity
  2011-12-28 10:29                     ` Takuya Yoshikawa
  0 siblings, 2 replies; 78+ messages in thread
From: Liu ping fan @ 2011-12-28  6:54 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti,
	xiaoguangrong.eric, jan.kiszka, Takuya Yoshikawa

2011/12/27 Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>:
> (2011/12/27 17:38), Liu Ping Fan wrote:
>> From: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu can be destructed only when kvm instance destroyed.
>> Change this to vcpu's destruction before kvm instance, so vcpu MUST
>> and CAN be destroyed before kvm's destroy.
>
> I really don't understand why this big change can be justified by only
> 3 lines.
>
I think just recording what this patch does, not the whole story about
it. Right?
>>
>> Signed-off-by: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
>> ---
>>   arch/x86/kvm/i8254.c     |   10 +++--
>>   arch/x86/kvm/i8259.c     |   17 +++++--
>>   arch/x86/kvm/x86.c       |   53 +++++++++++-----------
>>   include/linux/kvm_host.h |   20 +++-----
>>   virt/kvm/irq_comm.c      |    6 ++-
>>   virt/kvm/kvm_main.c      |  110 +++++++++++++++++++++++++++++++++++-----------
>>   6 files changed, 140 insertions(+), 76 deletions(-)
>>
>
> You are introducing kvm_arch_vcpu_zap().
>
> Then, apart from the "zap" naming issue I mentioned last time,
Yes, I will correct "zap", as you said, its meaning is quite different
from destroy. :-)

> what about other architectures than x86?
>
Have not considered it in detail yet. At first step, I just want to
figure out the whole frame, then, I will push them on other arch.
Maybe you foresee some problem when extending this onto other arch,
please tell me, thanks :-).
>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 900c763..b88d418d 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -117,6 +117,7 @@ enum {
>>
>>   struct kvm_vcpu {
>>       struct kvm *kvm;
>> +     struct list_head list;
>>   #ifdef CONFIG_PREEMPT_NOTIFIERS
>>       struct preempt_notifier preempt_notifier;
>>   #endif
>> @@ -251,12 +252,14 @@ struct kvm {
>>       struct mm_struct *mm; /* userspace tied to this vm */
>>       struct kvm_memslots *memslots;
>>       struct srcu_struct srcu;
>> +     struct srcu_struct srcu_vcpus;
>> +
>
> Another srcu.  This alone is worth explaining in the changelog IMO.
>
Sorry, but why? I think it is just a srcu, and because it has
different aim and want a independent grace period, so not multiplex
kvm->srcu.

thanks and regards,
ping fan

>        Takuya
>
>>   #ifdef CONFIG_KVM_APIC_ARCHITECTURE
>>       u32 bsp_vcpu_id;
>>   #endif
>> -     struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>> +     struct list_head vcpus;
>>       atomic_t online_vcpus;
>> -     int last_boosted_vcpu;
>> +     int last_boosted_vcpu_id;
>>       struct list_head vm_list;
>>       struct mutex lock;
>>       struct kvm_io_bus *buses[KVM_NR_BUSES];

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-28  6:54                   ` Liu ping fan
@ 2011-12-28  9:53                     ` Avi Kivity
  2011-12-29 14:03                       ` Liu ping fan
  2011-12-28 10:29                     ` Takuya Yoshikawa
  1 sibling, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-28  9:53 UTC (permalink / raw)
  To: Liu ping fan
  Cc: Takuya Yoshikawa, kvm, linux-kernel, aliguori, gleb, mtosatti,
	xiaoguangrong.eric, jan.kiszka, Takuya Yoshikawa

On 12/28/2011 08:54 AM, Liu ping fan wrote:
> >>
> >>   struct kvm_vcpu {
> >>       struct kvm *kvm;
> >> +     struct list_head list;
> >>   #ifdef CONFIG_PREEMPT_NOTIFIERS
> >>       struct preempt_notifier preempt_notifier;
> >>   #endif
> >> @@ -251,12 +252,14 @@ struct kvm {
> >>       struct mm_struct *mm; /* userspace tied to this vm */
> >>       struct kvm_memslots *memslots;
> >>       struct srcu_struct srcu;
> >> +     struct srcu_struct srcu_vcpus;
> >> +
> >
> > Another srcu.  This alone is worth explaining in the changelog IMO.
> >
> Sorry, but why? I think it is just a srcu, and because it has
> different aim and want a independent grace period, so not multiplex
> kvm->srcu.

There is Documentation/virtual/kvm/locking.txt for that.

btw, why does it have to be srcu?  Is rcu insufficient?

Why do we want an independent grace period, is hotunplugging a vcpu that
much different from hotunplugging memory?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-27  8:38               ` [PATCH v6] " Liu Ping Fan
  2011-12-27 11:22                 ` Takuya Yoshikawa
@ 2011-12-28  9:53                 ` Avi Kivity
  2011-12-28  9:54                   ` Avi Kivity
  1 sibling, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-28  9:53 UTC (permalink / raw)
  To: Liu Ping Fan
  Cc: kvm, linux-kernel, aliguori, gleb, mtosatti, xiaoguangrong.eric,
	jan.kiszka

On 12/27/2011 10:38 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> Currently, vcpu can be destructed only when kvm instance destroyed.
> Change this to vcpu's destruction before kvm instance, so vcpu MUST
> and CAN be destroyed before kvm's destroy.
>
> Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> ---
>  arch/x86/kvm/i8254.c     |   10 +++--
>  arch/x86/kvm/i8259.c     |   17 +++++--
>  arch/x86/kvm/x86.c       |   53 +++++++++++-----------
>  include/linux/kvm_host.h |   20 +++-----
>  virt/kvm/irq_comm.c      |    6 ++-
>  virt/kvm/kvm_main.c      |  110 +++++++++++++++++++++++++++++++++++-----------
>

Documentation/virtual/kvm/api.txt

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-28  9:53                 ` Avi Kivity
@ 2011-12-28  9:54                   ` Avi Kivity
  2011-12-28 10:19                     ` Takuya Yoshikawa
  0 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-28  9:54 UTC (permalink / raw)
  To: Liu Ping Fan
  Cc: kvm, linux-kernel, aliguori, gleb, mtosatti, xiaoguangrong.eric,
	jan.kiszka

On 12/28/2011 11:53 AM, Avi Kivity wrote:
> On 12/27/2011 10:38 AM, Liu Ping Fan wrote:
> > From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >
> > Currently, vcpu can be destructed only when kvm instance destroyed.
> > Change this to vcpu's destruction before kvm instance, so vcpu MUST
> > and CAN be destroyed before kvm's destroy.
> >
> > Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> > ---
> >  arch/x86/kvm/i8254.c     |   10 +++--
> >  arch/x86/kvm/i8259.c     |   17 +++++--
> >  arch/x86/kvm/x86.c       |   53 +++++++++++-----------
> >  include/linux/kvm_host.h |   20 +++-----
> >  virt/kvm/irq_comm.c      |    6 ++-
> >  virt/kvm/kvm_main.c      |  110 +++++++++++++++++++++++++++++++++++-----------
> >
>
> Documentation/virtual/kvm/api.txt
>

Oops, that's only needed when the unplug API is introduced.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-28  9:54                   ` Avi Kivity
@ 2011-12-28 10:19                     ` Takuya Yoshikawa
  2011-12-28 10:28                       ` Avi Kivity
  0 siblings, 1 reply; 78+ messages in thread
From: Takuya Yoshikawa @ 2011-12-28 10:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Liu Ping Fan, kvm, linux-kernel, aliguori, gleb, mtosatti,
	xiaoguangrong.eric, jan.kiszka, Takuya Yoshikawa

(2011/12/28 18:54), Avi Kivity wrote:
> On 12/28/2011 11:53 AM, Avi Kivity wrote:
>> On 12/27/2011 10:38 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
>>>
>>> Currently, vcpu can be destructed only when kvm instance destroyed.
>>> Change this to vcpu's destruction before kvm instance, so vcpu MUST
>>> and CAN be destroyed before kvm's destroy.
>>>
>>> Signed-off-by: Liu Ping Fan<pingfank@linux.vnet.ibm.com>
>>> ---
>>>   arch/x86/kvm/i8254.c     |   10 +++--
>>>   arch/x86/kvm/i8259.c     |   17 +++++--
>>>   arch/x86/kvm/x86.c       |   53 +++++++++++-----------
>>>   include/linux/kvm_host.h |   20 +++-----
>>>   virt/kvm/irq_comm.c      |    6 ++-
>>>   virt/kvm/kvm_main.c      |  110 +++++++++++++++++++++++++++++++++++-----------
>>>
>>
>> Documentation/virtual/kvm/api.txt
>>
>
> Oops, that's only needed when the unplug API is introduced.
>

I think it is OK to to add such an API later on, but I really want
the author to write the plan in the changelog.

Otherwise people not belonging to Red Hat or IBM cannot know what
this commit is aiming at.

I am not objecting to this patch itself, but the way this kind of change
is being introduced seems not be in a good manner.

	Takuya

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-28 10:19                     ` Takuya Yoshikawa
@ 2011-12-28 10:28                       ` Avi Kivity
  0 siblings, 0 replies; 78+ messages in thread
From: Avi Kivity @ 2011-12-28 10:28 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Liu Ping Fan, kvm, linux-kernel, aliguori, gleb, mtosatti,
	xiaoguangrong.eric, jan.kiszka, Takuya Yoshikawa

On 12/28/2011 12:19 PM, Takuya Yoshikawa wrote:
>> Oops, that's only needed when the unplug API is introduced.
>>
>
>
> I think it is OK to to add such an API later on, but I really want
> the author to write the plan in the changelog.

It was in fact in the beginning of the thread. 

> I am not objecting to this patch itself, but the way this kind of change
> is being introduced seems not be in a good manner.

It should be part of a patch series that adds the API, otherwise it will
never be tested.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-28  6:54                   ` Liu ping fan
  2011-12-28  9:53                     ` Avi Kivity
@ 2011-12-28 10:29                     ` Takuya Yoshikawa
  1 sibling, 0 replies; 78+ messages in thread
From: Takuya Yoshikawa @ 2011-12-28 10:29 UTC (permalink / raw)
  To: Liu ping fan
  Cc: kvm, linux-kernel, avi, aliguori, gleb, mtosatti,
	xiaoguangrong.eric, jan.kiszka, Takuya Yoshikawa

(2011/12/28 15:54), Liu ping fan wrote:

>> You are introducing kvm_arch_vcpu_zap().
>>
>> Then, apart from the "zap" naming issue I mentioned last time,
> Yes, I will correct "zap", as you said, its meaning is quite different
> from destroy. :-)
>
>> what about other architectures than x86?
>>
> Have not considered it in detail yet. At first step, I just want to
> figure out the whole frame, then, I will push them on other arch.
> Maybe you foresee some problem when extending this onto other arch,
> please tell me, thanks :-).

I do not mind if those are supported or not now.

But if not, you should write what is currently supported and what should
be done in the future.

Of course, you should at least make sure that other architectures will
not be broken by your patch.

>>
>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>> index 900c763..b88d418d 100644
>>> --- a/include/linux/kvm_host.h
>>> +++ b/include/linux/kvm_host.h
>>> @@ -117,6 +117,7 @@ enum {
>>>
>>>    struct kvm_vcpu {
>>>        struct kvm *kvm;
>>> +     struct list_head list;
>>>    #ifdef CONFIG_PREEMPT_NOTIFIERS
>>>        struct preempt_notifier preempt_notifier;
>>>    #endif
>>> @@ -251,12 +252,14 @@ struct kvm {
>>>        struct mm_struct *mm; /* userspace tied to this vm */
>>>        struct kvm_memslots *memslots;
>>>        struct srcu_struct srcu;
>>> +     struct srcu_struct srcu_vcpus;
>>> +
>>
>> Another srcu.  This alone is worth explaining in the changelog IMO.
>>
> Sorry, but why? I think it is just a srcu, and because it has
> different aim and want a independent grace period, so not multiplex
> kvm->srcu.

I cannot say what MUST be explained in the changelog.
But many well known maintainers are telling the importance of changelog.

It is up to you, and KVM maintainers.

	Takuya

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-28  9:53                     ` Avi Kivity
@ 2011-12-29 14:03                       ` Liu ping fan
  2011-12-29 14:31                         ` Avi Kivity
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2011-12-29 14:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Takuya Yoshikawa, kvm, linux-kernel, aliguori, gleb, mtosatti,
	xiaoguangrong.eric, jan.kiszka, Takuya Yoshikawa

On Wed, Dec 28, 2011 at 5:53 PM, Avi Kivity <avi@redhat.com> wrote:
> On 12/28/2011 08:54 AM, Liu ping fan wrote:
>> >>
>> >>   struct kvm_vcpu {
>> >>       struct kvm *kvm;
>> >> +     struct list_head list;
>> >>   #ifdef CONFIG_PREEMPT_NOTIFIERS
>> >>       struct preempt_notifier preempt_notifier;
>> >>   #endif
>> >> @@ -251,12 +252,14 @@ struct kvm {
>> >>       struct mm_struct *mm; /* userspace tied to this vm */
>> >>       struct kvm_memslots *memslots;
>> >>       struct srcu_struct srcu;
>> >> +     struct srcu_struct srcu_vcpus;
>> >> +
>> >
>> > Another srcu.  This alone is worth explaining in the changelog IMO.
>> >
>> Sorry, but why? I think it is just a srcu, and because it has
>> different aim and want a independent grace period, so not multiplex
>> kvm->srcu.
>
> There is Documentation/virtual/kvm/locking.txt for that.
>
> btw, why does it have to be srcu?  Is rcu insufficient?
>
Just to survive from "if (yield_to(task, 1)) in  kvm_vcpu_on_spin()",

> Why do we want an independent grace period, is hotunplugging a vcpu that
> much different from hotunplugging memory?
>
I thought that if less readers on the same srcu lock, then
synchronize_srcu_expedited() may success to return more quickly.

Thanks and regards,
ping fan
> --
> error compiling committee.c: too many arguments to function
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-29 14:03                       ` Liu ping fan
@ 2011-12-29 14:31                         ` Avi Kivity
  2012-01-05  9:35                           ` Liu ping fan
  0 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2011-12-29 14:31 UTC (permalink / raw)
  To: Liu ping fan
  Cc: Takuya Yoshikawa, kvm, linux-kernel, aliguori, gleb, mtosatti,
	xiaoguangrong.eric, jan.kiszka, Takuya Yoshikawa

On 12/29/2011 04:03 PM, Liu ping fan wrote:
> > Why do we want an independent grace period, is hotunplugging a vcpu that
> > much different from hotunplugging memory?
> >
> I thought that if less readers on the same srcu lock, then
> synchronize_srcu_expedited() may success to return more quickly.

It would be good to measure it, otherwise it's premature optimization.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v6] kvm: make vcpu life cycle separated from kvm instance
  2011-12-29 14:31                         ` Avi Kivity
@ 2012-01-05  9:35                           ` Liu ping fan
  0 siblings, 0 replies; 78+ messages in thread
From: Liu ping fan @ 2012-01-05  9:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-kernel, aliguori, gleb, mtosatti, xiaoguangrong.eric,
	jan.kiszka, Takuya Yoshikawa

On Thu, Dec 29, 2011 at 10:31 PM, Avi Kivity <avi@redhat.com> wrote:
> On 12/29/2011 04:03 PM, Liu ping fan wrote:
>> > Why do we want an independent grace period, is hotunplugging a vcpu that
>> > much different from hotunplugging memory?
>> >
>> I thought that if less readers on the same srcu lock, then
>> synchronize_srcu_expedited() may success to return more quickly.
>
> It would be good to measure it, otherwise it's premature optimization.
>
Yes, after using kprobetrace to measure it, I found it was  premature
optimization. So I will resort to the kvm->srcu, instead of creating a
new one in next version.

Thanks and regards
ping fan
> --
> error compiling committee.c: too many arguments to function
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH v7] kvm: make vcpu life cycle separated from kvm instance
  2011-12-17  3:19             ` [PATCH v5] " Liu Ping Fan
  2011-12-26 11:09               ` Gleb Natapov
  2011-12-27  8:38               ` [PATCH v6] " Liu Ping Fan
@ 2012-01-07  2:55               ` Liu Ping Fan
  2012-01-12 12:37                 ` Avi Kivity
  2 siblings, 1 reply; 78+ messages in thread
From: Liu Ping Fan @ 2012-01-07  2:55 UTC (permalink / raw)
  To: kvm
  Cc: linux-kernel, avi, aliguori, gleb, mtosatti, xiaoguangrong.eric,
	jan.kiszka, yoshikawa.takuya

From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>

Currently, vcpu will be destructed only after kvm instance is
destroyed. This result to vcpu keep idle in kernel, but can not
be freed when it is unplugged in guest.

Change this to vcpu's destruction before kvm instance, so vcpu MUST
and CAN be destroyed before kvm instance. By this way, we can remove
vcpu when guest does not need it any longer.

TODO: push changes to other archs besides x86.

Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
---
Changelog: v6->v7
-Remove kvm->srcu_vcpus, and resort to kvm->srcu for it is helpless 
 to perfermence after measurement.
-Rename kvm_vcpu_zap to kvm_vcpu_destruct and so on.

 arch/x86/kvm/i8254.c     |   10 +++--
 arch/x86/kvm/i8259.c     |   17 +++++--
 arch/x86/kvm/x86.c       |   53 +++++++++++------------
 include/linux/kvm_host.h |   19 +++-----
 virt/kvm/irq_comm.c      |    6 ++-
 virt/kvm/kvm_main.c      |  105 ++++++++++++++++++++++++++++++++++-----------
 6 files changed, 134 insertions(+), 76 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index d68f99d..a737fb6 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -289,9 +289,8 @@ static void pit_do_work(struct work_struct *work)
 	struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
 	struct kvm *kvm = pit->kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
 	struct kvm_kpit_state *ps = &pit->pit_state;
-	int inject = 0;
+	int idx, inject = 0;
 
 	/* Try to inject pending interrupts when
 	 * last one has been acked.
@@ -315,9 +314,12 @@ static void pit_do_work(struct work_struct *work)
 		 * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 		 * VCPU0, and only if its LVT0 is in EXTINT mode.
 		 */
-		if (kvm->arch.vapics_in_nmi_mode > 0)
-			kvm_for_each_vcpu(i, vcpu, kvm)
+		if (kvm->arch.vapics_in_nmi_mode > 0) {
+			idx = srcu_read_lock(&kvm->srcu);
+			kvm_for_each_vcpu(vcpu, kvm)
 				kvm_apic_nmi_wd_deliver(vcpu);
+			srcu_read_unlock(&kvm->srcu, idx);
+		}
 	}
 }
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index b6a7353..fc0fd76 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -50,25 +50,29 @@ static void pic_unlock(struct kvm_pic *s)
 {
 	bool wakeup = s->wakeup_needed;
 	struct kvm_vcpu *vcpu, *found = NULL;
-	int i;
+	struct kvm *kvm = s->kvm;
+	int idx;
 
 	s->wakeup_needed = false;
 
 	spin_unlock(&s->lock);
 
 	if (wakeup) {
-		kvm_for_each_vcpu(i, vcpu, s->kvm) {
+		idx = srcu_read_lock(&kvm->srcu);
+		kvm_for_each_vcpu(vcpu, kvm)
 			if (kvm_apic_accept_pic_intr(vcpu)) {
 				found = vcpu;
 				break;
 			}
-		}
 
-		if (!found)
+		if (!found) {
+			srcu_read_unlock(&kvm->srcu, idx);
 			return;
+		}
 
 		kvm_make_request(KVM_REQ_EVENT, found);
 		kvm_vcpu_kick(found);
+		srcu_read_unlock(&kvm->srcu, idx);
 	}
 }
 
@@ -265,6 +269,7 @@ void kvm_pic_reset(struct kvm_kpic_state *s)
 	int irq, i;
 	struct kvm_vcpu *vcpu;
 	u8 irr = s->irr, isr = s->imr;
+	struct kvm *kvm = s->pics_state->kvm;
 	bool found = false;
 
 	s->last_irr = 0;
@@ -282,11 +287,13 @@ void kvm_pic_reset(struct kvm_kpic_state *s)
 	s->special_fully_nested_mode = 0;
 	s->init4 = 0;
 
-	kvm_for_each_vcpu(i, vcpu, s->pics_state->kvm)
+	i = srcu_read_lock(&kvm->srcu);
+	kvm_for_each_vcpu(vcpu, kvm)
 		if (kvm_apic_accept_pic_intr(vcpu)) {
 			found = true;
 			break;
 		}
+	srcu_read_unlock(&kvm->srcu, i);
 
 
 	if (!found)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1171def..c14892f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1786,14 +1786,20 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 static int get_msr_hyperv(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 {
 	u64 data = 0;
+	int idx;
 
 	switch (msr) {
 	case HV_X64_MSR_VP_INDEX: {
-		int r;
+		int r = 0;
 		struct kvm_vcpu *v;
-		kvm_for_each_vcpu(r, v, vcpu->kvm)
+		struct kvm *kvm = vcpu->kvm;
+		idx = srcu_read_lock(&kvm->srcu);
+		kvm_for_each_vcpu(v, vcpu->kvm) {
 			if (v == vcpu)
 				data = r;
+			r++;
+		}
+		srcu_read_unlock(&kvm->srcu, idx);
 		break;
 	}
 	case HV_X64_MSR_EOI:
@@ -4538,7 +4544,7 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 	struct cpufreq_freqs *freq = data;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i, send_ipi = 0;
+	int idx, send_ipi = 0;
 
 	/*
 	 * We allow guests to temporarily run on slowing clocks,
@@ -4588,13 +4594,16 @@ static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long va
 
 	raw_spin_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
+		idx = srcu_read_lock(&kvm->srcu);
+		kvm_for_each_vcpu(vcpu, kvm) {
 			if (vcpu->cpu != freq->cpu)
 				continue;
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 			if (vcpu->cpu != smp_processor_id())
 				send_ipi = 1;
 		}
+		srcu_read_unlock(&kvm->srcu, idx);
+
 	}
 	raw_spin_unlock(&kvm_lock);
 
@@ -5881,13 +5890,17 @@ int kvm_arch_hardware_enable(void *garbage)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int idx;
 
 	kvm_shared_msr_cpu_online();
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		idx = srcu_read_lock(&kvm->srcu);
+		kvm_for_each_vcpu(vcpu, kvm) {
 			if (vcpu->cpu == smp_processor_id())
 				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		}
+		srcu_read_unlock(&kvm->srcu, idx);
+	}
 	return kvm_x86_ops->hardware_enable(garbage);
 }
 
@@ -6006,27 +6019,14 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
 	vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
+void kvm_arch_vcpu_destruct(struct kvm_vcpu *vcpu)
 {
-	unsigned int i;
-	struct kvm_vcpu *vcpu;
-
-	/*
-	 * Unpin any mmu pages first.
-	 */
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_unload_vcpu_mmu(vcpu);
-	}
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvm_arch_vcpu_free(vcpu);
-
-	mutex_lock(&kvm->lock);
-	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
-		kvm->vcpus[i] = NULL;
+	struct kvm *kvm = vcpu->kvm;
 
-	atomic_set(&kvm->online_vcpus, 0);
-	mutex_unlock(&kvm->lock);
+	kvm_clear_async_pf_completion_queue(vcpu);
+	kvm_unload_vcpu_mmu(vcpu);
+	kvm_arch_vcpu_free(vcpu);
+	kvm_put_kvm(kvm);
 }
 
 void kvm_arch_sync_events(struct kvm *kvm)
@@ -6040,7 +6040,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_iommu_unmap_guest(kvm);
 	kfree(kvm->arch.vpic);
 	kfree(kvm->arch.vioapic);
-	kvm_free_vcpus(kvm);
 	if (kvm->arch.apic_access_page)
 		put_page(kvm->arch.apic_access_page);
 	if (kvm->arch.ept_identity_pagetable)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 900c763..8c93de9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -117,6 +117,7 @@ enum {
 
 struct kvm_vcpu {
 	struct kvm *kvm;
+	struct list_head list;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	struct preempt_notifier preempt_notifier;
 #endif
@@ -251,12 +252,13 @@ struct kvm {
 	struct mm_struct *mm; /* userspace tied to this vm */
 	struct kvm_memslots *memslots;
 	struct srcu_struct srcu;
+
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
 	u32 bsp_vcpu_id;
 #endif
-	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
+	struct list_head vcpus;
 	atomic_t online_vcpus;
-	int last_boosted_vcpu;
+	int last_boosted_vcpu_id;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
@@ -303,17 +305,10 @@ struct kvm {
 #define kvm_printf(kvm, fmt ...) printk(KERN_DEBUG fmt)
 #define vcpu_printf(vcpu, fmt...) kvm_printf(vcpu->kvm, fmt)
 
-static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
-{
-	smp_rmb();
-	return kvm->vcpus[i];
-}
+void kvm_arch_vcpu_destruct(struct kvm_vcpu *vcpu);
 
-#define kvm_for_each_vcpu(idx, vcpup, kvm) \
-	for (idx = 0; \
-	     idx < atomic_read(&kvm->online_vcpus) && \
-	     (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
-	     idx++)
+#define kvm_for_each_vcpu(vcpu, kvm) \
+	list_for_each_entry_rcu(vcpu, &kvm->vcpus, list)
 
 #define kvm_for_each_memslot(memslot, slots)	\
 	for (memslot = &slots->memslots[0];	\
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 9f614b4..15138a8 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -81,14 +81,15 @@ inline static bool kvm_is_dm_lowest_prio(struct kvm_lapic_irq *irq)
 int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 		struct kvm_lapic_irq *irq)
 {
-	int i, r = -1;
+	int idx, r = -1;
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 			kvm_is_dm_lowest_prio(irq))
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
 
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	idx = srcu_read_lock(&kvm->srcu);
+	kvm_for_each_vcpu(vcpu, kvm) {
 		if (!kvm_apic_present(vcpu))
 			continue;
 
@@ -111,6 +112,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	if (lowest)
 		r = kvm_apic_set_irq(lowest, irq);
 
+	srcu_read_unlock(&kvm->srcu, idx);
 	return r;
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7287bf5..db3782b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -171,7 +171,7 @@ static void ack_flush(void *_completed)
 
 static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 {
-	int i, cpu, me;
+	int cpu, me, idx;
 	cpumask_var_t cpus;
 	bool called = true;
 	struct kvm_vcpu *vcpu;
@@ -179,7 +179,8 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 	zalloc_cpumask_var(&cpus, GFP_ATOMIC);
 
 	me = get_cpu();
-	kvm_for_each_vcpu(i, vcpu, kvm) {
+	idx = srcu_read_lock(&kvm->srcu);
+	kvm_for_each_vcpu(vcpu, kvm) {
 		kvm_make_request(req, vcpu);
 		cpu = vcpu->cpu;
 
@@ -190,12 +191,15 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 		      kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
 			cpumask_set_cpu(cpu, cpus);
 	}
+	srcu_read_unlock(&kvm->srcu, idx);
+
 	if (unlikely(cpus == NULL))
 		smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
 	else if (!cpumask_empty(cpus))
 		smp_call_function_many(cpus, ack_flush, NULL, 1);
 	else
 		called = false;
+
 	put_cpu();
 	free_cpumask_var(cpus);
 	return called;
@@ -500,6 +504,7 @@ static struct kvm *kvm_create_vm(void)
 	raw_spin_lock(&kvm_lock);
 	list_add(&kvm->vm_list, &vm_list);
 	raw_spin_unlock(&kvm_lock);
+	INIT_LIST_HEAD(&kvm->vcpus);
 
 	return kvm;
 
@@ -1593,11 +1598,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
-	int yielded = 0;
-	int pass;
-	int i;
-
+	struct task_struct *task = NULL;
+	struct pid *pid;
+	int pass, firststart, lastone, yielded, idx;
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
@@ -1605,15 +1608,26 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			struct task_struct *task = NULL;
-			struct pid *pid;
-			if (!pass && i < last_boosted_vcpu) {
-				i = last_boosted_vcpu;
+	for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
+
+		idx = srcu_read_lock(&kvm->srcu);
+		kvm_for_each_vcpu(vcpu, kvm) {
+			if (kvm->last_boosted_vcpu_id < 0 && !pass) {
+				pass = 1;
+				break;
+			}
+			if (!pass && !firststart &&
+			    vcpu->vcpu_id != kvm->last_boosted_vcpu_id) {
+				continue;
+			} else if (!pass && !firststart) {
+				firststart = 1;
 				continue;
-			} else if (pass && i > last_boosted_vcpu)
+			} else if (pass && !lastone) {
+				if (vcpu->vcpu_id == kvm->last_boosted_vcpu_id)
+					lastone = 1;
+			} else if (pass && lastone)
 				break;
+
 			if (vcpu == me)
 				continue;
 			if (waitqueue_active(&vcpu->wq))
@@ -1629,15 +1643,20 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				put_task_struct(task);
 				continue;
 			}
+
 			if (yield_to(task, 1)) {
 				put_task_struct(task);
-				kvm->last_boosted_vcpu = i;
+				mutex_lock(&kvm->lock);
+				kvm->last_boosted_vcpu_id = vcpu->vcpu_id;
+				mutex_unlock(&kvm->lock);
 				yielded = 1;
 				break;
 			}
 			put_task_struct(task);
 		}
+		srcu_read_unlock(&kvm->srcu, idx);
 	}
+
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 
@@ -1673,11 +1692,30 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static void kvm_vcpu_destruct(struct kvm_vcpu *vcpu)
+{
+	kvm_arch_vcpu_destruct(vcpu);
+}
+
 static int kvm_vcpu_release(struct inode *inode, struct file *filp)
 {
 	struct kvm_vcpu *vcpu = filp->private_data;
+	struct kvm *kvm = vcpu->kvm;
+	filp->private_data = NULL;
+
+	mutex_lock(&kvm->lock);
+	list_del_rcu(&vcpu->list);
+	atomic_dec(&kvm->online_vcpus);
+	mutex_unlock(&kvm->lock);
+	synchronize_srcu_expedited(&kvm->srcu);
 
-	kvm_put_kvm(vcpu->kvm);
+	mutex_lock(&kvm->lock);
+	if (kvm->last_boosted_vcpu_id == vcpu->vcpu_id)
+		kvm->last_boosted_vcpu_id = -1;
+	mutex_unlock(&kvm->lock);
+
+	/*vcpu is out of list,drop it safely*/
+	kvm_vcpu_destruct(vcpu);
 	return 0;
 }
 
@@ -1699,15 +1737,25 @@ static int create_vcpu_fd(struct kvm_vcpu *vcpu)
 	return anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, O_RDWR);
 }
 
+static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
+{
+	struct kvm_vcpu *vcpu;
+	vcpu = kvm_arch_vcpu_create(kvm, id);
+	if (IS_ERR(vcpu))
+		return vcpu;
+	INIT_LIST_HEAD(&vcpu->list);
+	return vcpu;
+}
+
 /*
  * Creates some virtual cpus.  Good luck creating more than one.
  */
 static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 {
-	int r;
+	int r, idx;
 	struct kvm_vcpu *vcpu, *v;
 
-	vcpu = kvm_arch_vcpu_create(kvm, id);
+	vcpu = kvm_vcpu_create(kvm, id);
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
@@ -1723,13 +1771,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm_for_each_vcpu(r, v, kvm)
+	idx = srcu_read_lock(&kvm->srcu);
+	kvm_for_each_vcpu(v, kvm) {
 		if (v->vcpu_id == id) {
 			r = -EEXIST;
+			srcu_read_unlock(&kvm->srcu, idx);
 			goto unlock_vcpu_destroy;
 		}
-
-	BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);
+	}
+	srcu_read_unlock(&kvm->srcu, idx);
 
 	/* Now it's all set up, let userspace reach it */
 	kvm_get_kvm(kvm);
@@ -1739,8 +1789,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		goto unlock_vcpu_destroy;
 	}
 
-	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
-	smp_wmb();
+	/*Protected by kvm->lock*/
+	list_add_rcu(&vcpu->list, &kvm->vcpus);
 	atomic_inc(&kvm->online_vcpus);
 
 	mutex_unlock(&kvm->lock);
@@ -2635,13 +2685,16 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int i;
+	int idx;
 
 	*val = 0;
 	raw_spin_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		idx = srcu_read_lock(&kvm->srcu);
+		kvm_for_each_vcpu(vcpu, kvm)
 			*val += *(u32 *)((void *)vcpu + offset);
+		srcu_read_unlock(&kvm->srcu, idx);
+	}
 
 	raw_spin_unlock(&kvm_lock);
 	return 0;
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v7] kvm: make vcpu life cycle separated from kvm instance
  2012-01-07  2:55               ` [PATCH v7] " Liu Ping Fan
@ 2012-01-12 12:37                 ` Avi Kivity
  2012-01-15 13:17                   ` Liu ping fan
  0 siblings, 1 reply; 78+ messages in thread
From: Avi Kivity @ 2012-01-12 12:37 UTC (permalink / raw)
  To: Liu Ping Fan
  Cc: kvm, linux-kernel, aliguori, gleb, mtosatti, xiaoguangrong.eric,
	jan.kiszka, yoshikawa.takuya, Rik van Riel

On 01/07/2012 04:55 AM, Liu Ping Fan wrote:
> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>
> Currently, vcpu will be destructed only after kvm instance is
> destroyed. This result to vcpu keep idle in kernel, but can not
> be freed when it is unplugged in guest.
>
> Change this to vcpu's destruction before kvm instance, so vcpu MUST

Must?

> and CAN be destroyed before kvm instance. By this way, we can remove
> vcpu when guest does not need it any longer.
>
> TODO: push changes to other archs besides x86.
>
> -Rename kvm_vcpu_zap to kvm_vcpu_destruct and so on.

kvm_vcpu_destroy.

>  
>  struct kvm_vcpu {
>  	struct kvm *kvm;
> +	struct list_head list;

vcpu_list_link, so it's clear this is not a head but a link, and so we
know which list it belongs to.

> -	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> +	struct list_head vcpus;

This has the potential for a slight performance regression by bouncing
an extra cache line, but it's acceptable IMO.  We can always introduce
an apic ID -> vcpu hash table which improves things all around.

> |
> @@ -1593,11 +1598,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  {
>  	struct kvm *kvm = me->kvm;
>  	struct kvm_vcpu *vcpu;
> -	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> -	int yielded = 0;
> -	int pass;
> -	int i;
> -
> +	struct task_struct *task = NULL;
> +	struct pid *pid;
> +	int pass, firststart, lastone, yielded, idx;

Avoid unrelated changes please.

> @@ -1605,15 +1608,26 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	 * VCPU is holding the lock that we need and will release it.
>  	 * We approximate round-robin by starting at the last boosted VCPU.
>  	 */
> -	for (pass = 0; pass < 2 && !yielded; pass++) {
> -		kvm_for_each_vcpu(i, vcpu, kvm) {
> -			struct task_struct *task = NULL;
> -			struct pid *pid;
> -			if (!pass && i < last_boosted_vcpu) {
> -				i = last_boosted_vcpu;
> +	for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
> +
> +		idx = srcu_read_lock(&kvm->srcu);

Can move the lock to the top level.

> +		kvm_for_each_vcpu(vcpu, kvm) {
> +			if (kvm->last_boosted_vcpu_id < 0 && !pass) {
> +				pass = 1;
> +				break;
> +			}
> +			if (!pass && !firststart &&
> +			    vcpu->vcpu_id != kvm->last_boosted_vcpu_id) {
> +				continue;
> +			} else if (!pass && !firststart) {
> +				firststart = 1;
>  				continue;
> -			} else if (pass && i > last_boosted_vcpu)
> +			} else if (pass && !lastone) {
> +				if (vcpu->vcpu_id == kvm->last_boosted_vcpu_id)
> +					lastone = 1;
> +			} else if (pass && lastone)
>  				break;
> +

Seems like a large change.  Is this because the vcpu list is unordered? 
Maybe it's better to order it.

Rik?

>  			if (vcpu == me)
>  				continue;
>  			if (waitqueue_active(&vcpu->wq))
> @@ -1629,15 +1643,20 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  				put_task_struct(task);
>  				continue;
>  			}
> +
>  			if (yield_to(task, 1)) {
>  				put_task_struct(task);
> -				kvm->last_boosted_vcpu = i;
> +				mutex_lock(&kvm->lock);
> +				kvm->last_boosted_vcpu_id = vcpu->vcpu_id;
> +				mutex_unlock(&kvm->lock);

Why take the mutex?

> @@ -1673,11 +1692,30 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>  	return 0;
>  }
>  
> +static void kvm_vcpu_destruct(struct kvm_vcpu *vcpu)
> +{
> +	kvm_arch_vcpu_destruct(vcpu);
> +}
> +
>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>  {
>  	struct kvm_vcpu *vcpu = filp->private_data;
> +	struct kvm *kvm = vcpu->kvm;
> +	filp->private_data = NULL;
> +
> +	mutex_lock(&kvm->lock);
> +	list_del_rcu(&vcpu->list);
> +	atomic_dec(&kvm->online_vcpus);
> +	mutex_unlock(&kvm->lock);
> +	synchronize_srcu_expedited(&kvm->srcu);

Why _expedited?

Even better would be call_srcu() but it doesn't exist.

I think we can actually use regular rcu.  The only user that blocks is
kvm_vcpu_on_spin(), yes? so we can convert the vcpu to a task using
get_pid_task(), then, outside the rcu lock, call yield_to().


>  
> -	kvm_put_kvm(vcpu->kvm);
> +	mutex_lock(&kvm->lock);
> +	if (kvm->last_boosted_vcpu_id == vcpu->vcpu_id)
> +		kvm->last_boosted_vcpu_id = -1;
> +	mutex_unlock(&kvm->lock);
> +
> +	/*vcpu is out of list,drop it safely*/
> +	kvm_vcpu_destruct(vcpu);

Can all kvm_arch_vcpu_destroy() directly.

> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
> +{
> +	struct kvm_vcpu *vcpu;
> +	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	if (IS_ERR(vcpu))
> +		return vcpu;
> +	INIT_LIST_HEAD(&vcpu->list);

Really needed?

> +	return vcpu;
> +}

Just fold this into the caller.

> +
>  /*
>   * Creates some virtual cpus.  Good luck creating more than one.
>   */
>  static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  {
> -	int r;
> +	int r, idx;
>  	struct kvm_vcpu *vcpu, *v;
>  
> -	vcpu = kvm_arch_vcpu_create(kvm, id);
> +	vcpu = kvm_vcpu_create(kvm, id);
>  	if (IS_ERR(vcpu))
>  		return PTR_ERR(vcpu);
>  
> @@ -1723,13 +1771,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>  		goto unlock_vcpu_destroy;
>  	}
>  
> -	kvm_for_each_vcpu(r, v, kvm)
> +	idx = srcu_read_lock(&kvm->srcu);
> +	kvm_for_each_vcpu(v, kvm) {
>  		if (v->vcpu_id == id) {
>  			r = -EEXIST;
> +			srcu_read_unlock(&kvm->srcu, idx);

Put that in the error path please (add a new label if needed).

>  			goto unlock_vcpu_destroy;

>  
> -	kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
> -	smp_wmb();
> +	/*Protected by kvm->lock*/

Spaces.

> +	list_add_rcu(&vcpu->list, &kvm->vcpus);
>  	atomic_inc(&kvm->online_vcpus);
 


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v7] kvm: make vcpu life cycle separated from kvm instance
  2012-01-12 12:37                 ` Avi Kivity
@ 2012-01-15 13:17                   ` Liu ping fan
  2012-01-15 13:37                     ` Avi Kivity
  0 siblings, 1 reply; 78+ messages in thread
From: Liu ping fan @ 2012-01-15 13:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: kvm, linux-kernel, aliguori, gleb, mtosatti, xiaoguangrong.eric,
	jan.kiszka, yoshikawa.takuya, Rik van Riel

On Thu, Jan 12, 2012 at 8:37 PM, Avi Kivity <avi@redhat.com> wrote:
> On 01/07/2012 04:55 AM, Liu Ping Fan wrote:
>> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
>>
>> Currently, vcpu will be destructed only after kvm instance is
>> destroyed. This result to vcpu keep idle in kernel, but can not
>> be freed when it is unplugged in guest.
>>
>> Change this to vcpu's destruction before kvm instance, so vcpu MUST
>
> Must?
>
Yes, in kvm_arch_vcpu_destruct-->kvm_put_kvm(kvm); so after all vcpu
destroyed, then can kvm instance

>> and CAN be destroyed before kvm instance. By this way, we can remove
>> vcpu when guest does not need it any longer.
>>
>> TODO: push changes to other archs besides x86.
>>
>> -Rename kvm_vcpu_zap to kvm_vcpu_destruct and so on.
>
> kvm_vcpu_destroy.
>
The name "kvm_arch_vcpu_destroy" is already occupied in different arch.
So change
  kvm_vcpu_zap -> kvm_vcpu_destruct
  kvm_vcpu_arch_zap -> kvm_vcpu_arch_destruct

>>
>>  struct kvm_vcpu {
>>       struct kvm *kvm;
>> +     struct list_head list;
>
> vcpu_list_link, so it's clear this is not a head but a link, and so we
> know which list it belongs to.
>
OK
>> -     struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>> +     struct list_head vcpus;
>
> This has the potential for a slight performance regression by bouncing
> an extra cache line, but it's acceptable IMO.  We can always introduce

Sorry, not clear about this scene, do you mean that the changing of
vcpu link list will cause the invalid of cache between SMP? But the
link list is not changed often.
> an apic ID -> vcpu hash table which improves things all around.
>
>> |
>> @@ -1593,11 +1598,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>  {
>>       struct kvm *kvm = me->kvm;
>>       struct kvm_vcpu *vcpu;
>> -     int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>> -     int yielded = 0;
>> -     int pass;
>> -     int i;
>> -
>> +     struct task_struct *task = NULL;
>> +     struct pid *pid;
>> +     int pass, firststart, lastone, yielded, idx;
>
> Avoid unrelated changes please.
>
OK
>> @@ -1605,15 +1608,26 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>        * VCPU is holding the lock that we need and will release it.
>>        * We approximate round-robin by starting at the last boosted VCPU.
>>        */
>> -     for (pass = 0; pass < 2 && !yielded; pass++) {
>> -             kvm_for_each_vcpu(i, vcpu, kvm) {
>> -                     struct task_struct *task = NULL;
>> -                     struct pid *pid;
>> -                     if (!pass && i < last_boosted_vcpu) {
>> -                             i = last_boosted_vcpu;
>> +     for (pass = 0, firststart = 0; pass < 2 && !yielded; pass++) {
>> +
>> +             idx = srcu_read_lock(&kvm->srcu);
>
> Can move the lock to the top level.
>
OK
>> +             kvm_for_each_vcpu(vcpu, kvm) {
>> +                     if (kvm->last_boosted_vcpu_id < 0 && !pass) {
>> +                             pass = 1;
>> +                             break;
>> +                     }
>> +                     if (!pass && !firststart &&
>> +                         vcpu->vcpu_id != kvm->last_boosted_vcpu_id) {
>> +                             continue;
>> +                     } else if (!pass && !firststart) {
>> +                             firststart = 1;
>>                               continue;
>> -                     } else if (pass && i > last_boosted_vcpu)
>> +                     } else if (pass && !lastone) {
>> +                             if (vcpu->vcpu_id == kvm->last_boosted_vcpu_id)
>> +                                     lastone = 1;
>> +                     } else if (pass && lastone)
>>                               break;
>> +
>
> Seems like a large change.  Is this because the vcpu list is unordered?
> Maybe it's better to order it.
>
To find the last boosted vcpu (I guest it is more likely the lock
holder), we must enumerate the vcpu link list. While implemented by
kvm->vcpus[], it is more facile.

> Rik?
>
>>                       if (vcpu == me)
>>                               continue;
>>                       if (waitqueue_active(&vcpu->wq))
>> @@ -1629,15 +1643,20 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>                               put_task_struct(task);
>>                               continue;
>>                       }
>> +
>>                       if (yield_to(task, 1)) {
>>                               put_task_struct(task);
>> -                             kvm->last_boosted_vcpu = i;
>> +                             mutex_lock(&kvm->lock);
>> +                             kvm->last_boosted_vcpu_id = vcpu->vcpu_id;
>> +                             mutex_unlock(&kvm->lock);
>
> Why take the mutex?
>
In kvm_vcpu_release()
       mutex_lock(&kvm->lock);
       if (kvm->last_boosted_vcpu_id == vcpu->vcpu_id)

----------------------------------------->CAN NOT break
               kvm->last_boosted_vcpu_id = -1;
       mutex_unlock(&kvm->lock);

>> @@ -1673,11 +1692,30 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
>>       return 0;
>>  }
>>
>> +static void kvm_vcpu_destruct(struct kvm_vcpu *vcpu)
>> +{
>> +     kvm_arch_vcpu_destruct(vcpu);
>> +}
>> +
>>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
>>  {
>>       struct kvm_vcpu *vcpu = filp->private_data;
>> +     struct kvm *kvm = vcpu->kvm;
>> +     filp->private_data = NULL;
>> +
>> +     mutex_lock(&kvm->lock);
>> +     list_del_rcu(&vcpu->list);
>> +     atomic_dec(&kvm->online_vcpus);
>> +     mutex_unlock(&kvm->lock);
>> +     synchronize_srcu_expedited(&kvm->srcu);
>
> Why _expedited?
>
> Even better would be call_srcu() but it doesn't exist.
>
> I think we can actually use regular rcu.  The only user that blocks is
> kvm_vcpu_on_spin(), yes? so we can convert the vcpu to a task using
> get_pid_task(), then, outside the rcu lock, call yield_to().
>
Yes,  kvm_vcpu_on_spin() is the only one. But I think if outside the
rcu lock, call yield_to(), it will be like the following

again:
    rcu_lock()
    kvm_for_each_vcpu(){
    ......
    }
    rcu_unlock()
    if (yield_to(task, 1)) {
    .....
    } else
        goto again;

We must travel through the linked list again to find the next vcpu.

>
>>
>> -     kvm_put_kvm(vcpu->kvm);
>> +     mutex_lock(&kvm->lock);
>> +     if (kvm->last_boosted_vcpu_id == vcpu->vcpu_id)
>> +             kvm->last_boosted_vcpu_id = -1;
>> +     mutex_unlock(&kvm->lock);
>> +
>> +     /*vcpu is out of list,drop it safely*/
>> +     kvm_vcpu_destruct(vcpu);
>
> Can all kvm_arch_vcpu_destroy() directly.
>
>> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
>> +{
>> +     struct kvm_vcpu *vcpu;
>> +     vcpu = kvm_arch_vcpu_create(kvm, id);
>> +     if (IS_ERR(vcpu))
>> +             return vcpu;
>> +     INIT_LIST_HEAD(&vcpu->list);
>
> Really needed?
>
Yes, it is unnecessary
>> +     return vcpu;
>> +}
>
> Just fold this into the caller.
>
OK

Thanks and regards,
ping fan
>> +
>>  /*
>>   * Creates some virtual cpus.  Good luck creating more than one.
>>   */
>>  static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>>  {
>> -     int r;
>> +     int r, idx;
>>       struct kvm_vcpu *vcpu, *v;
>>
>> -     vcpu = kvm_arch_vcpu_create(kvm, id);
>> +     vcpu = kvm_vcpu_create(kvm, id);
>>       if (IS_ERR(vcpu))
>>               return PTR_ERR(vcpu);
>>
>> @@ -1723,13 +1771,15 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>>               goto unlock_vcpu_destroy;
>>       }
>>
>> -     kvm_for_each_vcpu(r, v, kvm)
>> +     idx = srcu_read_lock(&kvm->srcu);
>> +     kvm_for_each_vcpu(v, kvm) {
>>               if (v->vcpu_id == id) {
>>                       r = -EEXIST;
>> +                     srcu_read_unlock(&kvm->srcu, idx);
>
> Put that in the error path please (add a new label if needed).
>
>>                       goto unlock_vcpu_destroy;
>
>>
>> -     kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
>> -     smp_wmb();
>> +     /*Protected by kvm->lock*/
>
> Spaces.
>
>> +     list_add_rcu(&vcpu->list, &kvm->vcpus);
>>       atomic_inc(&kvm->online_vcpus);
>
>
>
> --
> error compiling committee.c: too many arguments to function
>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v7] kvm: make vcpu life cycle separated from kvm instance
  2012-01-15 13:17                   ` Liu ping fan
@ 2012-01-15 13:37                     ` Avi Kivity
  0 siblings, 0 replies; 78+ messages in thread
From: Avi Kivity @ 2012-01-15 13:37 UTC (permalink / raw)
  To: Liu ping fan
  Cc: kvm, linux-kernel, aliguori, gleb, mtosatti, xiaoguangrong.eric,
	jan.kiszka, yoshikawa.takuya, Rik van Riel

On 01/15/2012 03:17 PM, Liu ping fan wrote:
> On Thu, Jan 12, 2012 at 8:37 PM, Avi Kivity <avi@redhat.com> wrote:
> > On 01/07/2012 04:55 AM, Liu Ping Fan wrote:
> >> From: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
> >>
> >> Currently, vcpu will be destructed only after kvm instance is
> >> destroyed. This result to vcpu keep idle in kernel, but can not
> >> be freed when it is unplugged in guest.
> >>
> >> Change this to vcpu's destruction before kvm instance, so vcpu MUST
> >
> > Must?
> >
> Yes, in kvm_arch_vcpu_destruct-->kvm_put_kvm(kvm); so after all vcpu
> destroyed, then can kvm instance

Oh.  Words like MUST imply that the user has to do something different. 
It's just that the normal order of operations changes.

> >> and CAN be destroyed before kvm instance. By this way, we can remove
> >> vcpu when guest does not need it any longer.
> >>
> >> TODO: push changes to other archs besides x86.
> >>
> >> -Rename kvm_vcpu_zap to kvm_vcpu_destruct and so on.
> >
> > kvm_vcpu_destroy.
> >
> The name "kvm_arch_vcpu_destroy" is already occupied in different arch.

It's actually in all archs.  But having both kvm_arch_vcpu_destroy() and
kvm_arch_vcpu_destruct() isn't going to make the code more
understandable, need to merge the two, or find different names.

> So change
>   kvm_vcpu_zap -> kvm_vcpu_destruct
>   kvm_vcpu_arch_zap -> kvm_vcpu_arch_destruct



> >> -     struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> >> +     struct list_head vcpus;
> >
> > This has the potential for a slight performance regression by bouncing
> > an extra cache line, but it's acceptable IMO.  We can always introduce
>
> Sorry, not clear about this scene, do you mean that the changing of
> vcpu link list will cause the invalid of cache between SMP? But the
> link list is not changed often.

No, I mean that kvm_for_each_vcpu() now has to bounce a cacheline for
each vcpu, in order to read the link.

> >> +             kvm_for_each_vcpu(vcpu, kvm) {
> >> +                     if (kvm->last_boosted_vcpu_id < 0 && !pass) {
> >> +                             pass = 1;
> >> +                             break;
> >> +                     }
> >> +                     if (!pass && !firststart &&
> >> +                         vcpu->vcpu_id != kvm->last_boosted_vcpu_id) {
> >> +                             continue;
> >> +                     } else if (!pass && !firststart) {
> >> +                             firststart = 1;
> >>                               continue;
> >> -                     } else if (pass && i > last_boosted_vcpu)
> >> +                     } else if (pass && !lastone) {
> >> +                             if (vcpu->vcpu_id == kvm->last_boosted_vcpu_id)
> >> +                                     lastone = 1;
> >> +                     } else if (pass && lastone)
> >>                               break;
> >> +
> >
> > Seems like a large change.  Is this because the vcpu list is unordered?
> > Maybe it's better to order it.
> >
> To find the last boosted vcpu (I guest it is more likely the lock
> holder), we must enumerate the vcpu link list. While implemented by
> kvm->vcpus[], it is more facile.

Please simplify this code, it's pretty complicated.

> >> +
> >>                       if (yield_to(task, 1)) {
> >>                               put_task_struct(task);
> >> -                             kvm->last_boosted_vcpu = i;
> >> +                             mutex_lock(&kvm->lock);
> >> +                             kvm->last_boosted_vcpu_id = vcpu->vcpu_id;
> >> +                             mutex_unlock(&kvm->lock);
> >
> > Why take the mutex?
> >
> In kvm_vcpu_release()
>        mutex_lock(&kvm->lock);
>        if (kvm->last_boosted_vcpu_id == vcpu->vcpu_id)
>
> ----------------------------------------->CAN NOT break
>                kvm->last_boosted_vcpu_id = -1;
>        mutex_unlock(&kvm->lock);

It's not pretty taking a vm-wide lock here.  Just make the code
resilient to incorrect vcpu_id.  If it doesn't find
last_boosted_vcpu_id, it should just pick something, like the first or
last vcpu in the list.

>
> >>  static int kvm_vcpu_release(struct inode *inode, struct file *filp)
> >>  {
> >>       struct kvm_vcpu *vcpu = filp->private_data;
> >> +     struct kvm *kvm = vcpu->kvm;
> >> +     filp->private_data = NULL;
> >> +
> >> +     mutex_lock(&kvm->lock);
> >> +     list_del_rcu(&vcpu->list);
> >> +     atomic_dec(&kvm->online_vcpus);
> >> +     mutex_unlock(&kvm->lock);
> >> +     synchronize_srcu_expedited(&kvm->srcu);
> >
> > Why _expedited?
> >
> > Even better would be call_srcu() but it doesn't exist.
> >
> > I think we can actually use regular rcu.  The only user that blocks is
> > kvm_vcpu_on_spin(), yes? so we can convert the vcpu to a task using
> > get_pid_task(), then, outside the rcu lock, call yield_to().
> >
> Yes,  kvm_vcpu_on_spin() is the only one. But I think if outside the
> rcu lock, call yield_to(), it will be like the following
>
> again:
>     rcu_lock()
>     kvm_for_each_vcpu(){
>     ......
>     }
>     rcu_unlock()
>     if (yield_to(task, 1)) {
>     .....
>     } else
>         goto again;
>
> We must travel through the linked list again to find the next vcpu.

Annoying... maybe we should use an array instead of a list after all.

>
> >
> >> +static struct kvm_vcpu *kvm_vcpu_create(struct kvm *kvm, u32 id)
> >> +{
> >> +     struct kvm_vcpu *vcpu;
> >> +     vcpu = kvm_arch_vcpu_create(kvm, id);
> >> +     if (IS_ERR(vcpu))
> >> +             return vcpu;
> >> +     INIT_LIST_HEAD(&vcpu->list);
> >
> > Really needed?
> >
> Yes, it is unnecessary

Why?  list_add_rcu() will overwrite it anyway.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2012-01-15 13:37 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-25  2:35 [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Liu Ping Fan
2011-11-25  2:35 ` [PATCH 1/2] kvm: make vcpu life cycle separated from kvm instance Liu Ping Fan
2011-11-27 10:36   ` Avi Kivity
2011-12-02  6:26     ` [PATCH] " Liu Ping Fan
2011-12-02 18:26       ` Jan Kiszka
2011-12-04 11:53         ` Liu ping fan
2011-12-04 12:10           ` Gleb Natapov
2011-12-05  5:39             ` Liu ping fan
2011-12-05  8:41               ` Gleb Natapov
2011-12-06  6:54                 ` Liu ping fan
2011-12-06  8:14                   ` Gleb Natapov
2011-12-04 10:23       ` Avi Kivity
2011-12-05  5:29         ` Liu ping fan
2011-12-05  9:30           ` Avi Kivity
2011-12-05  9:42             ` Gleb Natapov
2011-12-05  9:58               ` Avi Kivity
2011-12-05 10:18                 ` Gleb Natapov
2011-12-05 10:22                   ` Avi Kivity
2011-12-05 10:40                     ` Gleb Natapov
2011-12-09  5:23       ` [PATCH V2] " Liu Ping Fan
2011-12-09 14:23         ` Gleb Natapov
2011-12-12  2:41           ` [PATCH v3] " Liu Ping Fan
2011-12-12 12:54             ` Gleb Natapov
2011-12-13  9:29               ` Liu ping fan
2011-12-13  9:47                 ` Gleb Natapov
2011-12-13 11:36             ` Marcelo Tosatti
2011-12-13 11:54               ` Gleb Natapov
2011-12-15  3:21               ` Liu ping fan
2011-12-15  4:28                 ` [PATCH v4] " Liu Ping Fan
2011-12-15  5:33                   ` Xiao Guangrong
2011-12-15  6:53                     ` Liu ping fan
2011-12-15  8:25                       ` Xiao Guangrong
2011-12-15  8:57                         ` Xiao Guangrong
2011-12-15  6:48                   ` Takuya Yoshikawa
2011-12-16  9:38                     ` Marcelo Tosatti
2011-12-17  3:57                     ` Liu ping fan
2011-12-19  1:16                       ` Takuya Yoshikawa
2011-12-15  9:10                   ` Gleb Natapov
2011-12-16  7:50                     ` Liu ping fan
2011-12-15  8:33                 ` [PATCH v3] " Gleb Natapov
2011-12-15  9:06                   ` Liu ping fan
2011-12-15  9:08                     ` Gleb Natapov
2011-12-17  3:19             ` [PATCH v5] " Liu Ping Fan
2011-12-26 11:09               ` Gleb Natapov
2011-12-26 11:17                 ` Avi Kivity
2011-12-26 11:21                   ` Gleb Natapov
2011-12-27  7:53                 ` Liu ping fan
2011-12-27  8:38               ` [PATCH v6] " Liu Ping Fan
2011-12-27 11:22                 ` Takuya Yoshikawa
2011-12-28  6:54                   ` Liu ping fan
2011-12-28  9:53                     ` Avi Kivity
2011-12-29 14:03                       ` Liu ping fan
2011-12-29 14:31                         ` Avi Kivity
2012-01-05  9:35                           ` Liu ping fan
2011-12-28 10:29                     ` Takuya Yoshikawa
2011-12-28  9:53                 ` Avi Kivity
2011-12-28  9:54                   ` Avi Kivity
2011-12-28 10:19                     ` Takuya Yoshikawa
2011-12-28 10:28                       ` Avi Kivity
2012-01-07  2:55               ` [PATCH v7] " Liu Ping Fan
2012-01-12 12:37                 ` Avi Kivity
2012-01-15 13:17                   ` Liu ping fan
2012-01-15 13:37                     ` Avi Kivity
2011-11-25 17:54 ` [PATCH 0] A series patches for kvm&qemu to enable vcpu destruction in kvm Jan Kiszka
2011-11-27  3:07   ` Liu ping fan
2011-11-27  2:42 ` [PATCH 2/2] kvm: exit to userspace with reason KVM_EXIT_VCPU_DEAD Liu Ping Fan
2011-11-27 10:36   ` Avi Kivity
2011-11-27 10:50     ` [Qemu-devel] " Gleb Natapov
2011-11-28  7:16       ` Liu ping fan
2011-11-28  8:46         ` Gleb Natapov
2011-11-27  2:45 ` [PATCH 1/5] QEMU Add cpu_phyid_to_cpu() to map cpu phyid to CPUState Liu Ping Fan
2011-11-27  2:45 ` [PATCH 2/5] QEMU Add cpu_free() to support arch related CPUState release Liu Ping Fan
2011-11-27  2:45 ` [PATCH 3/5] QEMU Introduce a pci device "cpustate" to get CPU_DEAD event in guest Liu Ping Fan
2011-11-27 10:56   ` [Qemu-devel] " Gleb Natapov
2011-11-27  2:45 ` [PATCH 4/5] QEMU Release vcpu and finally exit vcpu thread safely Liu Ping Fan
2011-11-27  2:45 ` [PATCH 5/5] QEMU tmp patches for linux-header files Liu Ping Fan
2011-11-27  2:47 ` [PATCH] virtio: add a pci driver to notify host the CPU_DEAD event Liu Ping Fan
2011-11-27 11:10   ` [Qemu-devel] " Gleb Natapov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).