* [PATCH 0/10] [RFC] more fine grained locking for IRQ injection
@ 2009-07-14 14:30 Gleb Natapov
  2009-07-14 14:30 ` [PATCH 01/10] Move irq routing data structure to rcu locking Gleb Natapov
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti

kvm->irq_lock protects too much stuff, but still fails to protect
everything it was designed to protect (see the ack notifiers call in
the pic). I want to make the IRQ injection logic use more fine-grained
locking. This patch series splits the kvm->irq_lock mutex into smaller
spinlocks, each protecting only one thing: irq routing, the irq
notifier lists and the ioapic each gain their own spinlock, and the pic
already uses its own lock. The series also makes interrupt injection
into the lapic lockless (several kvm_irq_delivery_to_apic() calls may
run in parallel); access to the lapic was never fully locked in the
first place anyway, since a VCPU could access it in parallel with
interrupt injection.
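
To make the end state concrete, here is a rough sketch of who protects
what once the whole series is applied (field and lock names as the
patches introduce them; an illustration, not a hunk from the series):

  struct kvm {
          ...
          /* RCU-protected: writers swap the pointer under
           * irq_routing_lock, readers only take rcu_read_lock() */
          struct kvm_irq_routing_table *irq_routing;
          spinlock_t irq_routing_lock;

          /* RCU lists: add/del under irq_notifier_list_lock, walkers
           * use rcu_read_lock() + hlist_for_each_entry_rcu() */
          struct hlist_head mask_notifier_list;
          struct hlist_head irq_ack_notifier_list;
          spinlock_t irq_notifier_list_lock;
          ...
  };

  struct kvm_ioapic {
          ...
          spinlock_t lock;        /* serializes redirtbl/irr/ioregsel */
  };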

This patch series is exactly the previous two combined, plus one more
patch that changes irq_routing to use a gsi-indexed array to map from
gsi to irq chip. This time the series comes with numbers. Here is
oprofile data from one Windows XP boot with two cpus, generated by the
command
 "opreport -l | grep kvm | awk '{print $1" "$5}' | grep irq"

Master:                            With patches:
996 kvm_set_irq                    412 __apic_accept_irq
502 __apic_accept_irq              298 kvm_set_irq
250 kvm_irq_delivery_to_apic       228 kvm_irq_delivery_to_apic
162 kvm_ioapic_set_irq             178 kvm_ioapic_set_irq
162 kvm_pic_set_irq                137 enable_irq_window
 95 kvm_apic_set_irq               122 kvm_pic_set_irq
 82 kvm_set_pic_irq                 71 kvm_apic_set_irq
 77 enable_irq_window               55 pic_get_irq
 60 vmx_inject_irq                  48 vmx_inject_irq
 52 pic_get_irq                     44 kvm_set_ioapic_irq
 49 kvm_set_ioapic_irq              42 kvm_set_pic_irq
 48 pic_update_irq                  42 pic_update_irq
 40 pic_irq_request                 17 kvm_notify_acked_irq
 37 kvm_notify_acked_irq            17 pic_irq_request
                                     1 kvm_inject_pit_timer_irqs

And the same with the patchset plus a small userspace hack (remove the
pic from the irq routing table when the guest stops using it):
418 __apic_accept_irq
242 kvm_irq_delivery_to_apic
212 kvm_set_irq
112 kvm_ioapic_set_irq
98 enable_irq_window
81 kvm_apic_set_irq
72 vmx_inject_irq
60 kvm_pic_set_irq
30 kvm_set_ioapic_irq
20 pic_get_irq
12 pic_update_irq
11 kvm_set_pic_irq
10 pic_irq_request
7 kvm_notify_acked_irq

Gleb Natapov (10):
  Move irq routing data structure to rcu locking.
  Unregister ack notifier callback on PIT freeing.
  Move irq ack notifier list to arch independent code.
  Convert irq notifiers lists to RCU locking.
  Protect irq_sources_bitmap by kvm->lock instead of kvm->irq_lock
  Move irq routing to its own locking.
  Move irq notifiers lists to its own locking.
  Move IO APIC to its own lock.
  Drop kvm->irq_lock lock.
  Change irq routing table to use gsi indexed array.

 arch/ia64/include/asm/kvm_host.h |    1 -
 arch/ia64/kvm/kvm-ia64.c         |   29 +++++++---
 arch/x86/include/asm/kvm_host.h  |    1 -
 arch/x86/kvm/i8254.c             |    6 +-
 arch/x86/kvm/lapic.c             |    7 +--
 arch/x86/kvm/x86.c               |   32 +++++++----
 include/linux/kvm_host.h         |   15 ++++-
 virt/kvm/eventfd.c               |    2 -
 virt/kvm/ioapic.c                |   50 ++++++++++------
 virt/kvm/ioapic.h                |    1 +
 virt/kvm/irq_comm.c              |  117 +++++++++++++++++++++-----------------
 virt/kvm/kvm_main.c              |    8 +-
 12 files changed, 157 insertions(+), 112 deletions(-)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 01/10] Move irq routing data structure to rcu locking
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-14 14:30 ` [PATCH 02/10] Unregister ack notifier callback on PIT freeing Gleb Natapov
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti


Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/kvm_host.h |    2 +-
 virt/kvm/irq_comm.c      |   55 +++++++++++++++++++++-------------------------
 virt/kvm/kvm_main.c      |    1 -
 3 files changed, 26 insertions(+), 32 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f244f11..2490816 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -166,7 +166,7 @@ struct kvm {
 
 	struct mutex irq_lock;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
-	struct list_head irq_routing; /* of kvm_kernel_irq_routing_entry */
+	struct kvm_kernel_irq_routing_entry *irq_routing;
 	struct hlist_head mask_notifier_list;
 #endif
 
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 100c267..0283a2b 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -150,7 +150,8 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
 	 * IOAPIC.  So set the bit in both. The guest will ignore
 	 * writes to the unused one.
 	 */
-	list_for_each_entry(e, &kvm->irq_routing, link)
+	rcu_read_lock();
+	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
 		if (e->gsi == irq) {
 			int r = e->set(e, kvm, sig_level);
 			if (r < 0)
@@ -158,6 +159,8 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
 
 			ret = r + ((ret < 0) ? 0 : ret);
 		}
+	}
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -170,12 +173,15 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 
 	trace_kvm_ack_irq(irqchip, pin);
 
-	list_for_each_entry(e, &kvm->irq_routing, link)
+	rcu_read_lock();
+	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
 		if (e->irqchip.irqchip == irqchip &&
 		    e->irqchip.pin == pin) {
 			gsi = e->gsi;
 			break;
 		}
+	}
+	rcu_read_unlock();
 
 	hlist_for_each_entry(kian, n, &kvm->arch.irq_ack_notifier_list, link)
 		if (kian->gsi == gsi)
@@ -266,19 +272,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)
 			kimn->func(kimn, mask);
 }
 
-static void __kvm_free_irq_routing(struct list_head *irq_routing)
-{
-	struct kvm_kernel_irq_routing_entry *e, *n;
-
-	list_for_each_entry_safe(e, n, irq_routing, link)
-		kfree(e);
-}
-
 void kvm_free_irq_routing(struct kvm *kvm)
 {
-	mutex_lock(&kvm->irq_lock);
-	__kvm_free_irq_routing(&kvm->irq_routing);
-	mutex_unlock(&kvm->irq_lock);
+	/* Called only during vm destruction. Nobody can use the pointer
+	   at this stage */
+	kfree(kvm->irq_routing);
 }
 
 static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
@@ -328,43 +326,40 @@ int kvm_set_irq_routing(struct kvm *kvm,
 			unsigned nr,
 			unsigned flags)
 {
-	struct list_head irq_list = LIST_HEAD_INIT(irq_list);
-	struct list_head tmp = LIST_HEAD_INIT(tmp);
-	struct kvm_kernel_irq_routing_entry *e = NULL;
+	struct kvm_kernel_irq_routing_entry *new, *old;
 	unsigned i;
 	int r;
 
+	/* last element is left zeroed and indicates the end of the array */
+	new = kzalloc(sizeof(*new) * (nr + 1), GFP_KERNEL);
+
+	if (!new)
+		return -ENOMEM;
+
 	for (i = 0; i < nr; ++i) {
 		r = -EINVAL;
 		if (ue->gsi >= KVM_MAX_IRQ_ROUTES)
 			goto out;
 		if (ue->flags)
 			goto out;
-		r = -ENOMEM;
-		e = kzalloc(sizeof(*e), GFP_KERNEL);
-		if (!e)
-			goto out;
-		r = setup_routing_entry(e, ue);
+		r = setup_routing_entry(new + i, ue);
 		if (r)
 			goto out;
 		++ue;
-		list_add(&e->link, &irq_list);
-		e = NULL;
 	}
 
 	mutex_lock(&kvm->irq_lock);
-	list_splice(&kvm->irq_routing, &tmp);
-	INIT_LIST_HEAD(&kvm->irq_routing);
-	list_splice(&irq_list, &kvm->irq_routing);
-	INIT_LIST_HEAD(&irq_list);
-	list_splice(&tmp, &irq_list);
+	old = kvm->irq_routing;
+	rcu_assign_pointer(kvm->irq_routing, new);
 	mutex_unlock(&kvm->irq_lock);
 
+	synchronize_rcu();
+
 	r = 0;
+	new = old;
 
 out:
-	kfree(e);
-	__kvm_free_irq_routing(&irq_list);
+	kfree(new);
 	return r;
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7cd1c10..75bf72f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -945,7 +945,6 @@ static struct kvm *kvm_create_vm(void)
 	if (IS_ERR(kvm))
 		goto out;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
-	INIT_LIST_HEAD(&kvm->irq_routing);
 	INIT_HLIST_HEAD(&kvm->mask_notifier_list);
 #endif
 
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 02/10] Unregister ack notifier callback on PIT freeing.
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
  2009-07-14 14:30 ` [PATCH 01/10] Move irq routing data structure to rcu locking Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-14 14:30 ` [PATCH 03/10] Move irq ack notifier list to arch independent code Gleb Natapov
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti


Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/i8254.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 137e548..472653c 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -672,6 +672,8 @@ void kvm_free_pit(struct kvm *kvm)
 	if (kvm->arch.vpit) {
 		kvm_unregister_irq_mask_notifier(kvm, 0,
 					       &kvm->arch.vpit->mask_notifier);
+		kvm_unregister_irq_ack_notifier(kvm,
+				&kvm->arch.vpit->pit_state.irq_ack_notifier);
 		mutex_lock(&kvm->arch.vpit->pit_state.lock);
 		timer = &kvm->arch.vpit->pit_state.pit_timer.timer;
 		hrtimer_cancel(timer);
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 03/10] Move irq ack notifier list to arch independent code.
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
  2009-07-14 14:30 ` [PATCH 01/10] Move irq routing data structure to rcu locking Gleb Natapov
  2009-07-14 14:30 ` [PATCH 02/10] Unregister ack notifier callback on PIT freeing Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-14 14:30 ` [PATCH 04/10] Convert irq notifiers lists to RCU locking Gleb Natapov
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti

The mask irq notifier list is already there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/ia64/include/asm/kvm_host.h |    1 -
 arch/x86/include/asm/kvm_host.h  |    1 -
 include/linux/kvm_host.h         |    1 +
 virt/kvm/irq_comm.c              |    4 ++--
 virt/kvm/kvm_main.c              |    1 +
 5 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index d9b6325..a362e67 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -475,7 +475,6 @@ struct kvm_arch {
 	struct list_head assigned_dev_head;
 	struct iommu_domain *iommu_domain;
 	int iommu_flags;
-	struct hlist_head irq_ack_notifier_list;
 
 	unsigned long irq_sources_bitmap;
 	unsigned long irq_states[KVM_IOAPIC_NUM_PINS];
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 08732d7..eb723a7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -401,7 +401,6 @@ struct kvm_arch{
 	struct kvm_pic *vpic;
 	struct kvm_ioapic *vioapic;
 	struct kvm_pit *vpit;
-	struct hlist_head irq_ack_notifier_list;
 	int vapics_in_nmi_mode;
 
 	unsigned int tss_addr;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2490816..b53a5b8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -168,6 +168,7 @@ struct kvm {
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
 	struct kvm_kernel_irq_routing_entry *irq_routing;
 	struct hlist_head mask_notifier_list;
+	struct hlist_head irq_ack_notifier_list;
 #endif
 
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 0283a2b..cce32de 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -183,7 +183,7 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 	}
 	rcu_read_unlock();
 
-	hlist_for_each_entry(kian, n, &kvm->arch.irq_ack_notifier_list, link)
+	hlist_for_each_entry(kian, n, &kvm->irq_ack_notifier_list, link)
 		if (kian->gsi == gsi)
 			kian->irq_acked(kian);
 }
@@ -192,7 +192,7 @@ void kvm_register_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian)
 {
 	mutex_lock(&kvm->irq_lock);
-	hlist_add_head(&kian->link, &kvm->arch.irq_ack_notifier_list);
+	hlist_add_head(&kian->link, &kvm->irq_ack_notifier_list);
 	mutex_unlock(&kvm->irq_lock);
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 75bf72f..ceaa478 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -946,6 +946,7 @@ static struct kvm *kvm_create_vm(void)
 		goto out;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
 	INIT_HLIST_HEAD(&kvm->mask_notifier_list);
+	INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
 #endif
 
 #ifdef KVM_COALESCED_MMIO_PAGE_OFFSET
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 04/10] Convert irq notifiers lists to RCU locking.
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
                   ` (2 preceding siblings ...)
  2009-07-14 14:30 ` [PATCH 03/10] Move irq ack notifier list to arch independent code Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-14 14:30 ` [PATCH 05/10] Protect irq_sources_bitmap by kvm->lock instead of kvm->irq_lock Gleb Natapov
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti

Use RCU locking for the mask/ack notifier lists.
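
The pattern, sketched below (it matches the hunks that follow), is that
list mutation stays under the existing mutex, walkers become lockless
RCU readers, and unregister gains a synchronize_rcu() so the caller may
free the notifier as soon as the function returns:

  /* writer side */
  mutex_lock(&kvm->irq_lock);
  hlist_del_init_rcu(&kian->link);
  mutex_unlock(&kvm->irq_lock);
  synchronize_rcu();    /* wait out all current rcu_read_lock() walkers;
                         * 'kian' may be freed from here on */

  /* reader side */
  rcu_read_lock();
  hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
          if (kian->gsi == gsi)
                  kian->irq_acked(kian);
  rcu_read_unlock();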

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 virt/kvm/irq_comm.c |   20 +++++++++++---------
 1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index cce32de..6c57e46 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -181,18 +181,18 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 			break;
 		}
 	}
-	rcu_read_unlock();
 
-	hlist_for_each_entry(kian, n, &kvm->irq_ack_notifier_list, link)
+	hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
 		if (kian->gsi == gsi)
 			kian->irq_acked(kian);
+	rcu_read_unlock();
 }
 
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian)
 {
 	mutex_lock(&kvm->irq_lock);
-	hlist_add_head(&kian->link, &kvm->irq_ack_notifier_list);
+	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
 	mutex_unlock(&kvm->irq_lock);
 }
 
@@ -200,8 +200,9 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 				    struct kvm_irq_ack_notifier *kian)
 {
 	mutex_lock(&kvm->irq_lock);
-	hlist_del_init(&kian->link);
+	hlist_del_init_rcu(&kian->link);
 	mutex_unlock(&kvm->irq_lock);
+	synchronize_rcu();
 }
 
 int kvm_request_irq_source_id(struct kvm *kvm)
@@ -248,7 +249,7 @@ void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
 {
 	mutex_lock(&kvm->irq_lock);
 	kimn->irq = irq;
-	hlist_add_head(&kimn->link, &kvm->mask_notifier_list);
+	hlist_add_head_rcu(&kimn->link, &kvm->mask_notifier_list);
 	mutex_unlock(&kvm->irq_lock);
 }
 
@@ -256,8 +257,9 @@ void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
 				      struct kvm_irq_mask_notifier *kimn)
 {
 	mutex_lock(&kvm->irq_lock);
-	hlist_del(&kimn->link);
+	hlist_del_rcu(&kimn->link);
 	mutex_unlock(&kvm->irq_lock);
+	synchronize_rcu();
 }
 
 void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)
@@ -265,11 +267,11 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, int irq, bool mask)
 	struct kvm_irq_mask_notifier *kimn;
 	struct hlist_node *n;
 
-	WARN_ON(!mutex_is_locked(&kvm->irq_lock));
-
-	hlist_for_each_entry(kimn, n, &kvm->mask_notifier_list, link)
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(kimn, n, &kvm->mask_notifier_list, link)
 		if (kimn->irq == irq)
 			kimn->func(kimn, mask);
+	rcu_read_unlock();
 }
 
 void kvm_free_irq_routing(struct kvm *kvm)
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 05/10] Protect irq_sources_bitmap by kvm->lock instead of kvm->irq_lock
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
                   ` (3 preceding siblings ...)
  2009-07-14 14:30 ` [PATCH 04/10] Convert irq notifiers lists to RCU locking Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-14 14:30 ` [PATCH 06/10] Move irq routing to its own locking Gleb Natapov
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti

It is already protected by kvm->lock on the device assignment path.
Just take the same lock in the PIT code.
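
In other words, the contract becomes (sketch):

  mutex_lock(&kvm->lock);
  /* now WARNs unless kvm->lock is held by the caller */
  pit->irq_source_id = kvm_request_irq_source_id(kvm);
  mutex_unlock(&kvm->lock);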

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/x86/kvm/i8254.c |    2 ++
 virt/kvm/irq_comm.c  |    8 ++++----
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 472653c..59021da 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -611,7 +611,9 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
 	if (!pit)
 		return NULL;
 
+	mutex_lock(&kvm->lock);
 	pit->irq_source_id = kvm_request_irq_source_id(kvm);
+	mutex_unlock(&kvm->lock);
 	if (pit->irq_source_id < 0) {
 		kfree(pit);
 		return NULL;
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 6c57e46..ce8fcd3 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -210,7 +210,8 @@ int kvm_request_irq_source_id(struct kvm *kvm)
 	unsigned long *bitmap = &kvm->arch.irq_sources_bitmap;
 	int irq_source_id;
 
-	mutex_lock(&kvm->irq_lock);
+	WARN_ON(!mutex_is_locked(&kvm->lock));
+
 	irq_source_id = find_first_zero_bit(bitmap,
 				sizeof(kvm->arch.irq_sources_bitmap));
 
@@ -221,7 +222,6 @@ int kvm_request_irq_source_id(struct kvm *kvm)
 
 	ASSERT(irq_source_id != KVM_USERSPACE_IRQ_SOURCE_ID);
 	set_bit(irq_source_id, bitmap);
-	mutex_unlock(&kvm->irq_lock);
 
 	return irq_source_id;
 }
@@ -230,9 +230,10 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
 {
 	int i;
 
+	/* during vm destruction this function is called without locking */
+	WARN_ON(!mutex_is_locked(&kvm->lock) && atomic_read(&kvm->users_count));
 	ASSERT(irq_source_id != KVM_USERSPACE_IRQ_SOURCE_ID);
 
-	mutex_lock(&kvm->irq_lock);
 	if (irq_source_id < 0 ||
 	    irq_source_id >= sizeof(kvm->arch.irq_sources_bitmap)) {
 		printk(KERN_ERR "kvm: IRQ source ID out of range!\n");
@@ -241,7 +242,6 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
 	for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++)
 		clear_bit(irq_source_id, &kvm->arch.irq_states[i]);
 	clear_bit(irq_source_id, &kvm->arch.irq_sources_bitmap);
-	mutex_unlock(&kvm->irq_lock);
 }
 
 void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 06/10] Move irq routing to its own locking.
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
                   ` (4 preceding siblings ...)
  2009-07-14 14:30 ` [PATCH 05/10] Protect irq_sources_bitmap by kvm->lock instead of kvm->irq_lock Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-14 14:30 ` [PATCH 07/10] Move irq notifiers lists " Gleb Natapov
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti


Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/kvm_host.h |    1 +
 virt/kvm/irq_comm.c      |    5 ++---
 virt/kvm/kvm_main.c      |    1 +
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b53a5b8..8ca15a0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -167,6 +167,7 @@ struct kvm {
 	struct mutex irq_lock;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
 	struct kvm_kernel_irq_routing_entry *irq_routing;
+	spinlock_t irq_routing_lock;
 	struct hlist_head mask_notifier_list;
 	struct hlist_head irq_ack_notifier_list;
 #endif
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index ce8fcd3..3dbba41 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -350,11 +350,10 @@ int kvm_set_irq_routing(struct kvm *kvm,
 		++ue;
 	}
 
-	mutex_lock(&kvm->irq_lock);
+	spin_lock(&kvm->irq_routing_lock);
 	old = kvm->irq_routing;
 	rcu_assign_pointer(kvm->irq_routing, new);
-	mutex_unlock(&kvm->irq_lock);
-
+	spin_unlock(&kvm->irq_routing_lock);
 	synchronize_rcu();
 
 	r = 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ceaa478..1d70da3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -945,6 +945,7 @@ static struct kvm *kvm_create_vm(void)
 	if (IS_ERR(kvm))
 		goto out;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
+	spin_lock_init(&kvm->irq_routing_lock);
 	INIT_HLIST_HEAD(&kvm->mask_notifier_list);
 	INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
 #endif
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 07/10] Move irq notifiers lists to its own locking.
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
                   ` (5 preceding siblings ...)
  2009-07-14 14:30 ` [PATCH 06/10] Move irq routing to its own locking Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-14 14:30 ` [PATCH 08/10] Move IO APIC to its own lock Gleb Natapov
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti


Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/kvm_host.h |    1 +
 virt/kvm/irq_comm.c      |   16 ++++++++--------
 virt/kvm/kvm_main.c      |    1 +
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8ca15a0..4eb5c85 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -170,6 +170,7 @@ struct kvm {
 	spinlock_t irq_routing_lock;
 	struct hlist_head mask_notifier_list;
 	struct hlist_head irq_ack_notifier_list;
+	spinlock_t irq_notifier_list_lock;
 #endif
 
 #ifdef KVM_ARCH_WANT_MMU_NOTIFIER
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 3dbba41..59c1cde 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -191,17 +191,17 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian)
 {
-	mutex_lock(&kvm->irq_lock);
+	spin_lock(&kvm->irq_notifier_list_lock);
 	hlist_add_head_rcu(&kian->link, &kvm->irq_ack_notifier_list);
-	mutex_unlock(&kvm->irq_lock);
+	spin_unlock(&kvm->irq_notifier_list_lock);
 }
 
 void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 				    struct kvm_irq_ack_notifier *kian)
 {
-	mutex_lock(&kvm->irq_lock);
+	spin_lock(&kvm->irq_notifier_list_lock);
 	hlist_del_init_rcu(&kian->link);
-	mutex_unlock(&kvm->irq_lock);
+	spin_unlock(&kvm->irq_notifier_list_lock);
 	synchronize_rcu();
 }
 
@@ -247,18 +247,18 @@ void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id)
 void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
 				    struct kvm_irq_mask_notifier *kimn)
 {
-	mutex_lock(&kvm->irq_lock);
+	spin_lock(&kvm->irq_notifier_list_lock);
 	kimn->irq = irq;
 	hlist_add_head_rcu(&kimn->link, &kvm->mask_notifier_list);
-	mutex_unlock(&kvm->irq_lock);
+	spin_unlock(&kvm->irq_notifier_list_lock);
 }
 
 void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
 				      struct kvm_irq_mask_notifier *kimn)
 {
-	mutex_lock(&kvm->irq_lock);
+	spin_lock(&kvm->irq_notifier_list_lock);
 	hlist_del_rcu(&kimn->link);
-	mutex_unlock(&kvm->irq_lock);
+	spin_unlock(&kvm->irq_notifier_list_lock);
 	synchronize_rcu();
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d70da3..9553444 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -948,6 +948,7 @@ static struct kvm *kvm_create_vm(void)
 	spin_lock_init(&kvm->irq_routing_lock);
 	INIT_HLIST_HEAD(&kvm->mask_notifier_list);
 	INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
+	spin_lock_init(&kvm->irq_notifier_list_lock);
 #endif
 
 #ifdef KVM_COALESCED_MMIO_PAGE_OFFSET
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
                   ` (6 preceding siblings ...)
  2009-07-14 14:30 ` [PATCH 07/10] Move irq notifiers lists " Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-15 17:57   ` Marcelo Tosatti
  2009-07-14 14:30 ` [PATCH 09/10] Drop kvm->irq_lock lock Gleb Natapov
  2009-07-14 14:30 ` [PATCH 10/10] Change irq routing table to use gsi indexed array Gleb Natapov
  9 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti


Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/ia64/kvm/kvm-ia64.c |   27 ++++++++++++++++++------
 arch/x86/kvm/lapic.c     |    5 +---
 arch/x86/kvm/x86.c       |   30 ++++++++++++++++++---------
 virt/kvm/ioapic.c        |   50 ++++++++++++++++++++++++++++-----------------
 virt/kvm/ioapic.h        |    1 +
 5 files changed, 73 insertions(+), 40 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 0ad09f0..dd7ef2d 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -850,9 +850,16 @@ static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm,
 
 	r = 0;
 	switch (chip->chip_id) {
-	case KVM_IRQCHIP_IOAPIC:
-		memcpy(&chip->chip.ioapic, ioapic_irqchip(kvm),
-				sizeof(struct kvm_ioapic_state));
+	case KVM_IRQCHIP_IOAPIC: {
+		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
+		if (ioapic) {
+			spin_lock(&ioapic->lock);
+			memcpy(&chip->chip.ioapic, ioapic,
+			       sizeof(struct kvm_ioapic_state));
+			spin_unlock(&ioapic->lock);
+		} else
+			r = -EINVAL;
+	}
 		break;
 	default:
 		r = -EINVAL;
@@ -867,10 +874,16 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
 
 	r = 0;
 	switch (chip->chip_id) {
-	case KVM_IRQCHIP_IOAPIC:
-		memcpy(ioapic_irqchip(kvm),
-				&chip->chip.ioapic,
-				sizeof(struct kvm_ioapic_state));
+	case KVM_IRQCHIP_IOAPIC: {
+		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
+		if (ioapic) {
+			spin_lock(&ioapic->lock);
+			memcpy(ioapic, &chip->chip.ioapic,
+			       sizeof(struct kvm_ioapic_state));
+			spin_unlock(&ioapic->lock);
+		} else
+			r = -EINVAL;
+	}
 		break;
 	default:
 		r = -EINVAL;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e2e2849..61ffcfc 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -471,11 +471,8 @@ static void apic_set_eoi(struct kvm_lapic *apic)
 		trigger_mode = IOAPIC_LEVEL_TRIG;
 	else
 		trigger_mode = IOAPIC_EDGE_TRIG;
-	if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)) {
-		mutex_lock(&apic->vcpu->kvm->irq_lock);
+	if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
 		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
-		mutex_unlock(&apic->vcpu->kvm->irq_lock);
-	}
 }
 
 static void apic_send_ipi(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48567fa..de99b84 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2010,10 +2010,16 @@ static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
 			&pic_irqchip(kvm)->pics[1],
 			sizeof(struct kvm_pic_state));
 		break;
-	case KVM_IRQCHIP_IOAPIC:
-		memcpy(&chip->chip.ioapic,
-			ioapic_irqchip(kvm),
-			sizeof(struct kvm_ioapic_state));
+	case KVM_IRQCHIP_IOAPIC: {
+		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
+		if (ioapic) {
+			spin_lock(&ioapic->lock);
+			memcpy(&chip->chip.ioapic, ioapic,
+			       sizeof(struct kvm_ioapic_state));
+			spin_unlock(&ioapic->lock);
+		} else
+			r = -EINVAL;
+	}
 		break;
 	default:
 		r = -EINVAL;
@@ -2042,12 +2048,16 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
 			sizeof(struct kvm_pic_state));
 		spin_unlock(&pic_irqchip(kvm)->lock);
 		break;
-	case KVM_IRQCHIP_IOAPIC:
-		mutex_lock(&kvm->irq_lock);
-		memcpy(ioapic_irqchip(kvm),
-			&chip->chip.ioapic,
-			sizeof(struct kvm_ioapic_state));
-		mutex_unlock(&kvm->irq_lock);
+	case KVM_IRQCHIP_IOAPIC: {
+		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
+		if (ioapic) {
+			spin_lock(&ioapic->lock);
+			memcpy(ioapic, &chip->chip.ioapic,
+			       sizeof(struct kvm_ioapic_state));
+			spin_unlock(&ioapic->lock);
+		} else
+			r = -EINVAL;
+	}
 		break;
 	default:
 		r = -EINVAL;
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index fa05f67..300ee3b 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
 	union kvm_ioapic_redirect_entry entry;
 	int ret = 1;
 
+	spin_lock(&ioapic->lock);
 	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
 		entry = ioapic->redirtbl[irq];
 		level ^= entry.fields.polarity;
@@ -196,34 +197,44 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
 		}
 		trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0);
 	}
+	spin_unlock(&ioapic->lock);
+
 	return ret;
 }
 
-static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int pin,
-				    int trigger_mode)
+static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
+				     int trigger_mode)
 {
-	union kvm_ioapic_redirect_entry *ent;
+	int i;
+
+	for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+		union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
+
+		if (ent->fields.vector != vector)
+			continue;
 
-	ent = &ioapic->redirtbl[pin];
+		/* ack notifier may call back into ioapic via kvm_set_irq()  */
+		spin_unlock(&ioapic->lock);
+		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
+		spin_lock(&ioapic->lock);
 
-	kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, pin);
+		if (trigger_mode != IOAPIC_LEVEL_TRIG)
+			continue;
 
-	if (trigger_mode == IOAPIC_LEVEL_TRIG) {
 		ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
 		ent->fields.remote_irr = 0;
-		if (!ent->fields.mask && (ioapic->irr & (1 << pin)))
-			ioapic_service(ioapic, pin);
+		if (!ent->fields.mask && (ioapic->irr & (1 << i)))
+			ioapic_service(ioapic, i);
 	}
 }
 
 void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode)
 {
 	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
-	int i;
 
-	for (i = 0; i < IOAPIC_NUM_PINS; i++)
-		if (ioapic->redirtbl[i].fields.vector == vector)
-			__kvm_ioapic_update_eoi(ioapic, i, trigger_mode);
+	spin_lock(&ioapic->lock);
+	__kvm_ioapic_update_eoi(ioapic, vector, trigger_mode);
+	spin_unlock(&ioapic->lock);
 }
 
 static inline struct kvm_ioapic *to_ioapic(struct kvm_io_device *dev)
@@ -248,8 +259,8 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
 	ioapic_debug("addr %lx\n", (unsigned long)addr);
 	ASSERT(!(addr & 0xf));	/* check alignment */
 
-	mutex_lock(&ioapic->kvm->irq_lock);
 	addr &= 0xff;
+	spin_lock(&ioapic->lock);
 	switch (addr) {
 	case IOAPIC_REG_SELECT:
 		result = ioapic->ioregsel;
@@ -263,6 +274,8 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
 		result = 0;
 		break;
 	}
+	spin_unlock(&ioapic->lock);
+
 	switch (len) {
 	case 8:
 		*(u64 *) val = result;
@@ -275,7 +288,6 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
 	default:
 		printk(KERN_WARNING "ioapic: wrong length %d\n", len);
 	}
-	mutex_unlock(&ioapic->kvm->irq_lock);
 	return 0;
 }
 
@@ -291,15 +303,15 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 		     (void*)addr, len, val);
 	ASSERT(!(addr & 0xf));	/* check alignment */
 
-	mutex_lock(&ioapic->kvm->irq_lock);
 	if (len == 4 || len == 8)
 		data = *(u32 *) val;
 	else {
 		printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
-		goto unlock;
+		return 0;
 	}
 
 	addr &= 0xff;
+	spin_lock(&ioapic->lock);
 	switch (addr) {
 	case IOAPIC_REG_SELECT:
 		ioapic->ioregsel = data;
@@ -310,15 +322,14 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 		break;
 #ifdef	CONFIG_IA64
 	case IOAPIC_REG_EOI:
-		kvm_ioapic_update_eoi(ioapic->kvm, data, IOAPIC_LEVEL_TRIG);
+		__kvm_ioapic_update_eoi(ioapic, data, IOAPIC_LEVEL_TRIG);
 		break;
 #endif
 
 	default:
 		break;
 	}
-unlock:
-	mutex_unlock(&ioapic->kvm->irq_lock);
+	spin_unlock(&ioapic->lock);
 	return 0;
 }
 
@@ -347,6 +358,7 @@ int kvm_ioapic_init(struct kvm *kvm)
 	ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL);
 	if (!ioapic)
 		return -ENOMEM;
+	spin_lock_init(&ioapic->lock);
 	kvm->arch.vioapic = ioapic;
 	kvm_ioapic_reset(ioapic);
 	kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 7080b71..557107e 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -44,6 +44,7 @@ struct kvm_ioapic {
 	struct kvm_io_device dev;
 	struct kvm *kvm;
 	void (*ack_notifier)(void *opaque, int irq);
+	spinlock_t lock;
 };
 
 #ifdef DEBUG
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 09/10] Drop kvm->irq_lock lock.
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
                   ` (7 preceding siblings ...)
  2009-07-14 14:30 ` [PATCH 08/10] Move IO APIC to its own lock Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-14 14:30 ` [PATCH 10/10] Change irq routing table to use gsi indexed array Gleb Natapov
  9 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti

The only thing it protects now is interrupt injection into the lapic,
and that can work locklessly. Even now, with kvm->irq_lock in place,
access to the lapic is not entirely serialized, since vcpu accesses
don't take kvm->irq_lock.
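
For intuition only (this is not code from the patch): the shared write
in the delivery path degenerates to an atomic bitop on the destination
lapic's IRR, so concurrent injectors at worst see the vector already
pending and report the interrupt as coalesced. A simplified sketch with
made-up names:

  /* illustrative only; needs <linux/bitops.h> */
  static int sketch_accept_irq(unsigned long *irr, int vector)
  {
          if (test_and_set_bit(vector, irr))
                  return 0;       /* already pending -> coalesced */
          return 1;               /* newly injected */
  }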

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 arch/ia64/kvm/kvm-ia64.c |    2 --
 arch/x86/kvm/i8254.c     |    2 --
 arch/x86/kvm/lapic.c     |    2 --
 arch/x86/kvm/x86.c       |    2 --
 include/linux/kvm_host.h |    1 -
 virt/kvm/eventfd.c       |    2 --
 virt/kvm/irq_comm.c      |    6 +-----
 virt/kvm/kvm_main.c      |    5 +----
 8 files changed, 2 insertions(+), 20 deletions(-)

diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index dd7ef2d..8f1fc3a 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -998,10 +998,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			goto out;
 		if (irqchip_in_kernel(kvm)) {
 			__s32 status;
-			mutex_lock(&kvm->irq_lock);
 			status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
 				    irq_event.irq, irq_event.level);
-			mutex_unlock(&kvm->irq_lock);
 			if (ioctl == KVM_IRQ_LINE_STATUS) {
 				irq_event.status = status;
 				if (copy_to_user(argp, &irq_event,
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 59021da..211f524 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -690,10 +690,8 @@ static void __inject_pit_timer_intr(struct kvm *kvm)
 	struct kvm_vcpu *vcpu;
 	int i;
 
-	mutex_lock(&kvm->irq_lock);
 	kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
 	kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
-	mutex_unlock(&kvm->irq_lock);
 
 	/*
 	 * Provides NMI watchdog support via Virtual Wire mode.
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 61ffcfc..0380107 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -501,9 +501,7 @@ static void apic_send_ipi(struct kvm_lapic *apic)
 		   irq.trig_mode, irq.level, irq.dest_mode, irq.delivery_mode,
 		   irq.vector);
 
-	mutex_lock(&apic->vcpu->kvm->irq_lock);
 	kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq);
-	mutex_unlock(&apic->vcpu->kvm->irq_lock);
 }
 
 static u32 apic_get_tmcct(struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index de99b84..40adac5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2261,10 +2261,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			goto out;
 		if (irqchip_in_kernel(kvm)) {
 			__s32 status;
-			mutex_lock(&kvm->irq_lock);
 			status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
 					irq_event.irq, irq_event.level);
-			mutex_unlock(&kvm->irq_lock);
 			if (ioctl == KVM_IRQ_LINE_STATUS) {
 				irq_event.status = status;
 				if (copy_to_user(argp, &irq_event,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4eb5c85..aa64d0d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -164,7 +164,6 @@ struct kvm {
 	struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
 #endif
 
-	struct mutex irq_lock;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
 	struct kvm_kernel_irq_routing_entry *irq_routing;
 	spinlock_t irq_routing_lock;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 99017e8..95954ad 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -61,10 +61,8 @@ irqfd_inject(struct work_struct *work)
 	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
 
-	mutex_lock(&kvm->irq_lock);
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
-	mutex_unlock(&kvm->irq_lock);
 }
 
 /*
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 59c1cde..c54a28b 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -63,8 +63,6 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
 	int i, r = -1;
 	struct kvm_vcpu *vcpu, *lowest = NULL;
 
-	WARN_ON(!mutex_is_locked(&kvm->irq_lock));
-
 	if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
 			kvm_is_dm_lowest_prio(irq))
 		printk(KERN_INFO "kvm: apic: phys broadcast and lowest prio\n");
@@ -116,7 +114,7 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 	return kvm_irq_delivery_to_apic(kvm, NULL, &irq);
 }
 
-/* This should be called with the kvm->irq_lock mutex held
+/*
  * Return value:
  *  < 0   Interrupt was ignored (masked or not delivered for other reasons)
  *  = 0   Interrupt was coalesced (previous irq is still pending)
@@ -130,8 +128,6 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
 
 	trace_kvm_set_irq(irq, level, irq_source_id);
 
-	WARN_ON(!mutex_is_locked(&kvm->irq_lock));
-
 	if (irq < KVM_IOAPIC_NUM_PINS) {
 		irq_state = (unsigned long *)&kvm->arch.irq_states[irq];
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 9553444..7ac9b75 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -68,7 +68,7 @@ MODULE_LICENSE("GPL");
 /*
  * Ordering of locks:
  *
- * 		kvm->slots_lock --> kvm->lock --> kvm->irq_lock
+ * 		kvm->slots_lock --> kvm->lock
  */
 
 DEFINE_SPINLOCK(kvm_lock);
@@ -137,7 +137,6 @@ static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
 				    interrupt_work);
 	kvm = assigned_dev->kvm;
 
-	mutex_lock(&kvm->irq_lock);
 	spin_lock_irq(&assigned_dev->assigned_dev_lock);
 	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
 		struct kvm_guest_msix_entry *guest_entries =
@@ -156,7 +155,6 @@ static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
 			    assigned_dev->guest_irq, 1);
 
 	spin_unlock_irq(&assigned_dev->assigned_dev_lock);
-	mutex_unlock(&assigned_dev->kvm->irq_lock);
 }
 
 static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
@@ -983,7 +981,6 @@ static struct kvm *kvm_create_vm(void)
 	kvm_io_bus_init(&kvm->pio_bus);
 	kvm_eventfd_init(kvm);
 	mutex_init(&kvm->lock);
-	mutex_init(&kvm->irq_lock);
 	kvm_io_bus_init(&kvm->mmio_bus);
 	init_rwsem(&kvm->slots_lock);
 	atomic_set(&kvm->users_count, 1);
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 10/10] Change irq routing table to use gsi indexed array.
  2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
                   ` (8 preceding siblings ...)
  2009-07-14 14:30 ` [PATCH 09/10] Drop kvm->irq_lock lock Gleb Natapov
@ 2009-07-14 14:30 ` Gleb Natapov
  2009-07-15 18:18   ` Marcelo Tosatti
  2009-07-15 19:17   ` Michael S. Tsirkin
  9 siblings, 2 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-14 14:30 UTC (permalink / raw)
  To: kvm; +Cc: mtosatti

Use a gsi-indexed array instead of scanning all entries on each
interrupt injection. Also maintain a back mapping from irqchip/pin to
gsi to speed up interrupt acknowledgment notifications.
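
With the table in place, the two hot lookups become direct indexing
(sketch, simplified from the hunks below):

  rcu_read_lock();
  irq_rt = rcu_dereference(kvm->irq_routing);
  /* injection: gsi indexes straight into its list of entries */
  hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
          e->set(e, kvm, sig_level);
  /* ack notification: irqchip/pin back-maps to the gsi */
  gsi = irq_rt->chip[irqchip][pin];
  rcu_read_unlock();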

Signed-off-by: Gleb Natapov <gleb@redhat.com>
---
 include/linux/kvm_host.h |   11 ++++++-
 virt/kvm/irq_comm.c      |   62 ++++++++++++++++++++++++++++-----------------
 2 files changed, 47 insertions(+), 26 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index aa64d0d..ae6cbf1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -128,7 +128,14 @@ struct kvm_kernel_irq_routing_entry {
 		} irqchip;
 		struct msi_msg msi;
 	};
-	struct list_head link;
+	struct hlist_node link;
+};
+
+struct kvm_irq_routing_table {
+	int chip[3][KVM_IOAPIC_NUM_PINS];
+	struct kvm_kernel_irq_routing_entry *rt_entries;
+	u32 max_gsi;
+	struct hlist_head map[0];
 };
 
 struct kvm {
@@ -165,7 +172,7 @@ struct kvm {
 #endif
 
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
-	struct kvm_kernel_irq_routing_entry *irq_routing;
+	struct kvm_irq_routing_table *irq_routing;
 	spinlock_t irq_routing_lock;
 	struct hlist_head mask_notifier_list;
 	struct hlist_head irq_ack_notifier_list;
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index c54a28b..da643d4 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -125,6 +125,8 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
 	struct kvm_kernel_irq_routing_entry *e;
 	unsigned long *irq_state, sig_level;
 	int ret = -1;
+	struct kvm_irq_routing_table *irq_rt;
+	struct hlist_node *n;
 
 	trace_kvm_set_irq(irq, level, irq_source_id);
 
@@ -147,14 +149,13 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
 	 * writes to the unused one.
 	 */
 	rcu_read_lock();
-	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
-		if (e->gsi == irq) {
-			int r = e->set(e, kvm, sig_level);
-			if (r < 0)
-				continue;
+	irq_rt = rcu_dereference(kvm->irq_routing);
+	hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
+		int r = e->set(e, kvm, sig_level);
+		if (r < 0)
+			continue;
 
-			ret = r + ((ret < 0) ? 0 : ret);
-		}
+		ret = r + ((ret < 0) ? 0 : ret);
 	}
 	rcu_read_unlock();
 	return ret;
@@ -162,21 +163,16 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
 
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
-	struct kvm_kernel_irq_routing_entry *e;
 	struct kvm_irq_ack_notifier *kian;
 	struct hlist_node *n;
-	unsigned gsi = pin;
+	unsigned gsi;
 
 	trace_kvm_ack_irq(irqchip, pin);
 
 	rcu_read_lock();
-	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
-		if (e->irqchip.irqchip == irqchip &&
-		    e->irqchip.pin == pin) {
-			gsi = e->gsi;
-			break;
-		}
-	}
+	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	if (gsi == -1)
+		gsi = pin;
 
 	hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
 		if (kian->gsi == gsi)
@@ -277,7 +273,8 @@ void kvm_free_irq_routing(struct kvm *kvm)
 	kfree(kvm->irq_routing);
 }
 
-static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
+static int setup_routing_entry(struct kvm_irq_routing_table *rt,
+			       struct kvm_kernel_irq_routing_entry *e,
 			       const struct kvm_irq_routing_entry *ue)
 {
 	int r = -EINVAL;
@@ -303,6 +300,7 @@ static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
 		}
 		e->irqchip.irqchip = ue->u.irqchip.irqchip;
 		e->irqchip.pin = ue->u.irqchip.pin + delta;
+		rt->chip[ue->u.irqchip.irqchip][e->irqchip.pin] = ue->gsi;
 		break;
 	case KVM_IRQ_ROUTING_MSI:
 		e->set = kvm_set_msi;
@@ -313,6 +311,8 @@ static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
 	default:
 		goto out;
 	}
+
+	hlist_add_head(&e->link, &rt->map[e->gsi]);
 	r = 0;
 out:
 	return r;
@@ -324,23 +324,37 @@ int kvm_set_irq_routing(struct kvm *kvm,
 			unsigned nr,
 			unsigned flags)
 {
-	struct kvm_kernel_irq_routing_entry *new, *old;
-	unsigned i;
+	struct kvm_irq_routing_table *new, *old;
+	u32 i, j, max_gsi = 0;
 	int r;
 
-	/* last elemet is left zeored and indicates the end of the array */
-	new = kzalloc(sizeof(*new) * (nr + 1), GFP_KERNEL);
+	for (i = 0; i < nr; ++i) {
+		if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
+			return -EINVAL;
+		max_gsi = max(max_gsi, ue[i].gsi);
+	}
+
+	max_gsi += 1;
+
+	new = kzalloc(sizeof(*new) + (max_gsi * sizeof(struct hlist_head)) +
+		      (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
+		      GFP_KERNEL);
 
 	if (!new)
 		return -ENOMEM;
 
+	new->rt_entries = (void *)&new->map[max_gsi];
+
+	new->max_gsi = max_gsi;
+	for (i = 0; i < 3; i++)
+		for (j = 0; j < KVM_IOAPIC_NUM_PINS; j++)
+			new->chip[i][j] = -1;
+
 	for (i = 0; i < nr; ++i) {
 		r = -EINVAL;
-		if (ue->gsi >= KVM_MAX_IRQ_ROUTES)
-			goto out;
 		if (ue->flags)
 			goto out;
-		r = setup_routing_entry(new + i, ue);
+		r = setup_routing_entry(new, &new->rt_entries[i], ue);
 		if (r)
 			goto out;
 		++ue;
-- 
1.6.2.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-14 14:30 ` [PATCH 08/10] Move IO APIC to its own lock Gleb Natapov
@ 2009-07-15 17:57   ` Marcelo Tosatti
  2009-07-15 20:48     ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2009-07-15 17:57 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm

On Tue, Jul 14, 2009 at 05:30:43PM +0300, Gleb Natapov wrote:
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  arch/ia64/kvm/kvm-ia64.c |   27 ++++++++++++++++++------
>  arch/x86/kvm/lapic.c     |    5 +---
>  arch/x86/kvm/x86.c       |   30 ++++++++++++++++++---------
>  virt/kvm/ioapic.c        |   50 ++++++++++++++++++++++++++++-----------------
>  virt/kvm/ioapic.h        |    1 +
>  5 files changed, 73 insertions(+), 40 deletions(-)
> 
> diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
> index 0ad09f0..dd7ef2d 100644
> --- a/arch/ia64/kvm/kvm-ia64.c
> +++ b/arch/ia64/kvm/kvm-ia64.c
> @@ -850,9 +850,16 @@ static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm,
>  
>  	r = 0;
>  	switch (chip->chip_id) {
> -	case KVM_IRQCHIP_IOAPIC:
> -		memcpy(&chip->chip.ioapic, ioapic_irqchip(kvm),
> -				sizeof(struct kvm_ioapic_state));
> +	case KVM_IRQCHIP_IOAPIC: {
> +		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
> +		if (ioapic) {
> +			spin_lock(&ioapic->lock);
> +			memcpy(&chip->chip.ioapic, ioapic,
> +			       sizeof(struct kvm_ioapic_state));
> +			spin_unlock(&ioapic->lock);
> +		} else
> +			r = -EINVAL;
> +	}
>  		break;
>  	default:
>  		r = -EINVAL;
> @@ -867,10 +874,16 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
>  
>  	r = 0;
>  	switch (chip->chip_id) {
> -	case KVM_IRQCHIP_IOAPIC:
> -		memcpy(ioapic_irqchip(kvm),
> -				&chip->chip.ioapic,
> -				sizeof(struct kvm_ioapic_state));
> +	case KVM_IRQCHIP_IOAPIC: {
> +		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
> +		if (ioapic) {
> +			spin_lock(&ioapic->lock);
> +			memcpy(ioapic, &chip->chip.ioapic,
> +			       sizeof(struct kvm_ioapic_state));
> +			spin_unlock(&ioapic->lock);
> +		} else
> +			r = -EINVAL;
> +	}
>  		break;
>  	default:
>  		r = -EINVAL;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e2e2849..61ffcfc 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -471,11 +471,8 @@ static void apic_set_eoi(struct kvm_lapic *apic)
>  		trigger_mode = IOAPIC_LEVEL_TRIG;
>  	else
>  		trigger_mode = IOAPIC_EDGE_TRIG;
> -	if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)) {
> -		mutex_lock(&apic->vcpu->kvm->irq_lock);
> +	if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
>  		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> -		mutex_unlock(&apic->vcpu->kvm->irq_lock);
> -	}
>  }
>  
>  static void apic_send_ipi(struct kvm_lapic *apic)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 48567fa..de99b84 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2010,10 +2010,16 @@ static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
>  			&pic_irqchip(kvm)->pics[1],
>  			sizeof(struct kvm_pic_state));
>  		break;
> -	case KVM_IRQCHIP_IOAPIC:
> -		memcpy(&chip->chip.ioapic,
> -			ioapic_irqchip(kvm),
> -			sizeof(struct kvm_ioapic_state));
> +	case KVM_IRQCHIP_IOAPIC: {
> +		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
> +		if (ioapic) {
> +			spin_lock(&ioapic->lock);
> +			memcpy(&chip->chip.ioapic, ioapic,
> +			       sizeof(struct kvm_ioapic_state));
> +			spin_unlock(&ioapic->lock);
> +		} else
> +			r = -EINVAL;
> +	}
>  		break;
>  	default:
>  		r = -EINVAL;
> @@ -2042,12 +2048,16 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
>  			sizeof(struct kvm_pic_state));
>  		spin_unlock(&pic_irqchip(kvm)->lock);
>  		break;
> -	case KVM_IRQCHIP_IOAPIC:
> -		mutex_lock(&kvm->irq_lock);
> -		memcpy(ioapic_irqchip(kvm),
> -			&chip->chip.ioapic,
> -			sizeof(struct kvm_ioapic_state));
> -		mutex_unlock(&kvm->irq_lock);
> +	case KVM_IRQCHIP_IOAPIC: {
> +		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
> +		if (ioapic) {
> +			spin_lock(&ioapic->lock);
> +			memcpy(ioapic, &chip->chip.ioapic,
> +			       sizeof(struct kvm_ioapic_state));
> +			spin_unlock(&ioapic->lock);
> +		} else
> +			r = -EINVAL;
> +	}
>  		break;
>  	default:
>  		r = -EINVAL;
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index fa05f67..300ee3b 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
>  	union kvm_ioapic_redirect_entry entry;
>  	int ret = 1;
>  
> +	spin_lock(&ioapic->lock);
>  	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
>  		entry = ioapic->redirtbl[irq];
>  		level ^= entry.fields.polarity;
> @@ -196,34 +197,44 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
>  		}
>  		trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0);
>  	}
> +	spin_unlock(&ioapic->lock);
> +
>  	return ret;
>  }
>  
> -static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int pin,
> -				    int trigger_mode)
> +static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
> +				     int trigger_mode)
>  {
> -	union kvm_ioapic_redirect_entry *ent;
> +	int i;
> +
> +	for (i = 0; i < IOAPIC_NUM_PINS; i++) {
> +		union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
> +
> +		if (ent->fields.vector != vector)
> +			continue;
>  
> -	ent = &ioapic->redirtbl[pin];
> +		/* ack notifier may call back into ioapic via kvm_set_irq()  */
> +		spin_unlock(&ioapic->lock);
> +		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> +		spin_lock(&ioapic->lock);

While traversing the IOAPIC pins you drop the lock and acquire it again,
which is error prone: what if the redirection table is changed in
between, for example?
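
For example (illustrative interleaving):

  CPU0: __kvm_ioapic_update_eoi()      CPU1: ioapic_mmio_write()
  ent = &ioapic->redirtbl[i];
  spin_unlock(&ioapic->lock);
                                       spin_lock(&ioapic->lock);
                                       /* guest rewrites redirtbl[i]:
                                          new vector/trig_mode/mask */
                                       spin_unlock(&ioapic->lock);
  kvm_notify_acked_irq(...);
  spin_lock(&ioapic->lock);
  ent->fields.remote_irr = 0;          /* acts on a retargeted entry */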

Also, irq_lock was held during ack and mask notifier callbacks, so they
(the callbacks) did not have to worry about concurrency.

You questioned the purpose of irq_lock earlier, and that's what it is:
synchronization between the pic and the ioapic, which blur together at
the irq notifiers.

Using RCU to synchronize ack/mask notifier list walks versus list
addition is good, but I'm not entirely sure that smaller locks like you
are proposing are a benefit long term (things become much trickier).

Maybe it is the only way forward (and so you push this complexity to
the ack/mask notifiers), but I think it can be avoided until it's
proven that there is contention on irq_lock (which is reduced by the
faster search).


> -	kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, pin);
> +		if (trigger_mode != IOAPIC_LEVEL_TRIG)
> +			continue;
>  
> -	if (trigger_mode == IOAPIC_LEVEL_TRIG) {
>  		ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
>  		ent->fields.remote_irr = 0;
> -		if (!ent->fields.mask && (ioapic->irr & (1 << pin)))
> -			ioapic_service(ioapic, pin);
> +		if (!ent->fields.mask && (ioapic->irr & (1 << i)))
> +			ioapic_service(ioapic, i);
>  	}
>  }
>  
>  void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int trigger_mode)
>  {
>  	struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> -	int i;
>  
> -	for (i = 0; i < IOAPIC_NUM_PINS; i++)
> -		if (ioapic->redirtbl[i].fields.vector == vector)
> -			__kvm_ioapic_update_eoi(ioapic, i, trigger_mode);
> +	spin_lock(&ioapic->lock);
> +	__kvm_ioapic_update_eoi(ioapic, vector, trigger_mode);
> +	spin_unlock(&ioapic->lock);
>  }
>  
>  static inline struct kvm_ioapic *to_ioapic(struct kvm_io_device *dev)
> @@ -248,8 +259,8 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
>  	ioapic_debug("addr %lx\n", (unsigned long)addr);
>  	ASSERT(!(addr & 0xf));	/* check alignment */
>  
> -	mutex_lock(&ioapic->kvm->irq_lock);
>  	addr &= 0xff;
> +	spin_lock(&ioapic->lock);
>  	switch (addr) {
>  	case IOAPIC_REG_SELECT:
>  		result = ioapic->ioregsel;
> @@ -263,6 +274,8 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
>  		result = 0;
>  		break;
>  	}
> +	spin_unlock(&ioapic->lock);
> +
>  	switch (len) {
>  	case 8:
>  		*(u64 *) val = result;
> @@ -275,7 +288,6 @@ static int ioapic_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
>  	default:
>  		printk(KERN_WARNING "ioapic: wrong length %d\n", len);
>  	}
> -	mutex_unlock(&ioapic->kvm->irq_lock);
>  	return 0;
>  }
>  
> @@ -291,15 +303,15 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
>  		     (void*)addr, len, val);
>  	ASSERT(!(addr & 0xf));	/* check alignment */
>  
> -	mutex_lock(&ioapic->kvm->irq_lock);
>  	if (len == 4 || len == 8)
>  		data = *(u32 *) val;
>  	else {
>  		printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
> -		goto unlock;
> +		return 0;
>  	}
>  
>  	addr &= 0xff;
> +	spin_lock(&ioapic->lock);
>  	switch (addr) {
>  	case IOAPIC_REG_SELECT:
>  		ioapic->ioregsel = data;
> @@ -310,15 +322,14 @@ static int ioapic_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
>  		break;
>  #ifdef	CONFIG_IA64
>  	case IOAPIC_REG_EOI:
> -		kvm_ioapic_update_eoi(ioapic->kvm, data, IOAPIC_LEVEL_TRIG);
> +		__kvm_ioapic_update_eoi(ioapic, data, IOAPIC_LEVEL_TRIG);
>  		break;
>  #endif
>  
>  	default:
>  		break;
>  	}
> -unlock:
> -	mutex_unlock(&ioapic->kvm->irq_lock);
> +	spin_unlock(&ioapic->lock);
>  	return 0;
>  }
>  
> @@ -347,6 +358,7 @@ int kvm_ioapic_init(struct kvm *kvm)
>  	ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL);
>  	if (!ioapic)
>  		return -ENOMEM;
> +	spin_lock_init(&ioapic->lock);
>  	kvm->arch.vioapic = ioapic;
>  	kvm_ioapic_reset(ioapic);
>  	kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
> diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
> index 7080b71..557107e 100644
> --- a/virt/kvm/ioapic.h
> +++ b/virt/kvm/ioapic.h
> @@ -44,6 +44,7 @@ struct kvm_ioapic {
>  	struct kvm_io_device dev;
>  	struct kvm *kvm;
>  	void (*ack_notifier)(void *opaque, int irq);
> +	spinlock_t lock;
>  };
>  
>  #ifdef DEBUG
> -- 
> 1.6.2.1
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 10/10] Change irq routing table to use gsi indexed array.
  2009-07-14 14:30 ` [PATCH 10/10] Change irq routing table to use gsi indexed array Gleb Natapov
@ 2009-07-15 18:18   ` Marcelo Tosatti
  2009-07-15 20:52     ` Gleb Natapov
  2009-07-15 19:17   ` Michael S. Tsirkin
  1 sibling, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2009-07-15 18:18 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm

On Tue, Jul 14, 2009 at 05:30:45PM +0300, Gleb Natapov wrote:
> Use gsi indexed array instead of scanning all entries on each interrupt
> injection. Also maintain back mapping from irqchip/pin to gsi to speedup
> interrupt acknowledgment notifications.
> 
> Signed-off-by: Gleb Natapov <gleb@redhat.com>
> ---
>  include/linux/kvm_host.h |   11 ++++++-
>  virt/kvm/irq_comm.c      |   62 ++++++++++++++++++++++++++++-----------------
>  2 files changed, 47 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index aa64d0d..ae6cbf1 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -128,7 +128,14 @@ struct kvm_kernel_irq_routing_entry {
>  		} irqchip;
>  		struct msi_msg msi;
>  	};
> -	struct list_head link;
> +	struct hlist_node link;
> +};
> +
> +struct kvm_irq_routing_table {
> +	int chip[3][KVM_IOAPIC_NUM_PINS];
> +	struct kvm_kernel_irq_routing_entry *rt_entries;
> +	u32 max_gsi;
> +	struct hlist_head map[0];
>  };
>  
>  struct kvm {
> @@ -165,7 +172,7 @@ struct kvm {
>  #endif
>  
>  #ifdef CONFIG_HAVE_KVM_IRQCHIP
> -	struct kvm_kernel_irq_routing_entry *irq_routing;
> +	struct kvm_irq_routing_table *irq_routing;
>  	spinlock_t irq_routing_lock;
>  	struct hlist_head mask_notifier_list;
>  	struct hlist_head irq_ack_notifier_list;
> diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> index c54a28b..da643d4 100644
> --- a/virt/kvm/irq_comm.c
> +++ b/virt/kvm/irq_comm.c
> @@ -125,6 +125,8 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
>  	struct kvm_kernel_irq_routing_entry *e;
>  	unsigned long *irq_state, sig_level;
>  	int ret = -1;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct hlist_node *n;
>  
>  	trace_kvm_set_irq(irq, level, irq_source_id);
>  
> @@ -147,14 +149,13 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
>  	 * writes to the unused one.
>  	 */
>  	rcu_read_lock();
> -	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
> -		if (e->gsi == irq) {
> -			int r = e->set(e, kvm, sig_level);
> -			if (r < 0)
> -				continue;
> +	irq_rt = rcu_dereference(kvm->irq_routing);
> +	hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
> +		int r = e->set(e, kvm, sig_level);
> +		if (r < 0)
> +			continue;
>  
> -			ret = r + ((ret < 0) ? 0 : ret);
> -		}
> +		ret = r + ((ret < 0) ? 0 : ret);
>  	}
>  	rcu_read_unlock();
>  	return ret;
> @@ -162,21 +163,16 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
>  
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>  {
> -	struct kvm_kernel_irq_routing_entry *e;
>  	struct kvm_irq_ack_notifier *kian;
>  	struct hlist_node *n;
> -	unsigned gsi = pin;
> +	unsigned gsi;
>  
>  	trace_kvm_ack_irq(irqchip, pin);
>  
>  	rcu_read_lock();
> -	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
> -		if (e->irqchip.irqchip == irqchip &&
> -		    e->irqchip.pin == pin) {
> -			gsi = e->gsi;
> -			break;
> -		}
> -	}
> +	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
> +	if (gsi == -1)
> +		gsi = pin;
>  
>  	hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
>  		if (kian->gsi == gsi)
> @@ -277,7 +273,8 @@ void kvm_free_irq_routing(struct kvm *kvm)
>  	kfree(kvm->irq_routing);
>  }
>  
> -static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
> +static int setup_routing_entry(struct kvm_irq_routing_table *rt,
> +			       struct kvm_kernel_irq_routing_entry *e,
>  			       const struct kvm_irq_routing_entry *ue)
>  {
>  	int r = -EINVAL;
> @@ -303,6 +300,7 @@ static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
>  		}
>  		e->irqchip.irqchip = ue->u.irqchip.irqchip;
>  		e->irqchip.pin = ue->u.irqchip.pin + delta;
> +		rt->chip[ue->u.irqchip.irqchip][e->irqchip.pin] = ue->gsi;
>  		break;
>  	case KVM_IRQ_ROUTING_MSI:
>  		e->set = kvm_set_msi;
> @@ -313,6 +311,8 @@ static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
>  	default:
>  		goto out;
>  	}
> +
> +	hlist_add_head(&e->link, &rt->map[e->gsi]);
>  	r = 0;
>  out:
>  	return r;
> @@ -324,23 +324,37 @@ int kvm_set_irq_routing(struct kvm *kvm,
>  			unsigned nr,
>  			unsigned flags)
>  {
> -	struct kvm_kernel_irq_routing_entry *new, *old;
> -	unsigned i;
> +	struct kvm_irq_routing_table *new, *old;
> +	u32 i, j, max_gsi = 0;
>  	int r;
>  
> -	/* last elemet is left zeored and indicates the end of the array */
> -	new = kzalloc(sizeof(*new) * (nr + 1), GFP_KERNEL);
> +	for (i = 0; i < nr; ++i) {
> +		if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
> +			return -EINVAL;
> +		max_gsi = max(max_gsi, ue[i].gsi);
> +	}
> +
> +	max_gsi += 1;
> +
> +	new = kzalloc(sizeof(*new) + (max_gsi * sizeof(struct hlist_head)) +
> +		      (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
> +		      GFP_KERNEL);

Why don't you allocate the hlist_head's and the routing entries
separately?

>  
>  	if (!new)
>  		return -ENOMEM;
>  
> +	new->rt_entries = (void *)&new->map[max_gsi];
> +
> +	new->max_gsi = max_gsi;
> +	for (i = 0; i < 3; i++)
> +		for (j = 0; j < KVM_IOAPIC_NUM_PINS; j++)
> +			new->chip[i][j] = -1;
> +

Should use something else instead of 3. Maybe make it dynamic, for
multiple IOAPIC support (but you can argue that's another problem).

>  	for (i = 0; i < nr; ++i) {
>  		r = -EINVAL;
> -		if (ue->gsi >= KVM_MAX_IRQ_ROUTES)
> -			goto out;
>  		if (ue->flags)
>  			goto out;
> -		r = setup_routing_entry(new + i, ue);
> +		r = setup_routing_entry(new, &new->rt_entries[i], ue);
>  		if (r)
>  			goto out;
>  		++ue;
> -- 
> 1.6.2.1
> 

* Re: [PATCH 10/10] Change irq routing table to use gsi indexed array.
  2009-07-14 14:30 ` [PATCH 10/10] Change irq routing table to use gsi indexed array Gleb Natapov
  2009-07-15 18:18   ` Marcelo Tosatti
@ 2009-07-15 19:17   ` Michael S. Tsirkin
  2009-07-15 20:48     ` Gleb Natapov
  1 sibling, 1 reply; 26+ messages in thread
From: Michael S. Tsirkin @ 2009-07-15 19:17 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm, mtosatti

On Tue, Jul 14, 2009 at 05:30:45PM +0300, Gleb Natapov wrote:
> @@ -147,14 +149,13 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
>  	 * writes to the unused one.
>  	 */
>  	rcu_read_lock();
> -	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
> -		if (e->gsi == irq) {
> -			int r = e->set(e, kvm, sig_level);
> -			if (r < 0)
> -				continue;
> +	irq_rt = rcu_dereference(kvm->irq_routing);
> +	hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {

Don't you need to range-check irq? E.g. with irqfd, gsi is
controlled by guest.

> +		int r = e->set(e, kvm, sig_level);
> +		if (r < 0)
> +			continue;
>  
> -			ret = r + ((ret < 0) ? 0 : ret);
> -		}
> +		ret = r + ((ret < 0) ? 0 : ret);
>  	}
>  	rcu_read_unlock();
>  	return ret;
> @@ -162,21 +163,16 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
>  
>  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
>  {
> -	struct kvm_kernel_irq_routing_entry *e;
>  	struct kvm_irq_ack_notifier *kian;
>  	struct hlist_node *n;
> -	unsigned gsi = pin;
> +	unsigned gsi;
>  
>  	trace_kvm_ack_irq(irqchip, pin);
>  
>  	rcu_read_lock();
> -	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
> -		if (e->irqchip.irqchip == irqchip &&
> -		    e->irqchip.pin == pin) {
> -			gsi = e->gsi;
> -			break;
> -		}
> -	}
> +	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];

And possibly here as well. Can guest control pin?

> +	if (gsi == -1)
> +		gsi = pin;
>  

* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-15 17:57   ` Marcelo Tosatti
@ 2009-07-15 20:48     ` Gleb Natapov
  2009-07-15 21:38       ` Marcelo Tosatti
  0 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2009-07-15 20:48 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On Wed, Jul 15, 2009 at 02:57:59PM -0300, Marcelo Tosatti wrote:
> On Tue, Jul 14, 2009 at 05:30:43PM +0300, Gleb Natapov wrote:
> > 
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > ---
> >  arch/ia64/kvm/kvm-ia64.c |   27 ++++++++++++++++++------
> >  arch/x86/kvm/lapic.c     |    5 +---
> >  arch/x86/kvm/x86.c       |   30 ++++++++++++++++++---------
> >  virt/kvm/ioapic.c        |   50 ++++++++++++++++++++++++++++-----------------
> >  virt/kvm/ioapic.h        |    1 +
> >  5 files changed, 73 insertions(+), 40 deletions(-)
> > 
> > diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
> > index 0ad09f0..dd7ef2d 100644
> > --- a/arch/ia64/kvm/kvm-ia64.c
> > +++ b/arch/ia64/kvm/kvm-ia64.c
> > @@ -850,9 +850,16 @@ static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm,
> >  
> >  	r = 0;
> >  	switch (chip->chip_id) {
> > -	case KVM_IRQCHIP_IOAPIC:
> > -		memcpy(&chip->chip.ioapic, ioapic_irqchip(kvm),
> > -				sizeof(struct kvm_ioapic_state));
> > +	case KVM_IRQCHIP_IOAPIC: {
> > +		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
> > +		if (ioapic) {
> > +			spin_lock(&ioapic->lock);
> > +			memcpy(&chip->chip.ioapic, ioapic,
> > +			       sizeof(struct kvm_ioapic_state));
> > +			spin_unlock(&ioapic->lock);
> > +		} else
> > +			r = -EINVAL;
> > +	}
> >  		break;
> >  	default:
> >  		r = -EINVAL;
> > @@ -867,10 +874,16 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
> >  
> >  	r = 0;
> >  	switch (chip->chip_id) {
> > -	case KVM_IRQCHIP_IOAPIC:
> > -		memcpy(ioapic_irqchip(kvm),
> > -				&chip->chip.ioapic,
> > -				sizeof(struct kvm_ioapic_state));
> > +	case KVM_IRQCHIP_IOAPIC: {
> > +		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
> > +		if (ioapic) {
> > +			spin_lock(&ioapic->lock);
> > +			memcpy(ioapic, &chip->chip.ioapic,
> > +			       sizeof(struct kvm_ioapic_state));
> > +			spin_unlock(&ioapic->lock);
> > +		} else
> > +			r = -EINVAL;
> > +	}
> >  		break;
> >  	default:
> >  		r = -EINVAL;
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index e2e2849..61ffcfc 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -471,11 +471,8 @@ static void apic_set_eoi(struct kvm_lapic *apic)
> >  		trigger_mode = IOAPIC_LEVEL_TRIG;
> >  	else
> >  		trigger_mode = IOAPIC_EDGE_TRIG;
> > -	if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)) {
> > -		mutex_lock(&apic->vcpu->kvm->irq_lock);
> > +	if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
> >  		kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
> > -		mutex_unlock(&apic->vcpu->kvm->irq_lock);
> > -	}
> >  }
> >  
> >  static void apic_send_ipi(struct kvm_lapic *apic)
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 48567fa..de99b84 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2010,10 +2010,16 @@ static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
> >  			&pic_irqchip(kvm)->pics[1],
> >  			sizeof(struct kvm_pic_state));
> >  		break;
> > -	case KVM_IRQCHIP_IOAPIC:
> > -		memcpy(&chip->chip.ioapic,
> > -			ioapic_irqchip(kvm),
> > -			sizeof(struct kvm_ioapic_state));
> > +	case KVM_IRQCHIP_IOAPIC: {
> > +		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
> > +		if (ioapic) {
> > +			spin_lock(&ioapic->lock);
> > +			memcpy(&chip->chip.ioapic, ioapic,
> > +			       sizeof(struct kvm_ioapic_state));
> > +			spin_unlock(&ioapic->lock);
> > +		} else
> > +			r = -EINVAL;
> > +	}
> >  		break;
> >  	default:
> >  		r = -EINVAL;
> > @@ -2042,12 +2048,16 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
> >  			sizeof(struct kvm_pic_state));
> >  		spin_unlock(&pic_irqchip(kvm)->lock);
> >  		break;
> > -	case KVM_IRQCHIP_IOAPIC:
> > -		mutex_lock(&kvm->irq_lock);
> > -		memcpy(ioapic_irqchip(kvm),
> > -			&chip->chip.ioapic,
> > -			sizeof(struct kvm_ioapic_state));
> > -		mutex_unlock(&kvm->irq_lock);
> > +	case KVM_IRQCHIP_IOAPIC: {
> > +		struct kvm_ioapic *ioapic = ioapic_irqchip(kvm);
> > +		if (ioapic) {
> > +			spin_lock(&ioapic->lock);
> > +			memcpy(ioapic, &chip->chip.ioapic,
> > +			       sizeof(struct kvm_ioapic_state));
> > +			spin_unlock(&ioapic->lock);
> > +		} else
> > +			r = -EINVAL;
> > +	}
> >  		break;
> >  	default:
> >  		r = -EINVAL;
> > diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> > index fa05f67..300ee3b 100644
> > --- a/virt/kvm/ioapic.c
> > +++ b/virt/kvm/ioapic.c
> > @@ -182,6 +182,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
> >  	union kvm_ioapic_redirect_entry entry;
> >  	int ret = 1;
> >  
> > +	spin_lock(&ioapic->lock);
> >  	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
> >  		entry = ioapic->redirtbl[irq];
> >  		level ^= entry.fields.polarity;
> > @@ -196,34 +197,44 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
> >  		}
> >  		trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0);
> >  	}
> > +	spin_unlock(&ioapic->lock);
> > +
> >  	return ret;
> >  }
> >  
> > -static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int pin,
> > -				    int trigger_mode)
> > +static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
> > +				     int trigger_mode)
> >  {
> > -	union kvm_ioapic_redirect_entry *ent;
> > +	int i;
> > +
> > +	for (i = 0; i < IOAPIC_NUM_PINS; i++) {
> > +		union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
> > +
> > +		if (ent->fields.vector != vector)
> > +			continue;
> >  
> > -	ent = &ioapic->redirtbl[pin];
> > +		/* ack notifier may call back into ioapic via kvm_set_irq()  */
> > +		spin_unlock(&ioapic->lock);
> > +		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> > +		spin_lock(&ioapic->lock);
> 
> While traversing the IOAPIC pins you drop the lock and acquire it again,
> which is error prone: what if the redirection table is changed in
> between, for example?
> 
Yes, the ack notifiers are the problematic part. The only thing the
current ack notifiers do is set_irq_level(0), so this is not a problem
in practice. I'll see if I can eliminate this dropping of the lock here.

> Also, irq_lock was held during ack and mask notifier callbacks, so they
> (the callbacks) did not have to worry about concurrency.
> 
Is it possible to have more than one ack for the same interrupt line at
the same time?

> You questioned the purpose of irq_lock earlier, and that's what it is:
> synchronization between the pic and ioapic, which blur together at the
> irq notifiers.
> 
The PIC has its own locking and it fails miserably with regard to ack
notifiers. I bet nobody has tried device assignment with the PIC. It
will not work.

irq_lock is indeed used to synchronize ioapic access, but the way it is
done requires calling kvm_set_irq() with the lock held, and this adds
unnecessary locking on the MSI irq injection path.

> Using RCU to synchronize ack/mask notifier list walk versus list
> addition is good, but I'm not entirely sure that smaller locks like you
> are proposing are a benefit long term (things become much more tricky).
Without removing irq_lock, RCU is useless, since the list walking is
always done with irq_lock held. I see the smaller locks as a
simplification, BTW. Things are tricky now, or so I felt while trying to
understand what irq_lock actually protects. With smaller locks whose
names clearly indicate which data structure each protects, I find the
code much easier to understand.

> 
> Maybe it is the only way forward (and so you push this complexity to the
> ack/mask notifiers), but I think it can be avoided until it's proven that
> there is contention on irq_lock (which is reduced by faster search).
This is not about contention. This is about the existence of the lock in
the first place. With the patch set there is no lock on the MSI irq
injection path at all, and that is better than having a lock, whether
the lock is contended or not. There is still a lock on the ioapic path
(which MSI does not use), and removing that lock should be done only
after we see that it is contended, because removing it involves adding a
number of atomic accesses, so it is not a clear win without lock
contention. (Removing it would also solve the ack notification problem,
hmmm.)
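
Just to spell out why the injection path can be lockless: the routing
table is never modified in place, it is replaced as a whole and the old
copy is reclaimed after a grace period. Roughly (a sketch; the helper
name is made up, and the exact serialization of updaters is a detail):

	static void install_irq_routing(struct kvm *kvm,
					struct kvm_irq_routing_table *new)
	{
		struct kvm_irq_routing_table *old;

		spin_lock(&kvm->irq_routing_lock);	/* updaters only */
		old = kvm->irq_routing;
		rcu_assign_pointer(kvm->irq_routing, new);	/* publish */
		spin_unlock(&kvm->irq_routing_lock);

		synchronize_rcu();	/* wait out readers in kvm_set_irq() */
		kfree(old);
	}

The readers pay rcu_read_lock()/rcu_dereference() and nothing else.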

--
			Gleb.

* Re: [PATCH 10/10] Change irq routing table to use gsi indexed array.
  2009-07-15 19:17   ` Michael S. Tsirkin
@ 2009-07-15 20:48     ` Gleb Natapov
  0 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-15 20:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, mtosatti

On Wed, Jul 15, 2009 at 10:17:22PM +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 14, 2009 at 05:30:45PM +0300, Gleb Natapov wrote:
> > @@ -147,14 +149,13 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
> >  	 * writes to the unused one.
> >  	 */
> >  	rcu_read_lock();
> > -	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
> > -		if (e->gsi == irq) {
> > -			int r = e->set(e, kvm, sig_level);
> > -			if (r < 0)
> > -				continue;
> > +	irq_rt = rcu_dereference(kvm->irq_routing);
> > +	hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
> 
> Don't you need to range-check irq? E.g. with irqfd, gsi is
> controlled by guest.
> 
Yes, I need to add range checking. Good point.
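
Something along these lines, I think (sketch only, against this patch):

	rcu_read_lock();
	irq_rt = rcu_dereference(kvm->irq_routing);
	if (irq >= 0 && irq < irq_rt->max_gsi)	/* gsi can be guest controlled */
		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
			int r = e->set(e, kvm, sig_level);
			if (r < 0)
				continue;
			ret = r + ((ret < 0) ? 0 : ret);
		}
	rcu_read_unlock();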

--
			Gleb.

* Re: [PATCH 10/10] Change irq routing table to use gsi indexed array.
  2009-07-15 18:18   ` Marcelo Tosatti
@ 2009-07-15 20:52     ` Gleb Natapov
  2009-07-15 21:42       ` Marcelo Tosatti
  0 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2009-07-15 20:52 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On Wed, Jul 15, 2009 at 03:18:00PM -0300, Marcelo Tosatti wrote:
> On Tue, Jul 14, 2009 at 05:30:45PM +0300, Gleb Natapov wrote:
> > Use gsi indexed array instead of scanning all entries on each interrupt
> > injection. Also maintain back mapping from irqchip/pin to gsi to speedup
> > interrupt acknowledgment notifications.
> > 
> > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > [...]
> > -	/* last elemet is left zeored and indicates the end of the array */
> > -	new = kzalloc(sizeof(*new) * (nr + 1), GFP_KERNEL);
> > +	for (i = 0; i < nr; ++i) {
> > +		if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
> > +			return -EINVAL;
> > +		max_gsi = max(max_gsi, ue[i].gsi);
> > +	}
> > +
> > +	max_gsi += 1;
> > +
> > +	new = kzalloc(sizeof(*new) + (max_gsi * sizeof(struct hlist_head)) +
> > +		      (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
> > +		      GFP_KERNEL);
> 
> Why don't you allocate the hlist_head's and the routing entries
> separately?
> 
I prefer it that way because cleanup after error is much easier. What
are the disadvantages?
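
To be explicit about what the single allocation holds:

	new                        header: chip[][], max_gsi, rt_entries
	new->map[0..max_gsi-1]     gsi -> entry hash heads (the [0] array)
	new->rt_entries[0..nr-1]   the entries themselves, carved out right
	                           behind the map: (void *)&new->map[max_gsi]

so no matter where the setup loop fails, a single kfree(new) releases
everything.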

> >  
> >  	if (!new)
> >  		return -ENOMEM;
> >  
> > +	new->rt_entries = (void *)&new->map[max_gsi];
> > +
> > +	new->max_gsi = max_gsi;
> > +	for (i = 0; i < 3; i++)
> > +		for (j = 0; j < KVM_IOAPIC_NUM_PINS; j++)
> > +			new->chip[i][j] = -1;
> > +
> 
> Should use something else instead of 3. Maybe make it dynamic, for
> multiple IOAPIC support (but you can argue that's another problem).
> 
This is (another problem). The code has 1 IOAPIC hardcoded pretty
deeply, even at the user/kernel API level. We will solve it some day.
 
--
			Gleb.

* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-15 20:48     ` Gleb Natapov
@ 2009-07-15 21:38       ` Marcelo Tosatti
  2009-07-16  7:51         ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2009-07-15 21:38 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm

On Wed, Jul 15, 2009 at 11:48:17PM +0300, Gleb Natapov wrote:
> > > +		spin_unlock(&ioapic->lock);
> > > +		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> > > +		spin_lock(&ioapic->lock);
> > 
> > While traversing the IOAPIC pins you drop the lock and acquire it again,
> > which is error prone: what if the redirection table is changed in
> > between, for example?
> > 
> Yes, the ack notifiers are the problematic part. The only thing the
> current ack notifiers do is set_irq_level(0), so this is not a problem
> in practice. I'll see if I can eliminate this dropping of the lock here.

Currently, yes. But take into account that an ack notifier might do
things other than set_irq_level(0).

Say, for example, an ack notifier that takes the PIC or IOAPIC lock
(for whatever reason).

> > Also, irq_lock was held during ack and mask notifier callbacks, so they
> > (the callbacks) did not have to worry about concurrency.
> > 
> > Is it possible to have more than one ack for the same interrupt line
> > at the same time?

Why not? But maybe this is a somewhat stupid point, because you can
require the notifiers to handle that.

> > You questioned the purpose of irq_lock earlier, and that's what it is:
> > synchronization between the pic and ioapic, which blur together at the
> > irq notifiers.
> > 
> The PIC has its own locking and it fails miserably with regard to ack
> notifiers. I bet nobody has tried device assignment with the PIC. It
> will not work.

It fails to take irq_lock on automatic EOI because on guest entry 
interrupts are disabled (see d5ecfdd25c412df9c0ecf3ab8e066461fd4f69df).

> irq_lock is indeed used to synchronize ioapic access, but the way it is
> done requires calling kvm_set_irq() with the lock held, and this adds
> unnecessary locking on the MSI irq injection path.
> 
> > Using RCU to synchronize ack/mask notifier list walk versus list
> > addition is good, but I'm not entirely sure that smaller locks like you
> > are proposing are a benefit long term (things become much more tricky).
> Without removing irq_lock, RCU is useless, since the list walking is
> always done with irq_lock held. I see the smaller locks as a
> simplification, BTW. Things are tricky now, or so I felt while trying to
> understand what irq_lock actually protects. With smaller locks whose
> names clearly indicate which data structure each protects, I find the
> code much easier to understand.

What is worrying is whether you will need to take both the PIC and
IOAPIC locks for some operation (and then have to worry about lock
ordering, etc.).

* Re: [PATCH 10/10] Change irq routing table to use gsi indexed array.
  2009-07-15 20:52     ` Gleb Natapov
@ 2009-07-15 21:42       ` Marcelo Tosatti
  2009-07-16  6:05         ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2009-07-15 21:42 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm

On Wed, Jul 15, 2009 at 11:52:24PM +0300, Gleb Natapov wrote:
> On Wed, Jul 15, 2009 at 03:18:00PM -0300, Marcelo Tosatti wrote:
> > On Tue, Jul 14, 2009 at 05:30:45PM +0300, Gleb Natapov wrote:
> > > Use gsi indexed array instead of scanning all entries on each interrupt
> > > injection. Also maintain back mapping from irqchip/pin to gsi to speedup
> > > interrupt acknowledgment notifications.
> > > 
> > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > [...]
> > > -	/* last elemet is left zeored and indicates the end of the array */
> > > -	new = kzalloc(sizeof(*new) * (nr + 1), GFP_KERNEL);
> > > +	for (i = 0; i < nr; ++i) {
> > > +		if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
> > > +			return -EINVAL;
> > > +		max_gsi = max(max_gsi, ue[i].gsi);
> > > +	}
> > > +
> > > +	max_gsi += 1;
> > > +
> > > +	new = kzalloc(sizeof(*new) + (max_gsi * sizeof(struct hlist_head)) +
> > > +		      (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
> > > +		      GFP_KERNEL);
> > 
> > Why don't you allocate the hlist_head's and the routing entries
> > separately?
> > 
> I prefer it that way because cleanup after error is much easier. What
> are the disadvantages?

They are two data structures (two different arrays). Also, as mentioned
before by others, the allocation size of the irq_routing array might
become an issue.

> > >  
> > >  	if (!new)
> > >  		return -ENOMEM;
> > >  
> > > +	new->rt_entries = (void *)&new->map[max_gsi];
> > > +
> > > +	new->max_gsi = max_gsi;
> > > +	for (i = 0; i < 3; i++)
> > > +		for (j = 0; j < KVM_IOAPIC_NUM_PINS; j++)
> > > +			new->chip[i][j] = -1;
> > > +
> > 
> > Should use something else instead of 3. Maybe make it dynamic, for
> > multiple IOAPIC support (but you can argue that's another problem).
> > 
> This is (another problem). The code has 1 IOAPIC hardcoded pretty
> deeply, even at the user/kernel API level. We will solve it some day.

OK

* Re: [PATCH 10/10] Change irq routing table to use gsi indexed array.
  2009-07-15 21:42       ` Marcelo Tosatti
@ 2009-07-16  6:05         ` Gleb Natapov
  0 siblings, 0 replies; 26+ messages in thread
From: Gleb Natapov @ 2009-07-16  6:05 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On Wed, Jul 15, 2009 at 06:42:05PM -0300, Marcelo Tosatti wrote:
> On Wed, Jul 15, 2009 at 11:52:24PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 15, 2009 at 03:18:00PM -0300, Marcelo Tosatti wrote:
> > > On Tue, Jul 14, 2009 at 05:30:45PM +0300, Gleb Natapov wrote:
> > > > Use gsi indexed array instead of scanning all entries on each interrupt
> > > > injection. Also maintain back mapping from irqchip/pin to gsi to speedup
> > > > interrupt acknowledgment notifications.
> > > > 
> > > > Signed-off-by: Gleb Natapov <gleb@redhat.com>
> > > > ---
> > > >  include/linux/kvm_host.h |   11 ++++++-
> > > >  virt/kvm/irq_comm.c      |   62 ++++++++++++++++++++++++++++-----------------
> > > >  2 files changed, 47 insertions(+), 26 deletions(-)
> > > > 
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index aa64d0d..ae6cbf1 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -128,7 +128,14 @@ struct kvm_kernel_irq_routing_entry {
> > > >  		} irqchip;
> > > >  		struct msi_msg msi;
> > > >  	};
> > > > -	struct list_head link;
> > > > +	struct hlist_node link;
> > > > +};
> > > > +
> > > > +struct kvm_irq_routing_table {
> > > > +	int chip[3][KVM_IOAPIC_NUM_PINS];
> > > > +	struct kvm_kernel_irq_routing_entry *rt_entries;
> > > > +	u32 max_gsi;
> > > > +	struct hlist_head map[0];
> > > >  };
> > > >  
> > > >  struct kvm {
> > > > @@ -165,7 +172,7 @@ struct kvm {
> > > >  #endif
> > > >  
> > > >  #ifdef CONFIG_HAVE_KVM_IRQCHIP
> > > > -	struct kvm_kernel_irq_routing_entry *irq_routing;
> > > > +	struct kvm_irq_routing_table *irq_routing;
> > > >  	spinlock_t irq_routing_lock;
> > > >  	struct hlist_head mask_notifier_list;
> > > >  	struct hlist_head irq_ack_notifier_list;
> > > > diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
> > > > index c54a28b..da643d4 100644
> > > > --- a/virt/kvm/irq_comm.c
> > > > +++ b/virt/kvm/irq_comm.c
> > > > @@ -125,6 +125,8 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
> > > >  	struct kvm_kernel_irq_routing_entry *e;
> > > >  	unsigned long *irq_state, sig_level;
> > > >  	int ret = -1;
> > > > +	struct kvm_irq_routing_table *irq_rt;
> > > > +	struct hlist_node *n;
> > > >  
> > > >  	trace_kvm_set_irq(irq, level, irq_source_id);
> > > >  
> > > > @@ -147,14 +149,13 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
> > > >  	 * writes to the unused one.
> > > >  	 */
> > > >  	rcu_read_lock();
> > > > -	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
> > > > -		if (e->gsi == irq) {
> > > > -			int r = e->set(e, kvm, sig_level);
> > > > -			if (r < 0)
> > > > -				continue;
> > > > +	irq_rt = rcu_dereference(kvm->irq_routing);
> > > > +	hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
> > > > +		int r = e->set(e, kvm, sig_level);
> > > > +		if (r < 0)
> > > > +			continue;
> > > >  
> > > > -			ret = r + ((ret < 0) ? 0 : ret);
> > > > -		}
> > > > +		ret = r + ((ret < 0) ? 0 : ret);
> > > >  	}
> > > >  	rcu_read_unlock();
> > > >  	return ret;
> > > > @@ -162,21 +163,16 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, int irq, int level)
> > > >  
> > > >  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
> > > >  {
> > > > -	struct kvm_kernel_irq_routing_entry *e;
> > > >  	struct kvm_irq_ack_notifier *kian;
> > > >  	struct hlist_node *n;
> > > > -	unsigned gsi = pin;
> > > > +	unsigned gsi;
> > > >  
> > > >  	trace_kvm_ack_irq(irqchip, pin);
> > > >  
> > > >  	rcu_read_lock();
> > > > -	for (e = rcu_dereference(kvm->irq_routing); e && e->set; e++) {
> > > > -		if (e->irqchip.irqchip == irqchip &&
> > > > -		    e->irqchip.pin == pin) {
> > > > -			gsi = e->gsi;
> > > > -			break;
> > > > -		}
> > > > -	}
> > > > +	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
> > > > +	if (gsi == -1)
> > > > +		gsi = pin;
> > > >  
> > > >  	hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
> > > >  		if (kian->gsi == gsi)
> > > > @@ -277,7 +273,8 @@ void kvm_free_irq_routing(struct kvm *kvm)
> > > >  	kfree(kvm->irq_routing);
> > > >  }
> > > >  
> > > > -static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
> > > > +static int setup_routing_entry(struct kvm_irq_routing_table *rt,
> > > > +			       struct kvm_kernel_irq_routing_entry *e,
> > > >  			       const struct kvm_irq_routing_entry *ue)
> > > >  {
> > > >  	int r = -EINVAL;
> > > > @@ -303,6 +300,7 @@ static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
> > > >  		}
> > > >  		e->irqchip.irqchip = ue->u.irqchip.irqchip;
> > > >  		e->irqchip.pin = ue->u.irqchip.pin + delta;
> > > > +		rt->chip[ue->u.irqchip.irqchip][e->irqchip.pin] = ue->gsi;
> > > >  		break;
> > > >  	case KVM_IRQ_ROUTING_MSI:
> > > >  		e->set = kvm_set_msi;
> > > > @@ -313,6 +311,8 @@ static int setup_routing_entry(struct kvm_kernel_irq_routing_entry *e,
> > > >  	default:
> > > >  		goto out;
> > > >  	}
> > > > +
> > > > +	hlist_add_head(&e->link, &rt->map[e->gsi]);
> > > >  	r = 0;
> > > >  out:
> > > >  	return r;
> > > > @@ -324,23 +324,37 @@ int kvm_set_irq_routing(struct kvm *kvm,
> > > >  			unsigned nr,
> > > >  			unsigned flags)
> > > >  {
> > > > -	struct kvm_kernel_irq_routing_entry *new, *old;
> > > > -	unsigned i;
> > > > +	struct kvm_irq_routing_table *new, *old;
> > > > +	u32 i, j, max_gsi = 0;
> > > >  	int r;
> > > >  
> > > > -	/* last elemet is left zeored and indicates the end of the array */
> > > > -	new = kzalloc(sizeof(*new) * (nr + 1), GFP_KERNEL);
> > > > +	for (i = 0; i < nr; ++i) {
> > > > +		if (ue[i].gsi >= KVM_MAX_IRQ_ROUTES)
> > > > +			return -EINVAL;
> > > > +		max_gsi = max(max_gsi, ue[i].gsi);
> > > > +	}
> > > > +
> > > > +	max_gsi += 1;
> > > > +
> > > > +	new = kzalloc(sizeof(*new) + (max_gsi * sizeof(struct hlist_head)) +
> > > > +		      (nr * sizeof(struct kvm_kernel_irq_routing_entry)),
> > > > +		      GFP_KERNEL);
> > > 
> > > Why don't you allocate the hlist_head's and the routing entries
> > > separately?
> > > 
> > I prefer it that way because cleanup after error is much easier. What
> > are the disadvantages?
> 
> They are two data structures (two different arrays).

Logically it is one data structure that includes two arrays and some
other fields.

> Also, as mentioned before by others, the allocation size of the
> irq_routing array might become an issue.

KVM_MAX_IRQ_ROUTES will be significantly reduced, so this will not be a
problem. I plan to reduce it to 128. Will it be OK to use vmalloc() if
the size is greater than one page?
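
I.e. something like this (a sketch; the free path would then have to
remember which allocator was used):

	size_t sz = sizeof(*new) + max_gsi * sizeof(struct hlist_head) +
		    nr * sizeof(struct kvm_kernel_irq_routing_entry);

	if (sz <= PAGE_SIZE)
		new = kzalloc(sz, GFP_KERNEL);
	else {
		new = vmalloc(sz);
		if (new)
			memset(new, 0, sz);	/* vmalloc() does not zero */
	}
	if (!new)
		return -ENOMEM;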

--
			Gleb.

* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-15 21:38       ` Marcelo Tosatti
@ 2009-07-16  7:51         ` Gleb Natapov
  2009-07-16 18:09           ` Marcelo Tosatti
  0 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2009-07-16  7:51 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On Wed, Jul 15, 2009 at 06:38:48PM -0300, Marcelo Tosatti wrote:
> On Wed, Jul 15, 2009 at 11:48:17PM +0300, Gleb Natapov wrote:
> > > > +		spin_unlock(&ioapic->lock);
> > > > +		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> > > > +		spin_lock(&ioapic->lock);
> > > 
> > > While traversing the IOAPIC pins you drop the lock and acquire it again,
> > > which is error prone: what if the redirection table is changed in
> > > between, for example?
> > > 
> > Yes, the ack notifiers are the problematic part. The only thing the
> > current ack notifiers do is set_irq_level(0), so this is not a problem
> > in practice. I'll see if I can eliminate this dropping of the lock here.
> 
> Currently, yes. But take into account that an ack notifier might do
> things other than set_irq_level(0).
> 
For instance? An ack notifier can do whatever it wants as long as it
does not call back into the ioapic. It may call into the ioapic only via
set_irq_level(0) (which it does now from the device assignment code) or
via set_irq_level(1), and the latter makes no sense for level-triggered
interrupts because not calling set_irq_level(1) would have the same
effect.

But I think I have found a way to not drop the lock here.
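
(For illustration only, not necessarily what I will end up posting: one
way is to record the pins that need an ack while holding the lock and
call out only afterwards, e.g.

	unsigned long pending = 0;
	int i;

	spin_lock(&ioapic->lock);
	for (i = 0; i < IOAPIC_NUM_PINS; i++)
		if (ioapic->redirtbl[i].fields.vector == vector)
			__set_bit(i, &pending);
	/* remote_irr clearing / re-service of level irqs omitted here */
	spin_unlock(&ioapic->lock);

	for (i = 0; i < IOAPIC_NUM_PINS; i++)
		if (test_bit(i, &pending))
			kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);

so the redirection table is never scanned with the lock dropped.)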

> Say, for example, an ack notifier that takes the PIC or IOAPIC lock
> (for whatever reason).
What reason? No one should take the PIC or IOAPIC lock outside of PIC or
IOAPIC code. That is a layering violation. The IOAPIC should be accessed
only through its public interface (set_irq/mmio_read|write/update_eoi).

> 
> > > Also, irq_lock was held during ack and mask notifier callbacks, so they
> > > (the callbacks) did not have to worry about concurrency.
> > > 
> > Is it possible to have more than one ack for the same interrupt line
> > at the same time?
> 
> Why not? But maybe this is a somewhat stupid point, because you can
> require the notifiers to handle that.
If this is possible (and it looks like it may happen if the IOAPIC
broadcasts an interrupt vector to more than one vcpu), then it may be
better to push the complexity into the ack notifier so as not to
penalize the common scenario. But again, if we do not drop the lock
around the ack notifier call, they will be serialized.

> 
> > > You questioned the purpose of irq_lock earlier, and that's what it
> > > is: synchronization between the pic and ioapic, which blur together
> > > at the irq notifiers.
> > > 
> > The PIC has its own locking and it fails miserably with regard to ack
> > notifiers. I bet nobody has tried device assignment with the PIC. It
> > will not work.
> 
> It fails to take irq_lock on automatic EOI because on guest entry 
> interrupts are disabled (see d5ecfdd25c412df9c0ecf3ab8e066461fd4f69df).
This explains why the code is buggy, but it does not fix the bug. There
are at least two bugs with the PIC + ack notifiers. The first is that
the ack notifiers are called without irq_lock held, which will trigger
the WARN_ON(!mutex_is_locked(&kvm->irq_lock)) in kvm_set_irq() in the
device assignment case (besides, what is the point of a lock if it is
not taken on every code path?). Another is that you can't postpone the
call to the ack notifiers in kvm_pic_read_irq() till after
pic_update_irq(s). The reason is that pic_update_irq() will re-trigger
the just-acked interrupt immediately, since the ack notifier is the one
that lowers the irq line in the device assignment case (that is the
reason I haven't done the same in the ioapic case, BTW).

> 
> > irq_lock is indeed used to synchronize ioapic access, but the way it
> > is done requires calling kvm_set_irq() with the lock held, and this
> > adds unnecessary locking on the MSI irq injection path.
> > 
> > > Using RCU to synchronize ack/mask notifier list walk versus list
> > > addition is good, but I'm not entirely sure that smaller locks like
> > > you are proposing are a benefit long term (things become much more
> > > tricky).
> > Without removing irq_lock, RCU is useless, since the list walking is
> > always done with irq_lock held. I see the smaller locks as a
> > simplification, BTW. Things are tricky now, or so I felt while trying
> > to understand what irq_lock actually protects. With smaller locks
> > whose names clearly indicate which data structure each protects, I
> > find the code much easier to understand.
> 
> What is worrying is if you need to take both PIC and IOAPIC locks for 
> some operation (then have to worry about lock ordering etc).
> 
Let's not worry about far-fetched theoretical cases, especially since
you have the same problem now. What if you need to take both the pic
lock and irq_lock? Except that this is not so theoretical: with the
current code we need to take them both, we just don't do it.

> > > Maybe it is the only way forward (and so you push this complexity to the
> > > ack/mask notifiers), but i think it can be avoided until its proven that
> > > there is contention on irq_lock (which is reduced by faster search).
> > This is not about contention. This is about existence of the lock in the
> > first place. With the patch set there is no lock at msi irq injection
> > path at all and this is better than having lock no matter if the lock is
> > congested or not. There is still a lock on ioapic path (which MSI does
> > not use) and removing of this lock should be done only after we see
> > that it is congested, because removing it involves adding a number of
> > atomic accesses, so it is not clear win without lock contention.
> > (removing it will also solve ack notification problem hmmm)

--
			Gleb.


* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-16  7:51         ` Gleb Natapov
@ 2009-07-16 18:09           ` Marcelo Tosatti
  2009-07-16 20:49             ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2009-07-16 18:09 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm

On Thu, Jul 16, 2009 at 10:51:16AM +0300, Gleb Natapov wrote:
> On Wed, Jul 15, 2009 at 06:38:48PM -0300, Marcelo Tosatti wrote:
> > On Wed, Jul 15, 2009 at 11:48:17PM +0300, Gleb Natapov wrote:
> > > > > +		spin_unlock(&ioapic->lock);
> > > > > +		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> > > > > +		spin_lock(&ioapic->lock);
> > > > 
> > > > While traversing the IOAPIC pins you drop the lock and acquire it again,
> > > > which is error prone: what if the redirection table is changed in
> > > > between, for example?
> > > > 
> > > Yes, the ack notifiers is a problematic part. The only thing that
> > > current ack notifiers do is set_irq_level(0) so this is not the problem
> > > in practice. I'll see if I can eliminate this dropping of the lock here.
> > 
> > Currently, yes. But take into account that an ack notifier might do 
> > other things than set_irq_level(0).
> > 
> For instance? Ack notifier can do whatever it wants if it does not call
> back into ioapic. It may call to ioapic only by set_irq_level(0) (which
> it does now from device assignment code) or by set_irq_level(1) and this
> does not make sense for level triggered interrupts because not calling
> set_irq_level(1) will have the same effect.

Checking whether a given gsi is masked on the
PIC/IOAPIC from the PIC/IOAPIC mask notifiers
(http://www.spinics.net/lists/kvm/msg19093.html), for example. 

Do you think that particular case should be handled in a different way?
(it could check whether a GSI is masked or not on PIC/IOAPIC/LAPIC from
a context other than the mask notifier).

> But I think I found the way to not drop the lock here.

See, this is adding more complexity than simplifying things (the
recursive check).

> > Say for example an ack notifier that takes the PIC or IOAPIC lock 
> > (for whatever reason).
> What reason? No one should take PIC or IOAPIC lock outside of PIC or
> IOAPIC code. This is layering violation. IOAPIC should be accessed only
> through its public interface (set_irq/mmio_read|write/update_eoi)

See link above.

> > 
> > > > Also, irq_lock was held during ack and mask notifier callbacks, so they
> > > > (the callbacks) did not have to worry about concurrency.
> > > > 
> > > Is it possible to have more then one ack for the same interrupt line at
> > > the same time?
> > 
> > Why not? But maybe this is a somewhat stupid point, because you can
> > require the notifiers to handle that.
> If this is possible (and it looks like it may happen if IOAPIC broadcasts
> an interrupt vector to more then one vcpu) then it may be better to push
> complexity into an ack notifier to not penalize a common scenario.
> But again, if we will not drop the lock around ack notifier call they
> will be serialized.
> 
> > 
> > > > You questioned the purpose of irq_lock earlier, and thats what it is:
> > > > synchronization between pic and ioapic blur at the irq notifiers.
> > > > 
> > > Pic has its own locking and it fails miserably in regards ack notifiers.
> > > I bet nobody tried device assignment with pic. It will not work.
> > 
> > It fails to take irq_lock on automatic EOI because on guest entry 
> > interrupts are disabled (see d5ecfdd25c412df9c0ecf3ab8e066461fd4f69df).
> This explains why the code is buggy, but does not fix the bug. There are
> two bugs at least with the pic + ack notifiers. The first one is that
> irq notifiers are called without irq_lock held and that will trigger
> WARN_ON(!mutex_is_locked(&kvm->irq_lock)) in kvm_set_irq() in device
> assignment case (besides what is the point of a lock if it is not taken
> on every code path?).

This could be fixed by switching irq_lock to a spinlock.

>  Another one is that you can't postpone call to ack notifiers in
> kvm_pic_read_irq() till after pic_update_irq(s). The reason is that
> pic_update_irq() will trigger acked interrupt immediately since ack
> notifier is the one who lowers irq line in device assignment case
> (that is the reason I haven't done the same in ioapic case BTW).

I'm not sure I understand this one, can you be more verbose please?

> > > irq_lock is indeed used to synchronization ioapic access, but the way
> > > it is done requires calling kvm_set_irq() with lock held and this adds
> > > unnecessary lock on a msi irq injection path.
> > > 
> > > > Using RCU to synchronize ack/mask notifier list walk versus list
> > > > addition is good, but i'm not entirely sure that smaller locks like you
> > > > are proposing is a benefit long term (things become much more tricky).
> > > Without removing irq_lock RCU is useless since the list walking is always
> > > done with irq_lock held. I see smaller locks as simplification BTW. The
> > > thing is tricky now, or so I felt while trying to understand what
> > > irq_lock actually protects. With smaller locks with names that clearly
> > > indicates which data structure it protects I feel the code is much
> > > easy to understand.
> > 
> > What is worrying is if you need to take both PIC and IOAPIC locks for 
> > some operation (then have to worry about lock ordering etc).
> > 
> Lets not worry about far fetched theoretical cases especially since you
> have the same problem now. 

Well, the issue is that you are defining what can be done in notifier
context, so it does matter.

>  What if you'll need to take both pic lock and
> irq_lock? Except that this is not so theoretical because we need to
> take them both with current code we just don't do it.

    While most accesses to the i8259 are with the kvm mutex taken, the call
    to kvm_pic_read_irq() is not.  We can't easily take the kvm mutex there
    since the function is called with interrupts disabled.

Convert irq_lock to a spinlock.

    Fix by adding a spinlock to the virtual interrupt controller. Since
    we can't send an IPI under the spinlock (we also take the same spinlock
    in an irq disabled context), we defer the IPI until the spinlock is
    released. Similarly, we defer irq ack notifications until after spinlock
    release to avoid lock recursion.

It's possible to send IPIs under spinlocks now.

Again, I'm not objecting to these changes, but playing devil's advocate.

> > > > Maybe it is the only way forward (and so you push this complexity to the
> > > > ack/mask notifiers), but i think it can be avoided until its proven that
> > > > there is contention on irq_lock (which is reduced by faster search).
> > > This is not about contention. This is about existence of the lock in the
> > > first place. With the patch set there is no lock at msi irq injection
> > > path at all and this is better than having lock no matter if the lock is
> > > congested or not. There is still a lock on ioapic path (which MSI does
> > > not use) and removing of this lock should be done only after we see
> > > that it is congested, because removing it involves adding a number of
> > > atomic accesses, so it is not clear win without lock contention.
> > > (removing it will also solve ack notification problem hmmm)
> 
> --
> 			Gleb.


* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-16 18:09           ` Marcelo Tosatti
@ 2009-07-16 20:49             ` Gleb Natapov
  2009-07-17 15:19               ` Marcelo Tosatti
  0 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2009-07-16 20:49 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On Thu, Jul 16, 2009 at 03:09:30PM -0300, Marcelo Tosatti wrote:
> On Thu, Jul 16, 2009 at 10:51:16AM +0300, Gleb Natapov wrote:
> > On Wed, Jul 15, 2009 at 06:38:48PM -0300, Marcelo Tosatti wrote:
> > > On Wed, Jul 15, 2009 at 11:48:17PM +0300, Gleb Natapov wrote:
> > > > > > +		spin_unlock(&ioapic->lock);
> > > > > > +		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> > > > > > +		spin_lock(&ioapic->lock);
> > > > > 
> > > > > While traversing the IOAPIC pins you drop the lock and acquire it again,
> > > > > which is error prone: what if the redirection table is changed in
> > > > > between, for example?
> > > > > 
> > > > Yes, the ack notifiers is a problematic part. The only thing that
> > > > current ack notifiers do is set_irq_level(0) so this is not the problem
> > > > in practice. I'll see if I can eliminate this dropping of the lock here.
> > > 
> > > Currently, yes. But take into account that an ack notifier might do 
> > > other things than set_irq_level(0).
> > > 
> > For instance? Ack notifier can do whatever it wants if it does not call
> > back into ioapic. It may call to ioapic only by set_irq_level(0) (which
> > it does now from device assignment code) or by set_irq_level(1) and this
> > does not make sense for level triggered interrupts because not calling
> > set_irq_level(1) will have the same effect.
> 
> Checking whether a given gsi is masked on the
> PIC/IOAPIC from the PIC/IOAPIC mask notifiers
> (http://www.spinics.net/lists/kvm/msg19093.html), for example. 
> 
> Do you think that particular case should be handled in a different way?
I dislike ack notifiers and think that they should not be used if
possible. The kvm_set_irq() return value provides enough info to know
whether the interrupt was delivered or masked, so the pit case can (and
should) be handled without using irq notifiers at all. See how RTC
interrupt coalescing is handled in qemu (it does not handle masking
correctly, but only because the qemu community enjoys having their code
broken).
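
For illustration, a hedged sketch of what timer handling on top of that
return value could look like (the state struct and its pending counter
are hypothetical; only the convention that zero means masked/coalesced
is taken from this discussion):

    struct timer_state {
            struct kvm *kvm;
            int irq_source_id;
            int gsi;
            atomic_t pending;   /* undelivered ticks to reinject later */
    };

    static void timer_fire(struct timer_state *ts)
    {
            /* Pulse the line; the return value says whether any vcpu
             * actually received the interrupt. */
            int delivered = kvm_set_irq(ts->kvm, ts->irq_source_id,
                                        ts->gsi, 1);
            kvm_set_irq(ts->kvm, ts->irq_source_id, ts->gsi, 0);

            if (delivered == 0)
                    atomic_inc(&ts->pending);   /* masked: remember it */
    }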

> (it could check whether a GSI is masked or not on both PIC/IOAPIC/LAPIC
> from a different context other than mask notifier).
> 
> > But I think I found the way to not drop the lock here.
> 
> See, this is adding more complexity than simplifying things (the
> recursive check).
This is not more complex than the current code (IMHO).

> 
> > > Say for example an ack notifier that takes the PIC or IOAPIC lock 
> > > (for whatever reason).
> > What reason? No one should take PIC or IOAPIC lock outside of PIC or
> > IOAPIC code. This is layering violation. IOAPIC should be accessed only
> > through its public interface (set_irq/mmio_read|write/update_eoi)
> 
> See link above.
I can't see how the code from the link above can work with the current
code, since mask notifiers are not called from the PIC.

> 
> > > 
> > > > > Also, irq_lock was held during ack and mask notifier callbacks, so they
> > > > > (the callbacks) did not have to worry about concurrency.
> > > > > 
> > > > Is it possible to have more then one ack for the same interrupt line at
> > > > the same time?
> > > 
> > > Why not? But maybe this is a somewhat stupid point, because you can
> > > require the notifiers to handle that.
> > If this is possible (and it looks like it may happen if IOAPIC broadcasts
> > an interrupt vector to more then one vcpu) then it may be better to push
> > complexity into an ack notifier to not penalize a common scenario.
> > But again, if we will not drop the lock around ack notifier call they
> > will be serialized.
> > 
> > > 
> > > > > You questioned the purpose of irq_lock earlier, and thats what it is:
> > > > > synchronization between pic and ioapic blur at the irq notifiers.
> > > > > 
> > > > Pic has its own locking and it fails miserably in regards ack notifiers.
> > > > I bet nobody tried device assignment with pic. It will not work.
> > > 
> > > It fails to take irq_lock on automatic EOI because on guest entry 
> > > interrupts are disabled (see d5ecfdd25c412df9c0ecf3ab8e066461fd4f69df).
> > This explains why the code is buggy, but does not fix the bug. There are
> > two bugs at least with the pic + ack notifiers. The first one is that
> > irq notifiers are called without irq_lock held and that will trigger
> > WARN_ON(!mutex_is_locked(&kvm->irq_lock)) in kvm_set_irq() in device
> > assignment case (besides what is the point of a lock if it is not taken
> > on every code path?).
> 
> This could be fixed by switching irq_lock to a spinlock.
Yes, but then you want to minimize the amount of code that runs under a
spinlock.

> 
> >  Another one is that you can't postpone call to ack notifiers in
> > kvm_pic_read_irq() till after pic_update_irq(s). The reason is that
> > pic_update_irq() will trigger acked interrupt immediately since ack
> > notifier is the one who lowers irq line in device assignment case
> > (that is the reason I haven't done the same in ioapic case BTW).
> 
> I'm not sure i understand this one, can you be more verbose please?
> 
1. device injects a level irq by calling kvm_pic_set_irq(1)
2. the level is recorded in irr (pic_set_irq1())
3. pic_update_irq() is called and the interrupt is sent to the vcpu
4. the vcpu calls kvm_pic_read_irq() which calls pic_update_irq(s) before
   calling kvm_notify_acked_irq(). The ack callback is the one that calls
   kvm_pic_set_irq(0) to lower the irq line, so when pic_update_irq(s) is
   called the irr still indicates that the irq level is 1 and it is
   delivered yet again.
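
In code form, the broken ordering looks roughly like this (a sketch;
the helper names approximate the i8259 code rather than quoting it):

    int kvm_pic_read_irq(struct kvm *kvm)
    {
            struct kvm_pic *s = pic_irqchip(kvm);
            int irq = pic_get_irq(&s->pics[0]); /* irq being delivered */

            pic_intack(&s->pics[0], irq);       /* irr bit -> isr bit */

            /*
             * If pic_update_irq() runs here, before the ack notifiers,
             * the device assignment notifier has not yet lowered the
             * line, so irr still reads 1 and the interrupt fires again.
             */
            pic_update_irq(s);
            kvm_notify_acked_irq(kvm, KVM_IRQCHIP_PIC_MASTER, irq);

            return irq;
    }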

But there is something strange with the PIC + ack notifiers. The PIC
calls the ack notifiers not when the guest acks the interrupt, but when
KVM picks up the interrupt for delivery. Calling the ack notifiers at
this point is wrong. The device assignment ack notifier enables host
interrupts, but the guest has not yet had a chance to clear the
interrupt condition in the device. Or am I missing something here?

> > > > irq_lock is indeed used to synchronization ioapic access, but the way
> > > > it is done requires calling kvm_set_irq() with lock held and this adds
> > > > unnecessary lock on a msi irq injection path.
> > > > 
> > > > > Using RCU to synchronize ack/mask notifier list walk versus list
> > > > > addition is good, but i'm not entirely sure that smaller locks like you
> > > > > are proposing is a benefit long term (things become much more tricky).
> > > > Without removing irq_lock RCU is useless since the list walking is always
> > > > done with irq_lock held. I see smaller locks as simplification BTW. The
> > > > thing is tricky now, or so I felt while trying to understand what
> > > > irq_lock actually protects. With smaller locks with names that clearly
> > > > indicates which data structure it protects I feel the code is much
> > > > easy to understand.
> > > 
> > > What is worrying is if you need to take both PIC and IOAPIC locks for 
> > > some operation (then have to worry about lock ordering etc).
> > > 
> > Lets not worry about far fetched theoretical cases especially since you
> > have the same problem now. 
> 
> Well the issue is you are defining what can be done in notifier context,
> so it does matter.
> 
Of course, just like now, we have to define what can be done in
notifier context. A notifier can't sleep, for instance, since it may be
called with interrupts disabled.

> >  What if you'll need to take both pic lock and
> > irq_lock? Except that this is not so theoretical because we need to
> > take them both with current code we just don't do it.
> 
>     While most accesses to the i8259 are with the kvm mutex taken, the call
>     to kvm_pic_read_irq() is not.  We can't easily take the kvm mutex there
>     since the function is called with interrupts disabled.
> 
> Convert irq_lock to a spinlock.
Or apply my patchset :) It basically converts irq_lock to a spinlock
and pushes it into the ioapic.

> 
>     Fix by adding a spinlock to the virtual interrupt controller. Since
>     we can't send an IPI under the spinlock (we also take the same spinlock
>     in an irq disabled context), we defer the IPI until the spinlock is
>     released. Similarly, we defer irq ack notifications until after spinlock
>     release to avoid lock recursion.
> 
> Its possible to send IPI's under spinlocks now.
> 
That is what I was thinking too. There is no need for all these tricks
in the PIC now.

> Again, i'm not objecting to these changes, but playing devil's advocate.
> 
Just make the comparison with the current state of affairs fair :)

--
			Gleb.


* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-16 20:49             ` Gleb Natapov
@ 2009-07-17 15:19               ` Marcelo Tosatti
  2009-07-17 17:32                 ` Gleb Natapov
  0 siblings, 1 reply; 26+ messages in thread
From: Marcelo Tosatti @ 2009-07-17 15:19 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm

On Thu, Jul 16, 2009 at 11:49:35PM +0300, Gleb Natapov wrote:
> On Thu, Jul 16, 2009 at 03:09:30PM -0300, Marcelo Tosatti wrote:
> > On Thu, Jul 16, 2009 at 10:51:16AM +0300, Gleb Natapov wrote:
> > > On Wed, Jul 15, 2009 at 06:38:48PM -0300, Marcelo Tosatti wrote:
> > > > On Wed, Jul 15, 2009 at 11:48:17PM +0300, Gleb Natapov wrote:
> > > > > > > +		spin_unlock(&ioapic->lock);
> > > > > > > +		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> > > > > > > +		spin_lock(&ioapic->lock);
> > > > > > 
> > > > > > While traversing the IOAPIC pins you drop the lock and acquire it again,
> > > > > > which is error prone: what if the redirection table is changed in
> > > > > > between, for example?
> > > > > > 
> > > > > Yes, the ack notifiers is a problematic part. The only thing that
> > > > > current ack notifiers do is set_irq_level(0) so this is not the problem
> > > > > in practice. I'll see if I can eliminate this dropping of the lock here.
> > > > 
> > > > Currently, yes. But take into account that an ack notifier might do 
> > > > other things than set_irq_level(0).
> > > > 
> > > For instance? Ack notifier can do whatever it wants if it does not call
> > > back into ioapic. It may call to ioapic only by set_irq_level(0) (which
> > > it does now from device assignment code) or by set_irq_level(1) and this
> > > does not make sense for level triggered interrupts because not calling
> > > set_irq_level(1) will have the same effect.
> > 
> > Checking whether a given gsi is masked on the
> > PIC/IOAPIC from the PIC/IOAPIC mask notifiers
> > (http://www.spinics.net/lists/kvm/msg19093.html), for example. 
> > 
> > Do you think that particular case should be handled in a different way?
> I dislike ack notifiers and think that the should not be used if
> possible. kvm_set_irq() return value provide you enough info to know it
> interrupt was delivered or masked so the pit case can (and should) be
> handled without using irq notifiers at all. See how RTC interrupt
> coalescing is handled in qemu (it does not handle masking correctly, but
> only because qemu community enjoys having their code broken).

There are a few issues that the kvm_set_irq return code cannot handle,
AFAICS (please correct me if I'm wrong, or provide useful points for
discussion).

case 1)
1. kvm_set_irq, delivered OK (IOW, assume the guest has acked the
interrupt, next_timer_event += timer_period).
2. hlt -> kvm_vcpu_block thinks it can sleep, because the next
injectable timer event is in the future.
3. which might not be the case, since we are not certain the guest
has acked the interrupt.

case 2)
1. kvm_set_irq, delivered OK.
2. guest masks the irq.
3. here you do not want to unhalt the vcpu on the next timer event,
because the irq is masked at irqchip level.

So you need both ack notification and mask notification for proper
emulation of UNHALT. You could check whether the interrupt is masked on
injection and avoid the UNHALT for case 2), though.

But to check that, you need to look into both PIC and IOAPIC pins, which 
can happen like:

pic_lock
check if pin is masked
pic_unlock

ioapic_lock
check if pin is masked
ioapic_unlock

And only then, if the interrupt can reach the processor, UNHALT the vcpu.
This could avoid the need to check for PIC/IOAPIC state inside the PIT /
other timers mask notifiers (regarding the current locking discussion).

Those checks can race with another processor unmasking the pin for
the irqchips, but then it's a guest problem.
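
As a sketch, that two-chip check could be wrapped as follows
(kvm_pic_pin_masked() and kvm_ioapic_pin_masked() are hypothetical
helpers, each assumed to take and release its own chip lock
internally):

    /* True if the gsi can still reach a processor, i.e. it is
     * unmasked on at least one irqchip it is routed to. */
    static bool gsi_can_reach_cpu(struct kvm *kvm, int gsi)
    {
            bool pic_masked = kvm_pic_pin_masked(pic_irqchip(kvm), gsi);
            bool ioapic_masked = kvm_ioapic_pin_masked(kvm->arch.vioapic,
                                                       gsi);

            /* Racy against a concurrent unmask by another vcpu, but as
             * noted above that is the guest's problem. */
            return !(pic_masked && ioapic_masked);
    }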

You still need the mask notifier to be able to reset the reinjection
logic on unmask (see 4780c65904f0fc4e312ee2da9383eacbe04e61ea). Is that
the case that's broken with RTC in QEMU?

Also, the fact that you receive an ack notification allows you to make
decisions about behaviour between injection and ack. You can also
measure the delay between these two events, which is not used at the
moment but is potentially useful information.

You do not have that information for AEOI-style interrupts (such as
timer delivery via PIT->LVT0 in NMI mode or PIC AEOI), but there's
nothing you can do to help such cases.

> > (it could check whether a GSI is masked or not on both PIC/IOAPIC/LAPIC
> > from a different context other than mask notifier).
> > 
> > > But I think I found the way to not drop the lock here.
> > 
> > See, this is adding more complexity than simplifying things (the
> > recursive check).
> This is not more complex than current code. (IMHO)

Maybe there is a better way to get rid of that recursion check: maybe
move the notifiers to be called after the internal irqchip state has
been updated, outside the lock?

> > > > Say for example an ack notifier that takes the PIC or IOAPIC lock 
> > > > (for whatever reason).
> > > What reason? No one should take PIC or IOAPIC lock outside of PIC or
> > > IOAPIC code. This is layering violation. IOAPIC should be accessed only
> > > through its public interface (set_irq/mmio_read|write/update_eoi)
> > 
> > See link above.
> I can't see how the code from link above can work with current code
> since mask notifiers are not called from PIC.

Then we're lacking a mask notifier callback in i8259.c.

> > > > 
> > > > > > Also, irq_lock was held during ack and mask notifier callbacks, so they
> > > > > > (the callbacks) did not have to worry about concurrency.
> > > > > > 
> > > > > Is it possible to have more then one ack for the same interrupt line at
> > > > > the same time?
> > > > 
> > > > Why not? But maybe this is a somewhat stupid point, because you can
> > > > require the notifiers to handle that.
> > > If this is possible (and it looks like it may happen if IOAPIC broadcasts
> > > an interrupt vector to more then one vcpu) then it may be better to push
> > > complexity into an ack notifier to not penalize a common scenario.
> > > But again, if we will not drop the lock around ack notifier call they
> > > will be serialized.
> > > 
> > > > 
> > > > > > You questioned the purpose of irq_lock earlier, and thats what it is:
> > > > > > synchronization between pic and ioapic blur at the irq notifiers.
> > > > > > 
> > > > > Pic has its own locking and it fails miserably in regards ack notifiers.
> > > > > I bet nobody tried device assignment with pic. It will not work.
> > > > 
> > > > It fails to take irq_lock on automatic EOI because on guest entry 
> > > > interrupts are disabled (see d5ecfdd25c412df9c0ecf3ab8e066461fd4f69df).
> > > This explains why the code is buggy, but does not fix the bug. There are
> > > two bugs at least with the pic + ack notifiers. The first one is that
> > > irq notifiers are called without irq_lock held and that will trigger
> > > WARN_ON(!mutex_is_locked(&kvm->irq_lock)) in kvm_set_irq() in device
> > > assignment case (besides what is the point of a lock if it is not taken
> > > on every code path?).
> > 
> > This could be fixed by switching irq_lock to a spinlock.
> Yes, but then you want to minimize the amount of code that runs under a
> spinlock.
> 
> > 
> > >  Another one is that you can't postpone call to ack notifiers in
> > > kvm_pic_read_irq() till after pic_update_irq(s). The reason is that
> > > pic_update_irq() will trigger acked interrupt immediately since ack
> > > notifier is the one who lowers irq line in device assignment case
> > > (that is the reason I haven't done the same in ioapic case BTW).
> > 
> > I'm not sure i understand this one, can you be more verbose please?
> > 
> 1. device injects level irq by calling kvm_pic_set_irq(1)
> 2. level is recorded in irr (pic_set_irq1())
> 3. pic_update_irq() called and interrupt is sent to vcpu
> 4. vcpu calls kvm_pic_read_irq() which calls pic_update_irq(s) before calling 
>    kvm_notify_acked_irq(). Ack callback is the one who calls kvm_pic_set_irq(0)
>    to lower irq line, so when pic_update_irq(s) is called irr still
>    indicates that irq level is 1 and it is delivered yet again.
> 
> But there is something strange with PIC + ack notifiers. PIC call ACK
> notifiers not when guest acks the interrupt, but when KVM gets the
> interrupt for delivery. Calling ack notifiers at this point is wrong.
> Device assignment ack notifier enables host interrupts, but guest not yet
> had a chance to clear interrupt condition in a device. Or do I miss
> something here?

No, it looks wrong. The ack notification should be inside pic_clear_isr.

And it is not there already because kvm_notify_acked_irq/kvm_vcpu_kick
could not be called under spinlocks.

You still want the behaviour you described above for AEOI mode, but
you should get that for free by moving kvm_notify_acked_irq/vcpu_kick
inside pic_clear_isr.
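
A sketch of that movement (ignoring for the moment whether
kvm_notify_acked_irq() may run under the pic spinlock, which is taken
up below; SELECT_PIC is assumed to map the irq to the right irqchip id,
and the slave pic offset is elided):

    static void pic_clear_isr(struct kvm_kpic_state *s, int irq)
    {
            s->isr &= ~(1 << irq);
            /* Notify at the point the interrupt is really acked, which
             * covers both the normal EOI and the AEOI paths. */
            kvm_notify_acked_irq(s->pics_state->kvm, SELECT_PIC(irq), irq);
    }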

> > > > > irq_lock is indeed used to synchronization ioapic access, but the way
> > > > > it is done requires calling kvm_set_irq() with lock held and this adds
> > > > > unnecessary lock on a msi irq injection path.
> > > > > 
> > > > > > Using RCU to synchronize ack/mask notifier list walk versus list
> > > > > > addition is good, but i'm not entirely sure that smaller locks like you
> > > > > > are proposing is a benefit long term (things become much more tricky).
> > > > > Without removing irq_lock RCU is useless since the list walking is always
> > > > > done with irq_lock held. I see smaller locks as simplification BTW. The
> > > > > thing is tricky now, or so I felt while trying to understand what
> > > > > irq_lock actually protects. With smaller locks with names that clearly
> > > > > indicates which data structure it protects I feel the code is much
> > > > > easy to understand.
> > > > 
> > > > What is worrying is if you need to take both PIC and IOAPIC locks for 
> > > > some operation (then have to worry about lock ordering etc).
> > > > 
> > > Lets not worry about far fetched theoretical cases especially since you
> > > have the same problem now. 
> > 
> > Well the issue is you are defining what can be done in notifier context,
> > so it does matter.
> > 
> Of course, just like now we have to define things that can be done in
> notifier context. It can't sleep for instance since it may be called
> with interrupts disabled.
> 
> > >  What if you'll need to take both pic lock and
> > > irq_lock? Except that this is not so theoretical because we need to
> > > take them both with current code we just don't do it.
> > 
> >     While most accesses to the i8259 are with the kvm mutex taken, the call
> >     to kvm_pic_read_irq() is not.  We can't easily take the kvm mutex there
> >     since the function is called with interrupts disabled.
> > 
> > Convert irq_lock to a spinlock.
> Or apply my patchset :) Which basically converts irq_lock to spinlock
> and pushes it into ioapic.

There are lots of issues here; better do one thing at a time. Please
leave the RCU conversion to the end, after getting rid of the irq_lock
mutex.

> >     Fix by adding a spinlock to the virtual interrupt controller. Since
> >     we can't send an IPI under the spinlock (we also take the same spinlock
> >     in an irq disabled context), we defer the IPI until the spinlock is
> >     released. Similarly, we defer irq ack notifications until after spinlock
> >     release to avoid lock recursion.
> > 
> > Its possible to send IPI's under spinlocks now.
> > 
> That what I was thinking too. There is not need to all this tricks in
> PIC now.
> 
> > Again, i'm not objecting to these changes, but playing devil's advocate.
> > 
> Just make comparison with current state of affairs fair :)

I think I'm convinced (just give some thought to the UNHALT discussion
for timers).

> 
> --
> 			Gleb.


* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-17 15:19               ` Marcelo Tosatti
@ 2009-07-17 17:32                 ` Gleb Natapov
  2009-07-17 20:09                   ` Marcelo Tosatti
  0 siblings, 1 reply; 26+ messages in thread
From: Gleb Natapov @ 2009-07-17 17:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: kvm

On Fri, Jul 17, 2009 at 12:19:58PM -0300, Marcelo Tosatti wrote:
> On Thu, Jul 16, 2009 at 11:49:35PM +0300, Gleb Natapov wrote:
> > On Thu, Jul 16, 2009 at 03:09:30PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Jul 16, 2009 at 10:51:16AM +0300, Gleb Natapov wrote:
> > > > On Wed, Jul 15, 2009 at 06:38:48PM -0300, Marcelo Tosatti wrote:
> > > > > On Wed, Jul 15, 2009 at 11:48:17PM +0300, Gleb Natapov wrote:
> > > > > > > > +		spin_unlock(&ioapic->lock);
> > > > > > > > +		kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
> > > > > > > > +		spin_lock(&ioapic->lock);
> > > > > > > 
> > > > > > > While traversing the IOAPIC pins you drop the lock and acquire it again,
> > > > > > > which is error prone: what if the redirection table is changed in
> > > > > > > between, for example?
> > > > > > > 
> > > > > > Yes, the ack notifiers is a problematic part. The only thing that
> > > > > > current ack notifiers do is set_irq_level(0) so this is not the problem
> > > > > > in practice. I'll see if I can eliminate this dropping of the lock here.
> > > > > 
> > > > > Currently, yes. But take into account that an ack notifier might do 
> > > > > other things than set_irq_level(0).
> > > > > 
> > > > For instance? Ack notifier can do whatever it wants if it does not call
> > > > back into ioapic. It may call to ioapic only by set_irq_level(0) (which
> > > > it does now from device assignment code) or by set_irq_level(1) and this
> > > > does not make sense for level triggered interrupts because not calling
> > > > set_irq_level(1) will have the same effect.
> > > 
> > > Checking whether a given gsi is masked on the
> > > PIC/IOAPIC from the PIC/IOAPIC mask notifiers
> > > (http://www.spinics.net/lists/kvm/msg19093.html), for example. 
> > > 
> > > Do you think that particular case should be handled in a different way?
> > I dislike ack notifiers and think that the should not be used if
> > possible. kvm_set_irq() return value provide you enough info to know it
> > interrupt was delivered or masked so the pit case can (and should) be
> > handled without using irq notifiers at all. See how RTC interrupt
> > coalescing is handled in qemu (it does not handle masking correctly, but
> > only because qemu community enjoys having their code broken).
> 
> There are a few issues that kvm_set_irq return code cannot handle,
> AFAICS (please correct me if i'm wrong, or provide useful points for
> discussion).
> 
> case 1)
> 1. kvm_set_irq, delivered OK (IOW, assume guest has acked the
> interrupt, next_timer_event += timer_period).
> 2. hlt -> kvm_vcpu_block thinks it can sleep, because next injectable
> timer event is the future.
> 3. which might not be the case, since we are not certain the guest 
> has acked the interrupt.
> 
> case 2)
> 1. kvm_set_irq, delivered OK.
> 2. guest masks irq.
> 3. here you do no want to unhalt the vcpu on the next timer event,
> because irq is masked at irqchip level.
> 
> So you need both ack notification and mask notification for proper
> emulation of UNHALT. Could check whether the interrupt is masked on
> injection and avoid UNHALT for case 2), though.
> 
> But to check that, you need to look into both PIC and IOAPIC pins, which 
> can happen like:
> 
> pic_lock
> check if pin is masked
> pic_unlock
> 
> ioapic_lock
> check if pin is masked
> ioapic_unlock
And what if interrupts are blocked by the vcpu? You don't even know
which cpu will receive the interrupt (maybe several?).

> 
> And only then, if interrupt can reach the processor, UNHALT the vcpu.
> This could avoid the need to check for PIC/IOAPIC state inside the PIT /
> other timers mask notifiers (regarding the current locking discussion).
> 
A timer is just a device that generates an interrupt. You don't check
in device emulation code whether the CPU is halted and should be
unhalted, and the timer code shouldn't do that either. The only check
that decides whether a vcpu should be unhalted is in
kvm_arch_vcpu_runnable() and it goes like this:
  if there is an interrupt to inject and interrupts are not masked by
  the vcpu, or there is an nmi to inject, then unhalt the vcpu.
So what the timer code should do is call kvm_set_irq(1), and if an
interrupt (or nmi) is delivered to the vcpu as a result, the check
above will unhalt it.
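
A condensed sketch of that check (the real kvm_arch_vcpu_runnable()
also tests other wakeup conditions, e.g. pending requests or signals;
only the part relevant here is shown):

    int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
    {
            /* Unhalt if an interrupt can actually be taken, or if an
             * NMI is pending (NMIs ignore the interrupt flag). */
            return (kvm_cpu_has_interrupt(vcpu) &&
                    kvm_arch_interrupt_allowed(vcpu)) ||
                   vcpu->arch.nmi_pending;
    }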

If you want to decide in the timer code what should be done to a vcpu
as a result of a timer event, you'll have to reimplement the entire
pic/ioapic/lapic logic inside it. What if the timer interrupt is
configured to send INIT to the vcpu?

Both cases you described are problematic only because you are trying
to make the interrupt code too smart.

> Those checks can race with another processor unmasking the pin for 
> the irqchips, but then its a guest problem.
Sane OSes serialize access to the pic/ioapic. I think we can assume
this in our code.

> 
> You still need the mask notifier to be able to reset the reinjection
> logic on unmask (see 4780c65904f0fc4e312ee2da9383eacbe04e61ea). Is that 
> the case thats broken with RTC in QEMU?
> 
Yes, interrupt masking is broken with RTC in QEMU, but only because
QEMU wanted it like this. In KVM, if kvm_set_irq() returns zero it
means the interrupt was masked, so the reinjection logic can be reset.

> Also, the fact that you receive an ack notification allows you to make 
> decisions of behaviour between injection and ack. You can also measure
> the delay between these two events, which is not used at the moment
> but is potentially useful information.
You can use an ack notifier for this, but it doesn't require calling
back into the pic/ioapic code.

> 
> You do not have that information on AEOI style interrupts (such as timer 
> delivery via PIT->LVT0 in NMI mode or PIC AEOI), but there's nothing 
> you can to help such cases.
> 
> > > (it could check whether a GSI is masked or not on both PIC/IOAPIC/LAPIC
> > > from a different context other than mask notifier).
> > > 
> > > > But I think I found the way to not drop the lock here.
> > > 
> > > See, this is adding more complexity than simplifying things (the
> > > recursive check).
> > This is not more complex than current code. (IMHO)
> 
> Maybe there is a better way to get rid of that recursion check, maybe 
> move the notifiers to be called after internal irqchip state has been
> updated, outside the lock?
> 
Impossible (IOW, I don't see how it could be done). The ack notifier
is the one that drives the line low, so it must be called before the
irr is checked for pending interrupts. Basically the same scenario I
described for the pic in my previous mail.

> > > > > Say for example an ack notifier that takes the PIC or IOAPIC lock 
> > > > > (for whatever reason).
> > > > What reason? No one should take PIC or IOAPIC lock outside of PIC or
> > > > IOAPIC code. This is layering violation. IOAPIC should be accessed only
> > > > through its public interface (set_irq/mmio_read|write/update_eoi)
> > > 
> > > See link above.
> > I can't see how the code from link above can work with current code
> > since mask notifiers are not called from PIC.
> 
> Then we're lacking mask notifier callback in i8259.c.
> > > > > 
> > > > > > > Also, irq_lock was held during ack and mask notifier callbacks, so they
> > > > > > > (the callbacks) did not have to worry about concurrency.
> > > > > > > 
> > > > > > Is it possible to have more then one ack for the same interrupt line at
> > > > > > the same time?
> > > > > 
> > > > > Why not? But maybe this is a somewhat stupid point, because you can
> > > > > require the notifiers to handle that.
> > > > If this is possible (and it looks like it may happen if IOAPIC broadcasts
> > > > an interrupt vector to more then one vcpu) then it may be better to push
> > > > complexity into an ack notifier to not penalize a common scenario.
> > > > But again, if we will not drop the lock around ack notifier call they
> > > > will be serialized.
> > > > 
> > > > > 
> > > > > > > You questioned the purpose of irq_lock earlier, and thats what it is:
> > > > > > > synchronization between pic and ioapic blur at the irq notifiers.
> > > > > > > 
> > > > > > Pic has its own locking and it fails miserably in regards ack notifiers.
> > > > > > I bet nobody tried device assignment with pic. It will not work.
> > > > > 
> > > > > It fails to take irq_lock on automatic EOI because on guest entry 
> > > > > interrupts are disabled (see d5ecfdd25c412df9c0ecf3ab8e066461fd4f69df).
> > > > This explains why the code is buggy, but does not fix the bug. There are
> > > > two bugs at least with the pic + ack notifiers. The first one is that
> > > > irq notifiers are called without irq_lock held and that will trigger
> > > > WARN_ON(!mutex_is_locked(&kvm->irq_lock)) in kvm_set_irq() in device
> > > > assignment case (besides what is the point of a lock if it is not taken
> > > > on every code path?).
> > > 
> > > This could be fixed by switching irq_lock to a spinlock.
> > Yes, but then you want to minimize the amount of code that runs under a
> > spinlock.
> > 
> > > 
> > > >  Another one is that you can't postpone call to ack notifiers in
> > > > kvm_pic_read_irq() till after pic_update_irq(s). The reason is that
> > > > pic_update_irq() will trigger acked interrupt immediately since ack
> > > > notifier is the one who lowers irq line in device assignment case
> > > > (that is the reason I haven't done the same in ioapic case BTW).
> > > 
> > > I'm not sure i understand this one, can you be more verbose please?
> > > 
> > 1. device injects level irq by calling kvm_pic_set_irq(1)
> > 2. level is recorded in irr (pic_set_irq1())
> > 3. pic_update_irq() called and interrupt is sent to vcpu
> > 4. vcpu calls kvm_pic_read_irq() which calls pic_update_irq(s) before calling 
> >    kvm_notify_acked_irq(). Ack callback is the one who calls kvm_pic_set_irq(0)
> >    to lower irq line, so when pic_update_irq(s) is called irr still
> >    indicates that irq level is 1 and it is delivered yet again.
> > 
> > But there is something strange with PIC + ack notifiers. PIC call ACK
> > notifiers not when guest acks the interrupt, but when KVM gets the
> > interrupt for delivery. Calling ack notifiers at this point is wrong.
> > Device assignment ack notifier enables host interrupts, but guest not yet
> > had a chance to clear interrupt condition in a device. Or do I miss
> > something here?
> 
> No, it looks wrong. The ack notification should be inside pic_clear_isr.
> 
So you agree with me that the current code is completely broken? The
"no" at the beginning confuses me :) Otherwise I agree with you that
the ack notifiers should be inside pic_clear_isr().

> And it is not there already because kvm_notify_acked_irq/kvm_vcpu_kick
> could not be called under spinlocks.
Why can't kvm_notify_acked_irq() be called under spinlocks?
kvm_vcpu_kick() can be called under spinlocks now.

> 
> You still want the behaviour you described above for AEOI mode, but
> you should get that for free by moving kvm_notify_acked_irq/vcpu_kick
> inside pic_clear_isr.
> 
Agree.

> > > > > > irq_lock is indeed used to synchronization ioapic access, but the way
> > > > > > it is done requires calling kvm_set_irq() with lock held and this adds
> > > > > > unnecessary lock on a msi irq injection path.
> > > > > > 
> > > > > > > Using RCU to synchronize ack/mask notifier list walk versus list
> > > > > > > addition is good, but i'm not entirely sure that smaller locks like you
> > > > > > > are proposing is a benefit long term (things become much more tricky).
> > > > > > Without removing irq_lock RCU is useless since the list walking is always
> > > > > > done with irq_lock held. I see smaller locks as simplification BTW. The
> > > > > > thing is tricky now, or so I felt while trying to understand what
> > > > > > irq_lock actually protects. With smaller locks with names that clearly
> > > > > > indicates which data structure it protects I feel the code is much
> > > > > > easy to understand.
> > > > > 
> > > > > What is worrying is if you need to take both PIC and IOAPIC locks for 
> > > > > some operation (then have to worry about lock ordering etc).
> > > > > 
> > > > Lets not worry about far fetched theoretical cases especially since you
> > > > have the same problem now. 
> > > 
> > > Well the issue is you are defining what can be done in notifier context,
> > > so it does matter.
> > > 
> > Of course, just like now we have to define things that can be done in
> > notifier context. It can't sleep for instance since it may be called
> > with interrupts disabled.
> > 
> > > >  What if you'll need to take both pic lock and
> > > > irq_lock? Except that this is not so theoretical because we need to
> > > > take them both with current code we just don't do it.
> > > 
> > >     While most accesses to the i8259 are with the kvm mutex taken, the call
> > >     to kvm_pic_read_irq() is not.  We can't easily take the kvm mutex there
> > >     since the function is called with interrupts disabled.
> > > 
> > > Convert irq_lock to a spinlock.
> > Or apply my patchset :) Which basically converts irq_lock to spinlock
> > and pushes it into ioapic.
> 
> There's lots of issues here, better do one thing at a time. Please leave
> the RCU convertion to the end after getting rid of irq_lock mutex.
> 
This will create intermediate commits that have more locks on the
interrupt injection path than we have now (one for irq routing, one for
the notifiers and one for the ioapic). Maybe not a big deal.

> > >     Fix by adding a spinlock to the virtual interrupt controller. Since
> > >     we can't send an IPI under the spinlock (we also take the same spinlock
> > >     in an irq disabled context), we defer the IPI until the spinlock is
> > >     released. Similarly, we defer irq ack notifications until after spinlock
> > >     release to avoid lock recursion.
> > > 
> > > Its possible to send IPI's under spinlocks now.
> > > 
> > That what I was thinking too. There is not need to all this tricks in
> > PIC now.
> > 
> > > Again, i'm not objecting to these changes, but playing devil's advocate.
> > > 
> > Just make comparison with current state of affairs fair :)
> 
> I think i'm convinced (just give some thought on the UNHALT discussion
> for timers).
> 
> > 
> > --
> > 			Gleb.

--
			Gleb.


* Re: [PATCH 08/10] Move IO APIC to its own lock.
  2009-07-17 17:32                 ` Gleb Natapov
@ 2009-07-17 20:09                   ` Marcelo Tosatti
  0 siblings, 0 replies; 26+ messages in thread
From: Marcelo Tosatti @ 2009-07-17 20:09 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: kvm

On Fri, Jul 17, 2009 at 08:32:00PM +0300, Gleb Natapov wrote:
> > There are a few issues that kvm_set_irq return code cannot handle,
> > AFAICS (please correct me if i'm wrong, or provide useful points for
> > discussion).
> > 
> > case 1)
> > 1. kvm_set_irq, delivered OK (IOW, assume guest has acked the
> > interrupt, next_timer_event += timer_period).
> > 2. hlt -> kvm_vcpu_block thinks it can sleep, because next injectable
> > timer event is the future.
> > 3. which might not be the case, since we are not certain the guest 
> > has acked the interrupt.
> > 
> > case 2)
> > 1. kvm_set_irq, delivered OK.
> > 2. guest masks irq.
> > 3. here you do no want to unhalt the vcpu on the next timer event,
> > because irq is masked at irqchip level.
> > 
> > So you need both ack notification and mask notification for proper
> > emulation of UNHALT. Could check whether the interrupt is masked on
> > injection and avoid UNHALT for case 2), though.
> > 
> > But to check that, you need to look into both PIC and IOAPIC pins, which 
> > can happen like:
> > 
> > pic_lock
> > check if pin is masked
> > pic_unlock
> > 
> > ioapic_lock
> > check if pin is masked
> > ioapic_unlock
> And what if interrupts are blocked by vcpu? You don't event know what
> cpu will receive the interrupt (may be several?).

Timers get bound to vcpus. If the IOAPIC is programmed to deliver to
multiple vcpus, you bind one timer instance to each target vcpu. So it
will need "target vcpu notifiers" (good point).

> > And only then, if interrupt can reach the processor, UNHALT the vcpu.
> > This could avoid the need to check for PIC/IOAPIC state inside the PIT /
> > other timers mask notifiers (regarding the current locking discussion).
> > 
> Timer is just a device that generates interrupt. You don't check in
> a device emulation code if CPU is halted and should be unhalted and
> timer code shouldn't do that either. The only check that decide if vcpu
> should be unhalted is in kvm_arch_vcpu_runnable() and it goes like this:
>   if there is interrupt to inject and interrupt are not masked by vcpu or
>   there is nmi to inject then unhalt vcpu.
> So what the timer code should do is to call kvm_set_irq(1) and if
> interrupt (or nmi) will be delivered to vcpu as a result of this the
> above check will unhalt vcpu.

OK, we can use kvm_arch_vcpu_runnable() in the comparison.

> If you want to decide in timer code what should be done to vcpu as a
> result of a timer event you'll have to reimplement entire pic/ioapic/lapic
> logic inside it. What if timer interrupt configured to send INIT to vcpu?

It'll go through kvm_set_irq() (except for the LAPIC timer).

> Both cases you described are problematic only because you are trying to
> make interrupt code too smart.
> 
> > Those checks can race with another processor unmasking the pin for 
> > the irqchips, but then its a guest problem.
> Sane OSse serialize access to pic/ioapic. I think we can assume this in
> our code.
> 
> > 
> > You still need the mask notifier to be able to reset the reinjection
> > logic on unmask (see 4780c65904f0fc4e312ee2da9383eacbe04e61ea). Is that 
> > the case thats broken with RTC in QEMU?
> > 
> Yes, interrupt masking is broken with RTC in QEMU, but only because QEMU
> wanted it like this. In KVM if kvm_set_irq() returns zero it means that
> interrupt was masked so reinjection logic can be reset.
> 
> > Also, the fact that you receive an ack notification allows you to make 
> > decisions of behaviour between injection and ack. You can also measure
> > the delay between these two events, which is not used at the moment
> > but is potentially useful information.
> You can use ack notifier for this, but it doesn't require to call back
> into pic/ioapic code.

Agreed.

> > 
> > You do not have that information on AEOI style interrupts (such as timer 
> > delivery via PIT->LVT0 in NMI mode or PIC AEOI), but there's nothing 
> > you can to help such cases.
> > 
> > > > (it could check whether a GSI is masked or not on both PIC/IOAPIC/LAPIC
> > > > from a different context other than mask notifier).
> > > > 
> > > > > But I think I found the way to not drop the lock here.
> > > > 
> > > > See, this is adding more complexity than simplifying things (the
> > > > recursive check).
> > > This is not more complex than current code. (IMHO)
> > 
> > Maybe there is a better way to get rid of that recursion check, maybe 
> > move the notifiers to be called after internal irqchip state has been
> > updated, outside the lock?
> > 
> Impressible (IOW I don't see how it could be done). The ack notifier is
> the one who drives line low, so it should be called before irr is
> checked for pending interrupts. Basically the same scenario I described
> for pic in my previous mail.

Flargunbastic. You're right.

> > > had a chance to clear interrupt condition in a device. Or do I miss
> > > something here?
> > 
> > No, it looks wrong. The ack notification should be inside pic_clear_isr.
> > 
> So you agree with me that the current code is completely broken? The
> "no" at the beginning confuses me :) 

Yes, it's completely broken.

> Otherwise I agree with you the ack notifiers should be inside
> pic_clear_isr().
> 
> > And it is not there already because kvm_notify_acked_irq/kvm_vcpu_kick
> > could not be called under spinlocks.
> Why kvm_notify_acked_irq() cannot be called under spinlocks?
> kvm_vcpu_kick can be called under spinlocks now.

Agreed.



end of thread

Thread overview: 26+ messages
2009-07-14 14:30 [PATCH 0/10] [RFC] more fine grained locking for IRQ injection Gleb Natapov
2009-07-14 14:30 ` [PATCH 01/10] Move irq routing data structure to rcu locking Gleb Natapov
2009-07-14 14:30 ` [PATCH 02/10] Unregister ack notifier callback on PIT freeing Gleb Natapov
2009-07-14 14:30 ` [PATCH 03/10] Move irq ack notifier list to arch independent code Gleb Natapov
2009-07-14 14:30 ` [PATCH 04/10] Convert irq notifiers lists to RCU locking Gleb Natapov
2009-07-14 14:30 ` [PATCH 05/10] Protect irq_sources_bitmap by kvm->lock instead of kvm->irq_lock Gleb Natapov
2009-07-14 14:30 ` [PATCH 06/10] Move irq routing to its own locking Gleb Natapov
2009-07-14 14:30 ` [PATCH 07/10] Move irq notifiers lists " Gleb Natapov
2009-07-14 14:30 ` [PATCH 08/10] Move IO APIC to its own lock Gleb Natapov
2009-07-15 17:57   ` Marcelo Tosatti
2009-07-15 20:48     ` Gleb Natapov
2009-07-15 21:38       ` Marcelo Tosatti
2009-07-16  7:51         ` Gleb Natapov
2009-07-16 18:09           ` Marcelo Tosatti
2009-07-16 20:49             ` Gleb Natapov
2009-07-17 15:19               ` Marcelo Tosatti
2009-07-17 17:32                 ` Gleb Natapov
2009-07-17 20:09                   ` Marcelo Tosatti
2009-07-14 14:30 ` [PATCH 09/10] Drop kvm->irq_lock lock Gleb Natapov
2009-07-14 14:30 ` [PATCH 10/10] Change irq routing table to use gsi indexed array Gleb Natapov
2009-07-15 18:18   ` Marcelo Tosatti
2009-07-15 20:52     ` Gleb Natapov
2009-07-15 21:42       ` Marcelo Tosatti
2009-07-16  6:05         ` Gleb Natapov
2009-07-15 19:17   ` Michael S. Tsirkin
2009-07-15 20:48     ` Gleb Natapov
