[PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough
@ 2010-11-08 11:21 Jan Kiszka
  2010-11-08 11:21 ` [PATCH v4 1/9] KVM: Fix srcu struct leakage Jan Kiszka
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

Nine patches (yeah, it's getting more and more) to improve "classic"
device assigment /wrt IRQs. Highlight is the last one that resolves the
host IRQ sharing issue for all PCI 2.3 devices. Quite essential when
passing non-MSI-ready devices like many USB host controllers.

As there were concerns regarding the overhead of IRQ masking via the PCI
config space, I did some micro-benchmarks. Well, the concerns are valid:

disable_irq_nosync:           ~600 cycles
pci_2_3_irq_check_and_mask:  ~6000 cycles (EHCI)
                            ~22000 cycles (AR9287, with peaks >100000)

Specifically the varying impact of the device like in the Atheros case
is worrying (this device is actually known to cause horrible latencies
to the host, but who knows what other devices do). So I decided to go
with PCI-2.3 masking as default off in the to-be-sent qemu-kvm patch.
Maybe something to consider vor VFIO as well.

Changes in v4:
 - Allow user space to push its INTx mask state to the kernel and
   respect this on IRQ masking and delivery
 - Specify IRQF_ONESHOT for threaded IRQ of assigned device
 - Switch IRQ subsystem locking from RCU to SRCU
 - Fix for SRCU struct leakage
 - Small cleanup for kvm_vm_ioctl_assigned_device
 - Documentation for device assignment API (please check carefully if I
   got it right)

Changes in v3:
 - Save/restore PCI device state across guest usage
 - Make PCI-2.3-based IRQ sharing configurable by user space
 - Fix potential nesting of pci_block_user_cfg_access
 - Do not track PCI-level IRQ masking in software, user space may change
   it concurrently
 - Optimize kvm_assigned_dev_ack_irq for the case another IRQ is already
   pending
 - Cleanups according to review comments

Changes in v2:
 - Reworked IRQ forwarding path to use threaded IRQs (direct signalling
   from IRQ context does not work out of the box and may be too lengthy)
 - Refactored host IRQ naming of assigned devices (cosmetic change)
 - Avoid unmask on ack when the next IRQ is pending, rather reassert the
   guest line (PCI-2.3 patch)
 - Refactored PCI-2.3 patch (but still no control knob for shared mode -
   is that a must?)

Jan Kiszka (9):
  KVM: Fix srcu struct leakage
  KVM: Switch IRQ subsystem to SRCU
  KVM: Clear assigned guest IRQ on release
  KVM: Switch assigned device IRQ forwarding to threaded handler
  KVM: Refactor IRQ names of assigned devices
  KVM: Save/restore state of assigned PCI device
  KVM: Clean up kvm_vm_ioctl_assigned_device
  KVM: Document device assigment API
  KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

 Documentation/kvm/api.txt |  203 ++++++++++++++++++++++++
 arch/x86/kvm/x86.c        |    1 +
 include/linux/kvm.h       |    6 +
 include/linux/kvm_host.h  |   16 +--
 virt/kvm/assigned-dev.c   |  383 ++++++++++++++++++++++++++++++++++----------
 virt/kvm/irq_comm.c       |   31 ++--
 virt/kvm/kvm_main.c       |   20 ++-
 7 files changed, 541 insertions(+), 119 deletions(-)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v4 1/9] KVM: Fix srcu struct leakage
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-08 17:00   ` Michael S. Tsirkin
  2010-11-08 11:21 ` [PATCH v4 2/9] KVM: Switch IRQ subsystem to SRCU Jan Kiszka
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

Clean up the srcu struct on vm destruction and refactor its release on
early errors.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 virt/kvm/kvm_main.c |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4111a4b..6cfcde7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -401,23 +401,19 @@ static struct kvm *kvm_create_vm(void)
 	r = -ENOMEM;
 	kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!kvm->memslots)
-		goto out_err;
+		goto out_err_nosrcu;
 	if (init_srcu_struct(&kvm->srcu))
-		goto out_err;
+		goto out_err_nosrcu;
 	for (i = 0; i < KVM_NR_BUSES; i++) {
 		kvm->buses[i] = kzalloc(sizeof(struct kvm_io_bus),
 					GFP_KERNEL);
-		if (!kvm->buses[i]) {
-			cleanup_srcu_struct(&kvm->srcu);
+		if (!kvm->buses[i])
 			goto out_err;
-		}
 	}
 
 	r = kvm_init_mmu_notifier(kvm);
-	if (r) {
-		cleanup_srcu_struct(&kvm->srcu);
+	if (r)
 		goto out_err;
-	}
 
 	kvm->mm = current->mm;
 	atomic_inc(&kvm->mm->mm_count);
@@ -435,6 +431,8 @@ out:
 	return kvm;
 
 out_err:
+	cleanup_srcu_struct(&kvm->srcu);
+out_err_nosrcu:
 	hardware_disable_all();
 out_err_nodisable:
 	for (i = 0; i < KVM_NR_BUSES; i++)
@@ -513,6 +511,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 #else
 	kvm_arch_flush_shadow(kvm);
 #endif
+	cleanup_srcu_struct(&kvm->srcu);
 	kvm_arch_destroy_vm(kvm);
 	hardware_disable_all();
 	mmdrop(mm);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 2/9] KVM: Switch IRQ subsystem to SRCU
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
  2010-11-08 11:21 ` [PATCH v4 1/9] KVM: Fix srcu struct leakage Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-09 10:49   ` Avi Kivity
  2010-11-08 11:21 ` [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release Jan Kiszka
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

Ack and mask notifiers typically call back into kvm_set_irq, thus may
iterate over all VCPUs of a VM. Better keep this path preemptible to
prevent that user-space can massivle influence scheduling latencies. Use
sleepable RCU for the protection of irq_routing and the notfier lists.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 include/linux/kvm_host.h |    1 +
 virt/kvm/irq_comm.c      |   31 +++++++++++++++++--------------
 virt/kvm/kvm_main.c      |    5 +++++
 3 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bcf71c7..83bf8e1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -238,6 +238,7 @@ struct kvm {
 #endif
 
 	struct mutex irq_lock;
+	struct srcu_struct irq_srcu;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
 	struct kvm_irq_routing_table __rcu *irq_routing;
 	struct hlist_head mask_notifier_list;
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 8edca91..2331587 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -150,6 +150,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	int ret = -1, i = 0;
 	struct kvm_irq_routing_table *irq_rt;
 	struct hlist_node *n;
+	int idx;
 
 	trace_kvm_set_irq(irq, level, irq_source_id);
 
@@ -157,12 +158,12 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	 * IOAPIC.  So set the bit in both. The guest will ignore
 	 * writes to the unused one.
 	 */
-	rcu_read_lock();
-	irq_rt = rcu_dereference(kvm->irq_routing);
+	idx = srcu_read_lock(&kvm->irq_srcu);
+	irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
 	if (irq < irq_rt->nr_rt_entries)
 		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
 			irq_set[i++] = *e;
-	rcu_read_unlock();
+	srcu_read_unlock(&kvm->irq_srcu, idx);
 
 	while(i--) {
 		int r;
@@ -180,18 +181,19 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
 	struct hlist_node *n;
-	int gsi;
+	int gsi, idx;
 
 	trace_kvm_ack_irq(irqchip, pin);
 
-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&kvm->irq_srcu);
+	gsi = srcu_dereference(kvm->irq_routing,
+			       &kvm->irq_srcu)->chip[irqchip][pin];
 	if (gsi != -1)
 		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list,
 					 link)
 			if (kian->gsi == gsi)
 				kian->irq_acked(kian);
-	rcu_read_unlock();
+	srcu_read_unlock(&kvm->irq_srcu, idx);
 }
 
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
@@ -208,7 +210,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
 	mutex_lock(&kvm->irq_lock);
 	hlist_del_init_rcu(&kian->link);
 	mutex_unlock(&kvm->irq_lock);
-	synchronize_rcu();
+	synchronize_srcu(&kvm->irq_srcu);
 }
 
 int kvm_request_irq_source_id(struct kvm *kvm)
@@ -276,7 +278,7 @@ void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
 	mutex_lock(&kvm->irq_lock);
 	hlist_del_rcu(&kimn->link);
 	mutex_unlock(&kvm->irq_lock);
-	synchronize_rcu();
+	synchronize_srcu(&kvm->irq_srcu);
 }
 
 void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
@@ -284,15 +286,16 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
 {
 	struct kvm_irq_mask_notifier *kimn;
 	struct hlist_node *n;
-	int gsi;
+	int gsi, idx;
 
-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&kvm->irq_srcu);
+	gsi = srcu_dereference(kvm->irq_routing,
+			       &kvm->irq_srcu)->chip[irqchip][pin];
 	if (gsi != -1)
 		hlist_for_each_entry_rcu(kimn, n, &kvm->mask_notifier_list, link)
 			if (kimn->irq == gsi)
 				kimn->func(kimn, mask);
-	rcu_read_unlock();
+	srcu_read_unlock(&kvm->irq_srcu, idx);
 }
 
 void kvm_free_irq_routing(struct kvm *kvm)
@@ -411,7 +414,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
 	old = kvm->irq_routing;
 	rcu_assign_pointer(kvm->irq_routing, new);
 	mutex_unlock(&kvm->irq_lock);
-	synchronize_rcu();
+	synchronize_srcu(&kvm->irq_srcu);
 
 	new = old;
 	r = 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6cfcde7..b0ac2c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -404,6 +404,8 @@ static struct kvm *kvm_create_vm(void)
 		goto out_err_nosrcu;
 	if (init_srcu_struct(&kvm->srcu))
 		goto out_err_nosrcu;
+	if (init_srcu_struct(&kvm->irq_srcu))
+		goto out_err_noirqsrcu;
 	for (i = 0; i < KVM_NR_BUSES; i++) {
 		kvm->buses[i] = kzalloc(sizeof(struct kvm_io_bus),
 					GFP_KERNEL);
@@ -431,6 +433,8 @@ out:
 	return kvm;
 
 out_err:
+	cleanup_srcu_struct(&kvm->irq_srcu);
+out_err_noirqsrcu:
 	cleanup_srcu_struct(&kvm->srcu);
 out_err_nosrcu:
 	hardware_disable_all();
@@ -511,6 +515,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 #else
 	kvm_arch_flush_shadow(kvm);
 #endif
+	cleanup_srcu_struct(&kvm->irq_srcu);
 	cleanup_srcu_struct(&kvm->srcu);
 	kvm_arch_destroy_vm(kvm);
 	hardware_disable_all();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
  2010-11-08 11:21 ` [PATCH v4 1/9] KVM: Fix srcu struct leakage Jan Kiszka
  2010-11-08 11:21 ` [PATCH v4 2/9] KVM: Switch IRQ subsystem to SRCU Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-09 10:58   ` Avi Kivity
  2010-11-08 11:21 ` [PATCH v4 4/9] KVM: Switch assigned device IRQ forwarding to threaded handler Jan Kiszka
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

When we deassign a guest IRQ, clear the potentially asserted guest line.
There might be no chance for the guest to do this, specifically if we
switch from INTx to MSI mode.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 virt/kvm/assigned-dev.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7c98928..ecc4419 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -141,6 +141,9 @@ static void deassign_guest_irq(struct kvm *kvm,
 	kvm_unregister_irq_ack_notifier(kvm, &assigned_dev->ack_notifier);
 	assigned_dev->ack_notifier.gsi = -1;
 
+	kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+		    assigned_dev->guest_irq, 0);
+
 	if (assigned_dev->irq_source_id != -1)
 		kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
 	assigned_dev->irq_source_id = -1;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 4/9] KVM: Switch assigned device IRQ forwarding to threaded handler
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
                   ` (2 preceding siblings ...)
  2010-11-08 11:21 ` [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-09 12:26   ` Avi Kivity
  2010-11-08 11:21 ` [PATCH v4 5/9] KVM: Refactor IRQ names of assigned devices Jan Kiszka
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

This improves the IRQ forwarding for assigned devices: By using the
kernel's threaded IRQ scheme, we can get rid of the latency-prone work
queue and simplify the code in the same run.

Moreover, we no longer have to hold assigned_dev_lock while raising the
guest IRQ, which can be a lenghty operation as we may have to iterate
over all VCPUs. The lock is now only used for synchronizing masking vs.
unmasking of INTx-type IRQs, thus is renames to intx_lock.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 include/linux/kvm_host.h |   12 +----
 virt/kvm/assigned-dev.c  |  107 +++++++++++++++-------------------------------
 2 files changed, 36 insertions(+), 83 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 83bf8e1..050674f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -458,16 +458,8 @@ struct kvm_irq_ack_notifier {
 	void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
 };
 
-#define KVM_ASSIGNED_MSIX_PENDING		0x1
-struct kvm_guest_msix_entry {
-	u32 vector;
-	u16 entry;
-	u16 flags;
-};
-
 struct kvm_assigned_dev_kernel {
 	struct kvm_irq_ack_notifier ack_notifier;
-	struct work_struct interrupt_work;
 	struct list_head list;
 	int assigned_dev_id;
 	int host_segnr;
@@ -478,13 +470,13 @@ struct kvm_assigned_dev_kernel {
 	bool host_irq_disabled;
 	struct msix_entry *host_msix_entries;
 	int guest_irq;
-	struct kvm_guest_msix_entry *guest_msix_entries;
+	struct msix_entry *guest_msix_entries;
 	unsigned long irq_requested_type;
 	int irq_source_id;
 	int flags;
 	struct pci_dev *dev;
 	struct kvm *kvm;
-	spinlock_t assigned_dev_lock;
+	spinlock_t intx_lock;
 };
 
 struct kvm_irq_mask_notifier {
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index ecc4419..1d77ce1 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -55,58 +55,31 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
 	return index;
 }
 
-static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
+static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
 {
-	struct kvm_assigned_dev_kernel *assigned_dev;
-	int i;
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	u32 vector;
+	int index;
 
-	assigned_dev = container_of(work, struct kvm_assigned_dev_kernel,
-				    interrupt_work);
+	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
+		spin_lock(&assigned_dev->intx_lock);
+		disable_irq_nosync(irq);
+		assigned_dev->host_irq_disabled = true;
+		spin_unlock(&assigned_dev->intx_lock);
+	}
 
-	spin_lock_irq(&assigned_dev->assigned_dev_lock);
 	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
-		struct kvm_guest_msix_entry *guest_entries =
-			assigned_dev->guest_msix_entries;
-		for (i = 0; i < assigned_dev->entries_nr; i++) {
-			if (!(guest_entries[i].flags &
-					KVM_ASSIGNED_MSIX_PENDING))
-				continue;
-			guest_entries[i].flags &= ~KVM_ASSIGNED_MSIX_PENDING;
+		index = find_index_from_host_irq(assigned_dev, irq);
+		if (index >= 0) {
+			vector = assigned_dev->
+					guest_msix_entries[index].vector;
 			kvm_set_irq(assigned_dev->kvm,
-				    assigned_dev->irq_source_id,
-				    guest_entries[i].vector, 1);
+				    assigned_dev->irq_source_id, vector, 1);
 		}
 	} else
 		kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
 			    assigned_dev->guest_irq, 1);
 
-	spin_unlock_irq(&assigned_dev->assigned_dev_lock);
-}
-
-static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
-{
-	unsigned long flags;
-	struct kvm_assigned_dev_kernel *assigned_dev =
-		(struct kvm_assigned_dev_kernel *) dev_id;
-
-	spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
-	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
-		int index = find_index_from_host_irq(assigned_dev, irq);
-		if (index < 0)
-			goto out;
-		assigned_dev->guest_msix_entries[index].flags |=
-			KVM_ASSIGNED_MSIX_PENDING;
-	}
-
-	schedule_work(&assigned_dev->interrupt_work);
-
-	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_GUEST_INTX) {
-		disable_irq_nosync(irq);
-		assigned_dev->host_irq_disabled = true;
-	}
-
-out:
-	spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
 	return IRQ_HANDLED;
 }
 
@@ -114,7 +87,6 @@ out:
 static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
 {
 	struct kvm_assigned_dev_kernel *dev;
-	unsigned long flags;
 
 	if (kian->gsi == -1)
 		return;
@@ -127,12 +99,12 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
 	/* The guest irq may be shared so this ack may be
 	 * from another device.
 	 */
-	spin_lock_irqsave(&dev->assigned_dev_lock, flags);
+	spin_lock(&dev->intx_lock);
 	if (dev->host_irq_disabled) {
 		enable_irq(dev->host_irq);
 		dev->host_irq_disabled = false;
 	}
-	spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
+	spin_unlock(&dev->intx_lock);
 }
 
 static void deassign_guest_irq(struct kvm *kvm,
@@ -155,28 +127,19 @@ static void deassign_host_irq(struct kvm *kvm,
 			      struct kvm_assigned_dev_kernel *assigned_dev)
 {
 	/*
-	 * In kvm_free_device_irq, cancel_work_sync return true if:
-	 * 1. work is scheduled, and then cancelled.
-	 * 2. work callback is executed.
-	 *
-	 * The first one ensured that the irq is disabled and no more events
-	 * would happen. But for the second one, the irq may be enabled (e.g.
-	 * for MSI). So we disable irq here to prevent further events.
+	 * We disable irq here to prevent further events.
 	 *
 	 * Notice this maybe result in nested disable if the interrupt type is
 	 * INTx, but it's OK for we are going to free it.
 	 *
 	 * If this function is a part of VM destroy, please ensure that till
 	 * now, the kvm state is still legal for probably we also have to wait
-	 * interrupt_work done.
+	 * on a currently running IRQ handler.
 	 */
 	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
 		int i;
 		for (i = 0; i < assigned_dev->entries_nr; i++)
-			disable_irq_nosync(assigned_dev->
-					   host_msix_entries[i].vector);
-
-		cancel_work_sync(&assigned_dev->interrupt_work);
+			disable_irq(assigned_dev->host_msix_entries[i].vector);
 
 		for (i = 0; i < assigned_dev->entries_nr; i++)
 			free_irq(assigned_dev->host_msix_entries[i].vector,
@@ -188,8 +151,7 @@ static void deassign_host_irq(struct kvm *kvm,
 		pci_disable_msix(assigned_dev->dev);
 	} else {
 		/* Deal with MSI and INTx */
-		disable_irq_nosync(assigned_dev->host_irq);
-		cancel_work_sync(&assigned_dev->interrupt_work);
+		disable_irq(assigned_dev->host_irq);
 
 		free_irq(assigned_dev->host_irq, (void *)assigned_dev);
 
@@ -268,8 +230,9 @@ static int assigned_device_enable_host_intx(struct kvm *kvm,
 	 * on the same interrupt line is not a happy situation: there
 	 * are going to be long delays in accepting, acking, etc.
 	 */
-	if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
-			0, "kvm_assigned_intx_device", (void *)dev))
+	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
+				 IRQF_ONESHOT, "kvm_assigned_intx_device",
+				 (void *)dev))
 		return -EIO;
 	return 0;
 }
@@ -287,8 +250,8 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
 	}
 
 	dev->host_irq = dev->dev->irq;
-	if (request_irq(dev->host_irq, kvm_assigned_dev_intr, 0,
-			"kvm_assigned_msi_device", (void *)dev)) {
+	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
+				 0, "kvm_assigned_msi_device", (void *)dev)) {
 		pci_disable_msi(dev->dev);
 		return -EIO;
 	}
@@ -313,10 +276,10 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
 		return r;
 
 	for (i = 0; i < dev->entries_nr; i++) {
-		r = request_irq(dev->host_msix_entries[i].vector,
-				kvm_assigned_dev_intr, 0,
-				"kvm_assigned_msix_device",
-				(void *)dev);
+		r = request_threaded_irq(dev->host_msix_entries[i].vector,
+					 NULL, kvm_assigned_dev_thread,
+					 0, "kvm_assigned_msix_device",
+					 (void *)dev);
 		if (r)
 			goto err;
 	}
@@ -557,12 +520,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 	match->host_devfn = assigned_dev->devfn;
 	match->flags = assigned_dev->flags;
 	match->dev = dev;
-	spin_lock_init(&match->assigned_dev_lock);
+	spin_lock_init(&match->intx_lock);
 	match->irq_source_id = -1;
 	match->kvm = kvm;
 	match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
-	INIT_WORK(&match->interrupt_work,
-		  kvm_assigned_dev_interrupt_work_handler);
 
 	list_add(&match->list, &kvm->arch.assigned_dev_head);
 
@@ -654,9 +615,9 @@ static int kvm_vm_ioctl_set_msix_nr(struct kvm *kvm,
 			r = -ENOMEM;
 			goto msix_nr_out;
 		}
-		adev->guest_msix_entries = kzalloc(
-				sizeof(struct kvm_guest_msix_entry) *
-				entry_nr->entry_nr, GFP_KERNEL);
+		adev->guest_msix_entries =
+			kzalloc(sizeof(struct msix_entry) * entry_nr->entry_nr,
+				GFP_KERNEL);
 		if (!adev->guest_msix_entries) {
 			kfree(adev->host_msix_entries);
 			r = -ENOMEM;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 5/9] KVM: Refactor IRQ names of assigned devices
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
                   ` (3 preceding siblings ...)
  2010-11-08 11:21 ` [PATCH v4 4/9] KVM: Switch assigned device IRQ forwarding to threaded handler Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-08 11:21 ` [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device Jan Kiszka
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

Cosmetic change, but it helps to correlate IRQs with PCI devices.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 include/linux/kvm_host.h |    1 +
 virt/kvm/assigned-dev.c  |   11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 050674f..fe83eb0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -477,6 +477,7 @@ struct kvm_assigned_dev_kernel {
 	struct pci_dev *dev;
 	struct kvm *kvm;
 	spinlock_t intx_lock;
+	char irq_name[32];
 };
 
 struct kvm_irq_mask_notifier {
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 1d77ce1..7623408 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -231,8 +231,7 @@ static int assigned_device_enable_host_intx(struct kvm *kvm,
 	 * are going to be long delays in accepting, acking, etc.
 	 */
 	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
-				 IRQF_ONESHOT, "kvm_assigned_intx_device",
-				 (void *)dev))
+				 IRQF_ONESHOT, dev->irq_name, (void *)dev))
 		return -EIO;
 	return 0;
 }
@@ -251,7 +250,7 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
 
 	dev->host_irq = dev->dev->irq;
 	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
-				 0, "kvm_assigned_msi_device", (void *)dev)) {
+				 0, dev->irq_name, (void *)dev)) {
 		pci_disable_msi(dev->dev);
 		return -EIO;
 	}
@@ -278,8 +277,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
 	for (i = 0; i < dev->entries_nr; i++) {
 		r = request_threaded_irq(dev->host_msix_entries[i].vector,
 					 NULL, kvm_assigned_dev_thread,
-					 0, "kvm_assigned_msix_device",
-					 (void *)dev);
+					 0, dev->irq_name, (void *)dev);
 		if (r)
 			goto err;
 	}
@@ -336,6 +334,9 @@ static int assign_host_irq(struct kvm *kvm,
 	if (dev->irq_requested_type & KVM_DEV_IRQ_HOST_MASK)
 		return r;
 
+	snprintf(dev->irq_name, sizeof(dev->irq_name), "kvm:%s",
+		 pci_name(dev->dev));
+
 	switch (host_irq_type) {
 	case KVM_DEV_IRQ_HOST_INTX:
 		r = assigned_device_enable_host_intx(kvm, dev);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
                   ` (4 preceding siblings ...)
  2010-11-08 11:21 ` [PATCH v4 5/9] KVM: Refactor IRQ names of assigned devices Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-09 12:35   ` Avi Kivity
  2010-11-08 11:21 ` [PATCH v4 7/9] KVM: Clean up kvm_vm_ioctl_assigned_device Jan Kiszka
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

The guest may change states that pci_reset_function does not touch. So
we better save/restore the assigned device across guest usage.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 virt/kvm/assigned-dev.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7623408..d389207 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -197,7 +197,8 @@ static void kvm_free_assigned_device(struct kvm *kvm,
 {
 	kvm_free_assigned_irq(kvm, assigned_dev);
 
-	pci_reset_function(assigned_dev->dev);
+	__pci_reset_function(assigned_dev->dev);
+	pci_restore_state(assigned_dev->dev);
 
 	pci_release_regions(assigned_dev->dev);
 	pci_disable_device(assigned_dev->dev);
@@ -514,6 +515,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 	}
 
 	pci_reset_function(dev);
+	pci_save_state(dev);
 
 	match->assigned_dev_id = assigned_dev->assigned_dev_id;
 	match->host_segnr = assigned_dev->segnr;
@@ -544,6 +546,7 @@ out:
 	mutex_unlock(&kvm->lock);
 	return r;
 out_list_del:
+	pci_restore_state(dev);
 	list_del(&match->list);
 	pci_release_regions(dev);
 out_disable:
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 7/9] KVM: Clean up kvm_vm_ioctl_assigned_device
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
                   ` (5 preceding siblings ...)
  2010-11-08 11:21 ` [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-08 11:21 ` [PATCH v4 8/9] KVM: Document device assigment API Jan Kiszka
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

Any arch not supporting device assigment will also not build
assigned-dev.c. So testing for KVM_CAP_DEVICE_DEASSIGNMENT is pointless.
KVM_CAP_ASSIGN_DEV_IRQ is unconditinally set. Moreover, add a default
case for dispatching the ioctl.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 virt/kvm/assigned-dev.c |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index d389207..ae72ae6 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -674,7 +674,7 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 				  unsigned long arg)
 {
 	void __user *argp = (void __user *)arg;
-	int r = -ENOTTY;
+	int r;
 
 	switch (ioctl) {
 	case KVM_ASSIGN_PCI_DEVICE: {
@@ -692,7 +692,6 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 		r = -EOPNOTSUPP;
 		break;
 	}
-#ifdef KVM_CAP_ASSIGN_DEV_IRQ
 	case KVM_ASSIGN_DEV_IRQ: {
 		struct kvm_assigned_irq assigned_irq;
 
@@ -715,8 +714,6 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 			goto out;
 		break;
 	}
-#endif
-#ifdef KVM_CAP_DEVICE_DEASSIGNMENT
 	case KVM_DEASSIGN_PCI_DEVICE: {
 		struct kvm_assigned_pci_dev assigned_dev;
 
@@ -728,7 +725,6 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 			goto out;
 		break;
 	}
-#endif
 #ifdef KVM_CAP_IRQ_ROUTING
 	case KVM_SET_GSI_ROUTING: {
 		struct kvm_irq_routing routing;
@@ -781,6 +777,9 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 		break;
 	}
 #endif
+	default:
+		r = -ENOTTY;
+		break;
 	}
 out:
 	return r;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 8/9] KVM: Document device assigment API
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
                   ` (6 preceding siblings ...)
  2010-11-08 11:21 ` [PATCH v4 7/9] KVM: Clean up kvm_vm_ioctl_assigned_device Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-08 11:21 ` [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
  2010-11-16 16:55 ` [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Marcelo Tosatti
  9 siblings, 0 replies; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

Adds API documentation for KVM_[DE]ASSIGN_PCI_DEVICE,
KVM_[DE]ASSIGN_DEV_IRQ, KVM_SET_GSI_ROUTING, KVM_ASSIGN_SET_MSIX_NR, and
KVM_ASSIGN_SET_MSIX_ENTRY.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 Documentation/kvm/api.txt |  178 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 178 insertions(+), 0 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index b336266..e1a9297 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1085,6 +1085,184 @@ of 4 instructions that make up a hypercall.
 If any additional field gets added to this structure later on, a bit for that
 additional piece of information will be set in the flags bitmap.
 
+4.47 KVM_ASSIGN_PCI_DEVICE
+
+Capability: KVM_CAP_DEVICE_ASSIGNMENT
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Assigns a host PCI device to the VM.
+
+struct kvm_assigned_pci_dev {
+	__u32 assigned_dev_id;
+	__u32 busnr;
+	__u32 devfn;
+	__u32 flags;
+	__u32 segnr;
+	union {
+		__u32 reserved[11];
+	};
+};
+
+The PCI device is specified by the triple segnr, busnr, and devfn.
+Identification in succeeding service requests is done via assigned_dev_id. The
+following flags are specified:
+
+/* Depends on KVM_CAP_IOMMU */
+#define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+
+4.48 KVM_DEASSIGN_PCI_DEVICE
+
+Capability: KVM_CAP_DEVICE_DEASSIGNMENT
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Ends PCI device assignment, releasing all associated resources.
+
+See KVM_CAP_DEVICE_ASSIGNMENT for the data structure. Only assigned_dev_id is
+used in kvm_assigned_pci_dev to identify the device.
+
+4.49 KVM_ASSIGN_DEV_IRQ
+
+Capability: KVM_CAP_ASSIGN_DEV_IRQ
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_irq (in)
+Returns: 0 on success, -1 on error
+
+Assigns an IRQ to a passed-through device.
+
+struct kvm_assigned_irq {
+	__u32 assigned_dev_id;
+	__u32 host_irq;
+	__u32 guest_irq;
+	__u32 flags;
+	union {
+		struct {
+			__u32 addr_lo;
+			__u32 addr_hi;
+			__u32 data;
+		} guest_msi;
+		__u32 reserved[12];
+	};
+};
+
+The following flags are defined:
+
+#define KVM_DEV_IRQ_HOST_INTX    (1 << 0)
+#define KVM_DEV_IRQ_HOST_MSI     (1 << 1)
+#define KVM_DEV_IRQ_HOST_MSIX    (1 << 2)
+
+#define KVM_DEV_IRQ_GUEST_INTX   (1 << 8)
+#define KVM_DEV_IRQ_GUEST_MSI    (1 << 9)
+#define KVM_DEV_IRQ_GUEST_MSIX   (1 << 10)
+
+It is not valid to specify multiple types per host or guest IRQ. However, the
+IRQ type of host and guest can differ or can even be null.
+
+4.50 KVM_DEASSIGN_DEV_IRQ
+
+Capability: KVM_CAP_ASSIGN_DEV_IRQ
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_irq (in)
+Returns: 0 on success, -1 on error
+
+Ends an IRQ assignment to a passed-through device.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
+by assigned_dev_id, flags must correspond to the IRQ type specified on
+KVM_ASSIGN_DEV_IRQ. Partial deassignment of host or guest IRQ is allowed.
+
+4.51 KVM_SET_GSI_ROUTING
+
+Capability: KVM_CAP_IRQ_ROUTING
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_irq_routing (in)
+Returns: 0 on success, -1 on error
+
+Sets the GSI routing table entries, overwriting any previously set entries.
+
+struct kvm_irq_routing {
+	__u32 nr;
+	__u32 flags;
+	struct kvm_irq_routing_entry entries[0];
+};
+
+No flags are specified so far, the corresponding field must be set to zero.
+
+struct kvm_irq_routing_entry {
+	__u32 gsi;
+	__u32 type;
+	__u32 flags;
+	__u32 pad;
+	union {
+		struct kvm_irq_routing_irqchip irqchip;
+		struct kvm_irq_routing_msi msi;
+		__u32 pad[8];
+	} u;
+};
+
+/* gsi routing entry types */
+#define KVM_IRQ_ROUTING_IRQCHIP 1
+#define KVM_IRQ_ROUTING_MSI 2
+
+No flags are specified so far, the corresponding field must be set to zero.
+
+struct kvm_irq_routing_irqchip {
+	__u32 irqchip;
+	__u32 pin;
+};
+
+struct kvm_irq_routing_msi {
+	__u32 address_lo;
+	__u32 address_hi;
+	__u32 data;
+	__u32 pad;
+};
+
+4.52 KVM_ASSIGN_SET_MSIX_NR
+
+Capability: KVM_CAP_DEVICE_MSIX
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_nr (in)
+Returns: 0 on success, -1 on error
+
+Set the number of MSI-X interrupts for an assigned device. This service can
+only be called once in the lifetime of an assigned device.
+
+struct kvm_assigned_msix_nr {
+	__u32 assigned_dev_id;
+	__u16 entry_nr;
+	__u16 padding;
+};
+
+#define KVM_MAX_MSIX_PER_DEV		256
+
+4.53 KVM_ASSIGN_SET_MSIX_ENTRY
+
+Capability: KVM_CAP_DEVICE_MSIX
+Architectures: x86 ia64
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_entry (in)
+Returns: 0 on success, -1 on error
+
+Specifies the routing of an MSI-X assigned device interrupt to a GSI. Setting
+the GSI vector to zero means disabling the interrupt.
+
+struct kvm_assigned_msix_entry {
+	__u32 assigned_dev_id;
+	__u32 gsi;
+	__u16 entry; /* The index of entry in the MSI-X table */
+	__u16 padding[3];
+};
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
                   ` (7 preceding siblings ...)
  2010-11-08 11:21 ` [PATCH v4 8/9] KVM: Document device assigment API Jan Kiszka
@ 2010-11-08 11:21 ` Jan Kiszka
  2010-11-09 13:27   ` Avi Kivity
  2010-11-16 16:55 ` [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Marcelo Tosatti
  9 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 11:21 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin

PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share IRQs of such devices between on the host side when
passing them to a guest. This feature is optional, user space has to
request it explicitly. Moreover, user space can inform us about its view
of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt
and signaling it if the guest masked it via the PCI config space.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 Documentation/kvm/api.txt |   25 ++++
 arch/x86/kvm/x86.c        |    1 +
 include/linux/kvm.h       |    6 +
 include/linux/kvm_host.h  |    2 +
 virt/kvm/assigned-dev.c   |  288 +++++++++++++++++++++++++++++++++++++++++----
 5 files changed, 299 insertions(+), 23 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index e1a9297..dbb126c 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1112,6 +1112,14 @@ following flags are specified:
 
 /* Depends on KVM_CAP_IOMMU */
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, thus enable IRQ sharing with other
+assigned devices or host devices. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
 
 4.48 KVM_DEASSIGN_PCI_DEVICE
 
@@ -1263,6 +1271,23 @@ struct kvm_assigned_msix_entry {
 	__u16 padding[3];
 };
 
+5.54 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Informs the kernel about the guest's view on the INTx mask. As long as the
+guest masks the legacy INTx, the kernel will refrain from unmasking it at
+hardware level and will not assert the guest's IRQ line. User space is still
+responsible for applying this state to the assigned device's real config space.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
+by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
+evaluated.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2044302..ed1b417 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1966,6 +1966,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ASYNC_PF:
+	case KVM_CAP_PCI_2_3:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..3cadb42 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -541,6 +541,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_PCI_2_3 60
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -677,6 +678,9 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO	  _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
+/* Available with KVM_CAP_PCI_2_3 */
+#define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa2, \
+				       struct kvm_assigned_pci_dev)
 
 /*
  * ioctls for vcpu fds
@@ -742,6 +746,8 @@ struct kvm_clock_data {
 #define KVM_SET_XCRS		  _IOW(KVMIO,  0xa7, struct kvm_xcrs)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+#define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
 
 struct kvm_assigned_pci_dev {
 	__u32 assigned_dev_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fe83eb0..7f1627c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -468,6 +468,7 @@ struct kvm_assigned_dev_kernel {
 	unsigned int entries_nr;
 	int host_irq;
 	bool host_irq_disabled;
+	bool pci_2_3;
 	struct msix_entry *host_msix_entries;
 	int guest_irq;
 	struct msix_entry *guest_msix_entries;
@@ -477,6 +478,7 @@ struct kvm_assigned_dev_kernel {
 	struct pci_dev *dev;
 	struct kvm *kvm;
 	spinlock_t intx_lock;
+	struct mutex intx_mask_lock;
 	char irq_name[32];
 };
 
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index ae72ae6..a9aab1b 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -55,17 +55,105 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
 	return index;
 }
 
+static bool
+pci_2_3_set_irq_mask(struct pci_dev *dev, bool mask, bool check_status)
+{
+	u32 cmd_status_dword;
+	u16 origcmd, newcmd;
+	bool mask_updated = true;
+
+	/*
+	 * We do a single dword read to retrieve both command and status.
+	 * Document assumptions that make this possible.
+	 */
+	BUILD_BUG_ON(PCI_COMMAND % 4);
+	BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
+
+	pci_block_user_cfg_access(dev);
+
+	/*
+	 * Read both command and status registers in a single 32-bit operation.
+	 * Note: we could cache the value for command and move the status read
+	 * out of the lock if there was a way to get notified of user changes
+	 * to command register through sysfs. Should be good for shared irqs.
+	 */
+	pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
+
+	if (check_status) {
+		bool irq_pending =
+			(cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
+
+		/*
+		 * Check interrupt status register to see whether our device
+		 * triggered the interrupt (when masking) or the next IRQ is
+		 * already pending (when unmasking).
+		 */
+		if (mask != irq_pending) {
+			mask_updated = false;
+			goto done;
+		}
+	}
+
+	origcmd = cmd_status_dword;
+	newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
+	if (mask)
+		newcmd |= PCI_COMMAND_INTX_DISABLE;
+	if (newcmd != origcmd)
+		pci_write_config_word(dev, PCI_COMMAND, newcmd);
+
+done:
+	pci_unblock_user_cfg_access(dev);
+	return mask_updated;
+}
+
+static void pci_2_3_irq_mask(struct pci_dev *dev)
+{
+	pci_2_3_set_irq_mask(dev, true, false);
+}
+
+static bool pci_2_3_irq_check_and_mask(struct pci_dev *dev)
+{
+	return pci_2_3_set_irq_mask(dev, true, true);
+}
+
+static void pci_2_3_irq_unmask(struct pci_dev *dev)
+{
+	pci_2_3_set_irq_mask(dev, false, false);
+}
+
+static bool pci_2_3_irq_check_and_unmask(struct pci_dev *dev)
+{
+	return pci_2_3_set_irq_mask(dev, false, true);
+}
+
+static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int ret;
+
+	spin_lock(&assigned_dev->intx_lock);
+	if (pci_2_3_irq_check_and_mask(assigned_dev->dev)) {
+		assigned_dev->host_irq_disabled = true;
+		ret = IRQ_WAKE_THREAD;
+	} else
+		ret =IRQ_NONE;
+	spin_unlock(&assigned_dev->intx_lock);
+
+	return ret;
+}
+
 static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
 	u32 vector;
 	int index;
 
-	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
-		spin_lock(&assigned_dev->intx_lock);
+	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX &&
+	    !(assigned_dev->flags & KVM_DEV_ASSIGN_PCI_2_3)) {
+		spin_lock_irq(&assigned_dev->intx_lock);
 		disable_irq_nosync(irq);
 		assigned_dev->host_irq_disabled = true;
-		spin_unlock(&assigned_dev->intx_lock);
+		spin_unlock_irq(&assigned_dev->intx_lock);
 	}
 
 	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
@@ -76,9 +164,17 @@ static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
 			kvm_set_irq(assigned_dev->kvm,
 				    assigned_dev->irq_source_id, vector, 1);
 		}
-	} else
+	} else if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) {
 		kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
 			    assigned_dev->guest_irq, 1);
+	} else {
+		mutex_lock(&assigned_dev->intx_mask_lock);
+		if (!(assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX))
+			kvm_set_irq(assigned_dev->kvm,
+				    assigned_dev->irq_source_id,
+				    assigned_dev->guest_irq, 1);
+		mutex_unlock(&assigned_dev->intx_mask_lock);
+	}
 
 	return IRQ_HANDLED;
 }
@@ -96,15 +192,34 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
 
 	kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
 
-	/* The guest irq may be shared so this ack may be
-	 * from another device.
-	 */
-	spin_lock(&dev->intx_lock);
-	if (dev->host_irq_disabled) {
-		enable_irq(dev->host_irq);
-		dev->host_irq_disabled = false;
+	if (likely(!(dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX)))
+		return;
+
+	mutex_lock(&dev->intx_mask_lock);
+
+	if (!(dev->flags & KVM_DEV_ASSIGN_MASK_INTX)) {
+		bool reassert = false;
+
+		spin_lock_irq(&dev->intx_lock);
+		/*
+		 * The guest IRQ may be shared so this ack can come from an
+		 * IRQ for another guest device.
+		 */
+		if (dev->host_irq_disabled) {
+			if (!(dev->flags & KVM_DEV_ASSIGN_PCI_2_3))
+				enable_irq(dev->host_irq);
+			else if (!pci_2_3_irq_check_and_unmask(dev->dev))
+				reassert = true;
+			dev->host_irq_disabled = reassert;
+		}
+		spin_unlock_irq(&dev->intx_lock);
+
+		if (reassert)
+			kvm_set_irq(dev->kvm, dev->irq_source_id,
+				    dev->guest_irq, 1);
 	}
-	spin_unlock(&dev->intx_lock);
+
+	mutex_unlock(&dev->intx_mask_lock);
 }
 
 static void deassign_guest_irq(struct kvm *kvm,
@@ -151,7 +266,13 @@ static void deassign_host_irq(struct kvm *kvm,
 		pci_disable_msix(assigned_dev->dev);
 	} else {
 		/* Deal with MSI and INTx */
-		disable_irq(assigned_dev->host_irq);
+		if (assigned_dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
+			spin_lock_irq(&assigned_dev->intx_lock);
+			pci_2_3_irq_mask(assigned_dev->dev);
+			spin_unlock_irq(&assigned_dev->intx_lock);
+			synchronize_irq(assigned_dev->host_irq);
+		} else
+			disable_irq(assigned_dev->host_irq);
 
 		free_irq(assigned_dev->host_irq, (void *)assigned_dev);
 
@@ -225,15 +346,34 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
 static int assigned_device_enable_host_intx(struct kvm *kvm,
 					    struct kvm_assigned_dev_kernel *dev)
 {
+	irq_handler_t irq_handler;
+	unsigned long flags;
+
 	dev->host_irq = dev->dev->irq;
-	/* Even though this is PCI, we don't want to use shared
-	 * interrupts. Sharing host devices with guest-assigned devices
-	 * on the same interrupt line is not a happy situation: there
-	 * are going to be long delays in accepting, acking, etc.
+
+	/*
+	 * We can only share the IRQ line with other host devices if we are
+	 * able to disable the IRQ source at device-level - independently of
+	 * the guest driver. Otherwise host devices may suffer from unbounded
+	 * IRQ latencies when the guest keeps the line asserted.
 	 */
-	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
-				 IRQF_ONESHOT, dev->irq_name, (void *)dev))
+	if (dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
+		irq_handler = kvm_assigned_dev_intr;
+		flags = IRQF_SHARED;
+	} else {
+		irq_handler = NULL;
+		flags = IRQF_ONESHOT;
+	}
+	if (request_threaded_irq(dev->host_irq, irq_handler,
+				 kvm_assigned_dev_thread, flags,
+				 dev->irq_name, (void *)dev))
 		return -EIO;
+
+	if (dev->flags & KVM_DEV_ASSIGN_PCI_2_3) {
+		spin_lock_irq(&dev->intx_lock);
+		pci_2_3_irq_unmask(dev->dev);
+		spin_unlock_irq(&dev->intx_lock);
+	}
 	return 0;
 }
 
@@ -309,7 +449,6 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
 {
 	dev->guest_irq = irq->guest_irq;
 	dev->ack_notifier.gsi = -1;
-	dev->host_irq_disabled = false;
 	return 0;
 }
 #endif
@@ -321,7 +460,6 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
 {
 	dev->guest_irq = irq->guest_irq;
 	dev->ack_notifier.gsi = -1;
-	dev->host_irq_disabled = false;
 	return 0;
 }
 #endif
@@ -355,6 +493,7 @@ static int assign_host_irq(struct kvm *kvm,
 	default:
 		r = -EINVAL;
 	}
+	dev->host_irq_disabled = false;
 
 	if (!r)
 		dev->irq_requested_type |= host_irq_type;
@@ -455,6 +594,7 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
 {
 	int r = -ENODEV;
 	struct kvm_assigned_dev_kernel *match;
+	unsigned long irq_type;
 
 	mutex_lock(&kvm->lock);
 
@@ -463,12 +603,55 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
 	if (!match)
 		goto out;
 
-	r = kvm_deassign_irq(kvm, match, assigned_irq->flags);
+	irq_type = assigned_irq->flags & (KVM_DEV_IRQ_HOST_MASK |
+					  KVM_DEV_IRQ_GUEST_MASK);
+	r = kvm_deassign_irq(kvm, match, irq_type);
 out:
 	mutex_unlock(&kvm->lock);
 	return r;
 }
 
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2.
+ */
+static bool pci_2_3_supported(struct pci_dev *pdev)
+{
+	bool supported = false;
+	u16 orig, new;
+
+	pci_block_user_cfg_access(pdev);
+	pci_read_config_word(pdev, PCI_COMMAND, &orig);
+	pci_write_config_word(pdev, PCI_COMMAND,
+			      orig ^ PCI_COMMAND_INTX_DISABLE);
+	pci_read_config_word(pdev, PCI_COMMAND, &new);
+
+	/*
+	 * There's no way to protect against
+	 * hardware bugs or detect them reliably, but as long as we know
+	 * what the value should be, let's go ahead and check it.
+	 */
+	if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+		dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+			"driver or HW bug?\n", orig, new);
+		goto out;
+	}
+	if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+		dev_warn(&pdev->dev, "Device does not support "
+			 "disabling interrupts: unable to bind.\n");
+		goto out;
+	}
+	supported = true;
+
+	/* Now restore the original value. */
+	pci_write_config_word(pdev, PCI_COMMAND, orig);
+
+out:
+	pci_unblock_user_cfg_access(pdev);
+	return supported;
+}
+
 static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 				      struct kvm_assigned_pci_dev *assigned_dev)
 {
@@ -517,6 +700,9 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 	pci_reset_function(dev);
 	pci_save_state(dev);
 
+	if (!pci_2_3_supported(dev))
+		assigned_dev->flags &= ~KVM_DEV_ASSIGN_PCI_2_3;
+
 	match->assigned_dev_id = assigned_dev->assigned_dev_id;
 	match->host_segnr = assigned_dev->segnr;
 	match->host_busnr = assigned_dev->busnr;
@@ -524,6 +710,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 	match->flags = assigned_dev->flags;
 	match->dev = dev;
 	spin_lock_init(&match->intx_lock);
+	mutex_init(&match->intx_mask_lock);
 	match->irq_source_id = -1;
 	match->kvm = kvm;
 	match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
@@ -670,6 +857,53 @@ msix_entry_out:
 }
 #endif
 
+static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm,
+		struct kvm_assigned_pci_dev *assigned_dev)
+{
+	int r = 0;
+	struct kvm_assigned_dev_kernel *match;
+
+	mutex_lock(&kvm->lock);
+
+	match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+				      assigned_dev->assigned_dev_id);
+	if (!match) {
+		r = -ENODEV;
+		goto out;
+	}
+
+	mutex_lock(&match->intx_mask_lock);
+
+	match->flags &= ~KVM_DEV_ASSIGN_MASK_INTX;
+	match->flags |= assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX;
+
+	if (assigned_dev->flags & KVM_DEV_ASSIGN_MASK_INTX) {
+		kvm_set_irq(match->kvm, match->irq_source_id,
+			    match->guest_irq, 0);
+		/*
+		 * Masking at hardware-level is performed on demand, i.e. when
+		 * an IRQ actually arrives at the host.
+		 */
+	} else {
+		/*
+		 * Unmask the IRQ line. It may have been masked meanwhile if
+		 * we aren't using PCI 2.3 INTx masking on the host side.
+		 */
+		spin_lock_irq(&match->intx_lock);
+		if (match->host_irq_disabled) {
+			enable_irq(match->host_irq);
+			match->host_irq_disabled = false;
+		}
+		spin_unlock_irq(&match->intx_lock);
+	}
+
+	mutex_unlock(&match->intx_mask_lock);
+
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
 long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 				  unsigned long arg)
 {
@@ -777,6 +1011,15 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 		break;
 	}
 #endif
+	case KVM_ASSIGN_SET_INTX_MASK: {
+		struct kvm_assigned_pci_dev assigned_dev;
+
+		r = -EFAULT;
+		if (copy_from_user(&assigned_dev, argp, sizeof assigned_dev))
+			goto out;
+		r = kvm_vm_ioctl_set_pci_irq_mask(kvm, &assigned_dev);
+		break;
+	}
 	default:
 		r = -ENOTTY;
 		break;
@@ -784,4 +1027,3 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 out:
 	return r;
 }
-
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] KVM: Fix srcu struct leakage
  2010-11-08 11:21 ` [PATCH v4 1/9] KVM: Fix srcu struct leakage Jan Kiszka
@ 2010-11-08 17:00   ` Michael S. Tsirkin
  2010-11-08 17:32     ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-11-08 17:00 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, Marcelo Tosatti, kvm, Alex Williamson

On Mon, Nov 08, 2010 at 12:21:45PM +0100, Jan Kiszka wrote:
> Clean up the srcu struct on vm destruction and refactor its release on
> early errors.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

Yay, I suspected something's wrong with error handling in srcu.

Acked-by: Michael S. Tsirkin <mst@redhat.com>

This one actually has nothing to do with assignment,
and it looks like it's needed for -stable and 2.6.37.
Avi/Marcelo?

> ---
>  virt/kvm/kvm_main.c |   15 +++++++--------
>  1 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4111a4b..6cfcde7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -401,23 +401,19 @@ static struct kvm *kvm_create_vm(void)
>  	r = -ENOMEM;
>  	kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
>  	if (!kvm->memslots)
> -		goto out_err;
> +		goto out_err_nosrcu;
>  	if (init_srcu_struct(&kvm->srcu))
> -		goto out_err;
> +		goto out_err_nosrcu;
>  	for (i = 0; i < KVM_NR_BUSES; i++) {
>  		kvm->buses[i] = kzalloc(sizeof(struct kvm_io_bus),
>  					GFP_KERNEL);
> -		if (!kvm->buses[i]) {
> -			cleanup_srcu_struct(&kvm->srcu);
> +		if (!kvm->buses[i])
>  			goto out_err;
> -		}
>  	}
>  
>  	r = kvm_init_mmu_notifier(kvm);
> -	if (r) {
> -		cleanup_srcu_struct(&kvm->srcu);
> +	if (r)
>  		goto out_err;
> -	}
>  
>  	kvm->mm = current->mm;
>  	atomic_inc(&kvm->mm->mm_count);
> @@ -435,6 +431,8 @@ out:
>  	return kvm;
>  
>  out_err:
> +	cleanup_srcu_struct(&kvm->srcu);
> +out_err_nosrcu:
>  	hardware_disable_all();
>  out_err_nodisable:
>  	for (i = 0; i < KVM_NR_BUSES; i++)
> @@ -513,6 +511,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>  #else
>  	kvm_arch_flush_shadow(kvm);
>  #endif
> +	cleanup_srcu_struct(&kvm->srcu);
>  	kvm_arch_destroy_vm(kvm);
>  	hardware_disable_all();
>  	mmdrop(mm);
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 1/9] KVM: Fix srcu struct leakage
  2010-11-08 17:00   ` Michael S. Tsirkin
@ 2010-11-08 17:32     ` Jan Kiszka
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kiszka @ 2010-11-08 17:32 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm, Alex Williamson

Am 08.11.2010 18:00, Michael S. Tsirkin wrote:
> On Mon, Nov 08, 2010 at 12:21:45PM +0100, Jan Kiszka wrote:
>> Clean up the srcu struct on vm destruction and refactor its release on
>> early errors.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> 
> Yay, I suspected something's wrong with error handling in srcu.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> This one actually has nothing to do with assignment,

Yes, it is just mechanically needed (and I found it while hacking in
more SRCU).

Jan

> and it looks like it's needed for -stable and 2.6.37.
> Avi/Marcelo?
> 
>> ---
>>  virt/kvm/kvm_main.c |   15 +++++++--------
>>  1 files changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 4111a4b..6cfcde7 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -401,23 +401,19 @@ static struct kvm *kvm_create_vm(void)
>>  	r = -ENOMEM;
>>  	kvm->memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
>>  	if (!kvm->memslots)
>> -		goto out_err;
>> +		goto out_err_nosrcu;
>>  	if (init_srcu_struct(&kvm->srcu))
>> -		goto out_err;
>> +		goto out_err_nosrcu;
>>  	for (i = 0; i < KVM_NR_BUSES; i++) {
>>  		kvm->buses[i] = kzalloc(sizeof(struct kvm_io_bus),
>>  					GFP_KERNEL);
>> -		if (!kvm->buses[i]) {
>> -			cleanup_srcu_struct(&kvm->srcu);
>> +		if (!kvm->buses[i])
>>  			goto out_err;
>> -		}
>>  	}
>>  
>>  	r = kvm_init_mmu_notifier(kvm);
>> -	if (r) {
>> -		cleanup_srcu_struct(&kvm->srcu);
>> +	if (r)
>>  		goto out_err;
>> -	}
>>  
>>  	kvm->mm = current->mm;
>>  	atomic_inc(&kvm->mm->mm_count);
>> @@ -435,6 +431,8 @@ out:
>>  	return kvm;
>>  
>>  out_err:
>> +	cleanup_srcu_struct(&kvm->srcu);
>> +out_err_nosrcu:
>>  	hardware_disable_all();
>>  out_err_nodisable:
>>  	for (i = 0; i < KVM_NR_BUSES; i++)
>> @@ -513,6 +511,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
>>  #else
>>  	kvm_arch_flush_shadow(kvm);
>>  #endif
>> +	cleanup_srcu_struct(&kvm->srcu);
>>  	kvm_arch_destroy_vm(kvm);
>>  	hardware_disable_all();
>>  	mmdrop(mm);
>> -- 
>> 1.7.1

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 2/9] KVM: Switch IRQ subsystem to SRCU
  2010-11-08 11:21 ` [PATCH v4 2/9] KVM: Switch IRQ subsystem to SRCU Jan Kiszka
@ 2010-11-09 10:49   ` Avi Kivity
  2010-11-09 11:21     ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 10:49 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/08/2010 01:21 PM, Jan Kiszka wrote:
> Ack and mask notifiers typically call back into kvm_set_irq, thus may
> iterate over all VCPUs of a VM. Better keep this path preemptible to
> prevent that user-space can massivle influence scheduling latencies. Use
> sleepable RCU for the protection of irq_routing and the notfier lists.
>

What about preemptible RCU, now in mainline?

/*
  * Tree-preemptable RCU implementation for rcu_read_lock().
  * Just increment ->rcu_read_lock_nesting, shared state will be updated
  * if we block.
  */
void __rcu_read_lock(void)
{
     current->rcu_read_lock_nesting++;
     barrier();  /* needed if we ever invoke rcu_read_lock in rcutree.c */
}
EXPORT_SYMBOL_GPL(__rcu_read_lock);

(wow, quite clever in its simplicity)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release
  2010-11-08 11:21 ` [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release Jan Kiszka
@ 2010-11-09 10:58   ` Avi Kivity
  2010-11-09 11:20     ` Jan Kiszka
  2010-11-09 18:36     ` Alex Williamson
  0 siblings, 2 replies; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 10:58 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/08/2010 01:21 PM, Jan Kiszka wrote:
> When we deassign a guest IRQ, clear the potentially asserted guest line.
> There might be no chance for the guest to do this, specifically if we
> switch from INTx to MSI mode.
>
> Signed-off-by: Jan Kiszka<jan.kiszka@siemens.com>
> ---
>   virt/kvm/assigned-dev.c |    3 +++
>   1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 7c98928..ecc4419 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -141,6 +141,9 @@ static void deassign_guest_irq(struct kvm *kvm,
>   	kvm_unregister_irq_ack_notifier(kvm,&assigned_dev->ack_notifier);
>   	assigned_dev->ack_notifier.gsi = -1;
>
> +	kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> +		    assigned_dev->guest_irq, 0);
> +
>   	if (assigned_dev->irq_source_id != -1)
>   		kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
>   	assigned_dev->irq_source_id = -1;

I guess this can't hurt.  Did you see it happen in practice?

Note: all this kvm_set_irq(..., [01]) is incorrect as it doesn't account 
for polarity.  Currently the qemu-emulated chipset uses level high pci 
interrupts, but that's not a given from kvm's point of view.

I think vfio fixes this by only routing msi interrupts via the kernel, 
and routing level interrupts through userspace, which can adjust 
polarity.  Alex/Michael?

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release
  2010-11-09 10:58   ` Avi Kivity
@ 2010-11-09 11:20     ` Jan Kiszka
  2010-11-09 18:36     ` Alex Williamson
  1 sibling, 0 replies; 35+ messages in thread
From: Jan Kiszka @ 2010-11-09 11:20 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

Am 09.11.2010 11:58, Avi Kivity wrote:
> On 11/08/2010 01:21 PM, Jan Kiszka wrote:
>> When we deassign a guest IRQ, clear the potentially asserted guest line.
>> There might be no chance for the guest to do this, specifically if we
>> switch from INTx to MSI mode.
>>
>> Signed-off-by: Jan Kiszka<jan.kiszka@siemens.com>
>> ---
>>   virt/kvm/assigned-dev.c |    3 +++
>>   1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
>> index 7c98928..ecc4419 100644
>> --- a/virt/kvm/assigned-dev.c
>> +++ b/virt/kvm/assigned-dev.c
>> @@ -141,6 +141,9 @@ static void deassign_guest_irq(struct kvm *kvm,
>>   	kvm_unregister_irq_ack_notifier(kvm,&assigned_dev->ack_notifier);
>>   	assigned_dev->ack_notifier.gsi = -1;
>>
>> +	kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> +		    assigned_dev->guest_irq, 0);
>> +
>>   	if (assigned_dev->irq_source_id != -1)
>>   		kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
>>   	assigned_dev->irq_source_id = -1;
> 
> I guess this can't hurt.  Did you see it happen in practice?

Yes, with prefer_msi=off and an e1000e-driven NIC: The legacy IRQ stayed
asserted when the guest actually enabled MSI.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 2/9] KVM: Switch IRQ subsystem to SRCU
  2010-11-09 10:49   ` Avi Kivity
@ 2010-11-09 11:21     ` Jan Kiszka
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kiszka @ 2010-11-09 11:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

Am 09.11.2010 11:49, Avi Kivity wrote:
> On 11/08/2010 01:21 PM, Jan Kiszka wrote:
>> Ack and mask notifiers typically call back into kvm_set_irq, thus may
>> iterate over all VCPUs of a VM. Better keep this path preemptible to
>> prevent that user-space can massivle influence scheduling latencies. Use
>> sleepable RCU for the protection of irq_routing and the notfier lists.
>>
> 
> What about preemptible RCU, now in mainline?

Granted, that would resolve the preemption issue but not allow to use
sleepy locks inside this section (as done by the last patch in this series).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 4/9] KVM: Switch assigned device IRQ forwarding to threaded handler
  2010-11-08 11:21 ` [PATCH v4 4/9] KVM: Switch assigned device IRQ forwarding to threaded handler Jan Kiszka
@ 2010-11-09 12:26   ` Avi Kivity
  2010-11-09 12:36     ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 12:26 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/08/2010 01:21 PM, Jan Kiszka wrote:
> This improves the IRQ forwarding for assigned devices: By using the
> kernel's threaded IRQ scheme, we can get rid of the latency-prone work
> queue and simplify the code in the same run.
>
> Moreover, we no longer have to hold assigned_dev_lock while raising the
> guest IRQ, which can be a lenghty operation as we may have to iterate
> over all VCPUs. The lock is now only used for synchronizing masking vs.
> unmasking of INTx-type IRQs, thus is renames to intx_lock.

Nice stuff.

>
> -#define KVM_ASSIGNED_MSIX_PENDING		0x1
> -struct kvm_guest_msix_entry {
> -	u32 vector;
> -	u16 entry;
> -	u16 flags;
> -};
> -

Ok, so .flags was used to handle the delay between the interrupt handler 
and the work queue, which is now done automatically by threaded 
interrupts.  Good.

> @@ -313,10 +276,10 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
>   		return r;
>
>   	for (i = 0; i<  dev->entries_nr; i++) {
> -		r = request_irq(dev->host_msix_entries[i].vector,
> -				kvm_assigned_dev_intr, 0,
> -				"kvm_assigned_msix_device",
> -				(void *)dev);
> +		r = request_threaded_irq(dev->host_msix_entries[i].vector,
> +					 NULL, kvm_assigned_dev_thread,
> +					 0, "kvm_assigned_msix_device",
> +					 (void *)dev);
>   		if (r)
>   			goto err;
>   	}


Should eventually be done from interrupt context.  msix delivery only 
needs a routing table lookup and smp reschedule IPI, which can be done 
from interrupt context.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device
  2010-11-08 11:21 ` [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device Jan Kiszka
@ 2010-11-09 12:35   ` Avi Kivity
  2010-11-09 13:29     ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 12:35 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/08/2010 01:21 PM, Jan Kiszka wrote:
> The guest may change states that pci_reset_function does not touch. So
> we better save/restore the assigned device across guest usage.

Why do we care?  Shouldn't the next user reset the state to its taste?  
Or are you worried about information leakage?
> @@ -514,6 +515,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>   	}
>
>   	pci_reset_function(dev);
> +	pci_save_state(dev);
>
>   	match->assigned_dev_id = assigned_dev->assigned_dev_id;
>   	match->host_segnr = assigned_dev->segnr;
> @@ -544,6 +546,7 @@ out:
>   	mutex_unlock(&kvm->lock);
>   	return r;
>   out_list_del:
> +	pci_restore_state(dev);
>   	list_del(&match->list);
>   	pci_release_regions(dev);
>   out_disable:

This assumes no one else calls pci_save_state()/pci_restore_state() 
while the guest is running.  Is this true?  What about suspend/resume? 
(can this even work without guest assistance?)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 4/9] KVM: Switch assigned device IRQ forwarding to threaded handler
  2010-11-09 12:26   ` Avi Kivity
@ 2010-11-09 12:36     ` Jan Kiszka
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kiszka @ 2010-11-09 12:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

Am 09.11.2010 13:26, Avi Kivity wrote:
> On 11/08/2010 01:21 PM, Jan Kiszka wrote:
>> This improves the IRQ forwarding for assigned devices: By using the
>> kernel's threaded IRQ scheme, we can get rid of the latency-prone work
>> queue and simplify the code in the same run.
>>
>> Moreover, we no longer have to hold assigned_dev_lock while raising the
>> guest IRQ, which can be a lenghty operation as we may have to iterate
>> over all VCPUs. The lock is now only used for synchronizing masking vs.
>> unmasking of INTx-type IRQs, thus is renames to intx_lock.
> 
> Nice stuff.
> 
>>
>> -#define KVM_ASSIGNED_MSIX_PENDING		0x1
>> -struct kvm_guest_msix_entry {
>> -	u32 vector;
>> -	u16 entry;
>> -	u16 flags;
>> -};
>> -
> 
> Ok, so .flags was used to handle the delay between the interrupt handler 
> and the work queue, which is now done automatically by threaded 
> interrupts.  Good.

Yes, a nice consolidating side effect.

> 
>> @@ -313,10 +276,10 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
>>   		return r;
>>
>>   	for (i = 0; i<  dev->entries_nr; i++) {
>> -		r = request_irq(dev->host_msix_entries[i].vector,
>> -				kvm_assigned_dev_intr, 0,
>> -				"kvm_assigned_msix_device",
>> -				(void *)dev);
>> +		r = request_threaded_irq(dev->host_msix_entries[i].vector,
>> +					 NULL, kvm_assigned_dev_thread,
>> +					 0, "kvm_assigned_msix_device",
>> +					 (void *)dev);
>>   		if (r)
>>   			goto err;
>>   	}
> 
> 
> Should eventually be done from interrupt context.  msix delivery only 
> needs a routing table lookup and smp reschedule IPI, which can be done 
> from interrupt context.
> 

Yes, definitely. But we would have to catch the case when the guest
configures the MSI-X message to be a broadcast to all its trillion VCPUs
(once we support trillions of them).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
  2010-11-08 11:21 ` [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
@ 2010-11-09 13:27   ` Avi Kivity
  2010-11-09 13:35     ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 13:27 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/08/2010 01:21 PM, Jan Kiszka wrote:
> PCI 2.3 allows to generically disable IRQ sources at device level. This
> enables us to share IRQs of such devices between on the host side when
> passing them to a guest. This feature is optional, user space has to
> request it explicitly. Moreover, user space can inform us about its view
> of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt
> and signaling it if the guest masked it via the PCI config space.
>

It's a pity this cannot be done transparently.  We could detect multiple 
devices sharing the line, but what about PCI_COMMAND_INTX_DISABLE?

Perhaps we can hook the kernel's handler for this bit?

>
>   /* Depends on KVM_CAP_IOMMU */
>   #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1<<  0)
> +/* The following two depend on KVM_CAP_PCI_2_3 */
> +#define KVM_DEV_ASSIGN_PCI_2_3		(1<<  1)
> +#define KVM_DEV_ASSIGN_MASK_INTX	(1<<  2)
> +
> +If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
> +via the PCI-2.3-compliant device-level mask, thus enable IRQ sharing with other
> +assigned devices or host devices. KVM_DEV_ASSIGN_MASK_INTX specifies the
> +guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
>
>   4.48 KVM_DEASSIGN_PCI_DEVICE
>
> @@ -1263,6 +1271,23 @@ struct kvm_assigned_msix_entry {
>   	__u16 padding[3];
>   };
>
> +5.54 KVM_ASSIGN_SET_INTX_MASK

4.54?

(54? wow.)

> +
> +Capability: KVM_CAP_PCI_2_3
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_assigned_pci_dev (in)
> +Returns: 0 on success, -1 on error
> +
> +Informs the kernel about the guest's view on the INTx mask. As long as the
> +guest masks the legacy INTx, the kernel will refrain from unmasking it at
> +hardware level and will not assert the guest's IRQ line. User space is still
> +responsible for applying this state to the assigned device's real config space.

What if userspace lies?

> +
> +See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
> +by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
> +evaluated.
> +
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fe83eb0..7f1627c 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -468,6 +468,7 @@ struct kvm_assigned_dev_kernel {
>   	unsigned int entries_nr;
>   	int host_irq;
>   	bool host_irq_disabled;
> +	bool pci_2_3;
>   	struct msix_entry *host_msix_entries;
>   	int guest_irq;
>   	struct msix_entry *guest_msix_entries;
> @@ -477,6 +478,7 @@ struct kvm_assigned_dev_kernel {
>   	struct pci_dev *dev;
>   	struct kvm *kvm;
>   	spinlock_t intx_lock;
> +	struct mutex intx_mask_lock;
>   	char irq_name[32];
>   };

I saw no reason this can't be a spinlock, but perhaps I missed 
something.  This would allow us to avoid srcu, which is slightly more 
expensive than rcu.  Since pci 2.3 assigned devices are not a major use 
case, I'd like not to penalize the mainstream users for this.

This patch undoes some of the niceness of the previous patches, but I 
have no alternative to suggest.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device
  2010-11-09 12:35   ` Avi Kivity
@ 2010-11-09 13:29     ` Jan Kiszka
  2010-11-09 13:36       ` Avi Kivity
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-09 13:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

Am 09.11.2010 13:35, Avi Kivity wrote:
> On 11/08/2010 01:21 PM, Jan Kiszka wrote:
>> The guest may change states that pci_reset_function does not touch. So
>> we better save/restore the assigned device across guest usage.
> 
> Why do we care?  Shouldn't the next user reset the state to its taste?  

Maybe he should, but are we sure this actually happens? E.g.
pci_reset_function preserves the config state, thus does not remove the
traces of guest.

> Or are you worried about information leakage?

Anyone who's worried about security should not assign the same device to
different users (host or guest) that should be isolated from each other.
You never know in what state the firmware and/or internal memory of the
device are.

Probably, anyone concerned about security should not give a device in
untrusted hands at all...

>> @@ -514,6 +515,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>>   	}
>>
>>   	pci_reset_function(dev);
>> +	pci_save_state(dev);
>>
>>   	match->assigned_dev_id = assigned_dev->assigned_dev_id;
>>   	match->host_segnr = assigned_dev->segnr;
>> @@ -544,6 +546,7 @@ out:
>>   	mutex_unlock(&kvm->lock);
>>   	return r;
>>   out_list_del:
>> +	pci_restore_state(dev);
>>   	list_del(&match->list);
>>   	pci_release_regions(dev);
>>   out_disable:
> 
> This assumes no one else calls pci_save_state()/pci_restore_state() 

Correct.

> while the guest is running.  Is this true?  What about suspend/resume? 
> (can this even work without guest assistance?)

Well, so far suspend/resume does not work at all once assigned devices
are in use. IIUC, the guest is not informed about the fact that power
will be cut off, trashing the states of its assigned devices. We would
have to signal this event to the guest and give it a chance to save the
states.

Maybe we could generate an ACPI event that corresponds to pressing a
standby button. Or we would have to signal PCI unplug events for all
assigned devices. In any case, suspend/resume of the host then becomes
dependent on guest proper reaction.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
  2010-11-09 13:27   ` Avi Kivity
@ 2010-11-09 13:35     ` Jan Kiszka
  2010-11-09 13:41       ` Avi Kivity
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-09 13:35 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

Am 09.11.2010 14:27, Avi Kivity wrote:
> On 11/08/2010 01:21 PM, Jan Kiszka wrote:
>> PCI 2.3 allows to generically disable IRQ sources at device level. This
>> enables us to share IRQs of such devices between on the host side when
>> passing them to a guest. This feature is optional, user space has to
>> request it explicitly. Moreover, user space can inform us about its view
>> of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt
>> and signaling it if the guest masked it via the PCI config space.
>>
> 
> It's a pity this cannot be done transparently.  We could detect multiple 
> devices sharing the line,

Even that is not possible. Assigned or host devices may be activated
after we registered exclusively, pushing the breakage from VM start-up
to a different operation.

> but what about PCI_COMMAND_INTX_DISABLE?
> 
> Perhaps we can hook the kernel's handler for this bit?

Some IRQ registration notifier that would allow us to reregister our
handler with IRQ sharing support? Maybe.

> 
>>
>>   /* Depends on KVM_CAP_IOMMU */
>>   #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1<<  0)
>> +/* The following two depend on KVM_CAP_PCI_2_3 */
>> +#define KVM_DEV_ASSIGN_PCI_2_3		(1<<  1)
>> +#define KVM_DEV_ASSIGN_MASK_INTX	(1<<  2)
>> +
>> +If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
>> +via the PCI-2.3-compliant device-level mask, thus enable IRQ sharing with other
>> +assigned devices or host devices. KVM_DEV_ASSIGN_MASK_INTX specifies the
>> +guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
>>
>>   4.48 KVM_DEASSIGN_PCI_DEVICE
>>
>> @@ -1263,6 +1271,23 @@ struct kvm_assigned_msix_entry {
>>   	__u16 padding[3];
>>   };
>>
>> +5.54 KVM_ASSIGN_SET_INTX_MASK
> 
> 4.54?

Of course.

> 
> (54? wow.)

And I don't think all IOCTLs are already documented (though the majority
now).

> 
>> +
>> +Capability: KVM_CAP_PCI_2_3
>> +Architectures: x86
>> +Type: vm ioctl
>> +Parameters: struct kvm_assigned_pci_dev (in)
>> +Returns: 0 on success, -1 on error
>> +
>> +Informs the kernel about the guest's view on the INTx mask. As long as the
>> +guest masks the legacy INTx, the kernel will refrain from unmasking it at
>> +hardware level and will not assert the guest's IRQ line. User space is still
>> +responsible for applying this state to the assigned device's real config space.
> 
> What if userspace lies?

User space problem. We will at worst receive one IRQ, mask it, and then
user space need to react again.

> 
>> +
>> +See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
>> +by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
>> +evaluated.
>> +
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index fe83eb0..7f1627c 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -468,6 +468,7 @@ struct kvm_assigned_dev_kernel {
>>   	unsigned int entries_nr;
>>   	int host_irq;
>>   	bool host_irq_disabled;
>> +	bool pci_2_3;
>>   	struct msix_entry *host_msix_entries;
>>   	int guest_irq;
>>   	struct msix_entry *guest_msix_entries;
>> @@ -477,6 +478,7 @@ struct kvm_assigned_dev_kernel {
>>   	struct pci_dev *dev;
>>   	struct kvm *kvm;
>>   	spinlock_t intx_lock;
>> +	struct mutex intx_mask_lock;
>>   	char irq_name[32];
>>   };
> 
> I saw no reason this can't be a spinlock, but perhaps I missed 
> something.  This would allow us to avoid srcu, which is slightly more 
> expensive than rcu.  Since pci 2.3 assigned devices are not a major use 
> case, I'd like not to penalize the mainstream users for this.

The lock has to be held across kvm_set_irq, which is the potentially
expensive (O(n), n == number of VCPUs) operation.

> 
> This patch undoes some of the niceness of the previous patches, but I 
> have no alternative to suggest.
> 

Yes, it surely does not make things simpler. But much of the complexity
is avoided during runtime when MSIs are used.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device
  2010-11-09 13:29     ` Jan Kiszka
@ 2010-11-09 13:36       ` Avi Kivity
  2010-11-09 13:44         ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 13:36 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/09/2010 03:29 PM, Jan Kiszka wrote:
> Am 09.11.2010 13:35, Avi Kivity wrote:
> >  On 11/08/2010 01:21 PM, Jan Kiszka wrote:
> >>  The guest may change states that pci_reset_function does not touch. So
> >>  we better save/restore the assigned device across guest usage.
> >
> >  Why do we care?  Shouldn't the next user reset the state to its taste?
>
> Maybe he should, but are we sure this actually happens? E.g.
> pci_reset_function preserves the config state, thus does not remove the
> traces of guest.

Oh yes, I read the code but it didn't register.  Of course this change 
is quite necessary.

(I understood you to mean that the PCI 2.3 reset doesn't reset 
everything, but that isn't what you said).

> >  Or are you worried about information leakage?
>
> Anyone who's worried about security should not assign the same device to
> different users (host or guest) that should be isolated from each other.
> You never know in what state the firmware and/or internal memory of the
> device are.
>
> Probably, anyone concerned about security should not give a device in
> untrusted hands at all...

Agreed, it's much more suitable to the embedded/fixed function case.

> >
> >  This assumes no one else calls pci_save_state()/pci_restore_state()
>
> Correct.

Given the above, this is of secondary concern.

> >  while the guest is running.  Is this true?  What about suspend/resume?
> >  (can this even work without guest assistance?)
>
> Well, so far suspend/resume does not work at all once assigned devices
> are in use. IIUC, the guest is not informed about the fact that power
> will be cut off, trashing the states of its assigned devices. We would
> have to signal this event to the guest and give it a chance to save the
> states.
>
> Maybe we could generate an ACPI event that corresponds to pressing a
> standby button. Or we would have to signal PCI unplug events for all
> assigned devices. In any case, suspend/resume of the host then becomes
> dependent on guest proper reaction.

Yeah.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
  2010-11-09 13:35     ` Jan Kiszka
@ 2010-11-09 13:41       ` Avi Kivity
  2010-11-09 14:11         ` Jan Kiszka
  0 siblings, 1 reply; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 13:41 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/09/2010 03:35 PM, Jan Kiszka wrote:
> Am 09.11.2010 14:27, Avi Kivity wrote:
> >  On 11/08/2010 01:21 PM, Jan Kiszka wrote:
> >>  PCI 2.3 allows to generically disable IRQ sources at device level. This
> >>  enables us to share IRQs of such devices between on the host side when
> >>  passing them to a guest. This feature is optional, user space has to
> >>  request it explicitly. Moreover, user space can inform us about its view
> >>  of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt
> >>  and signaling it if the guest masked it via the PCI config space.
> >>
> >
> >  It's a pity this cannot be done transparently.  We could detect multiple
> >  devices sharing the line,
>
> Even that is not possible. Assigned or host devices may be activated
> after we registered exclusively, pushing the breakage from VM start-up
> to a different operation.

We could detect that and switch the interrupt mode.  Or we could always 
to IRQF_SHARED and fake something in the immediate callback.

> >  but what about PCI_COMMAND_INTX_DISABLE?
> >
> >  Perhaps we can hook the kernel's handler for this bit?
>
> Some IRQ registration notifier that would allow us to reregister our
> handler with IRQ sharing support? Maybe.

Adding an internal API if preferable to an external one (it may be a 
pain to kvm-kmod users though).

> >
> >>  +
> >>  +Capability: KVM_CAP_PCI_2_3
> >>  +Architectures: x86
> >>  +Type: vm ioctl
> >>  +Parameters: struct kvm_assigned_pci_dev (in)
> >>  +Returns: 0 on success, -1 on error
> >>  +
> >>  +Informs the kernel about the guest's view on the INTx mask. As long as the
> >>  +guest masks the legacy INTx, the kernel will refrain from unmasking it at
> >>  +hardware level and will not assert the guest's IRQ line. User space is still
> >>  +responsible for applying this state to the assigned device's real config space.
> >
> >  What if userspace lies?
>
> User space problem. We will at worst receive one IRQ, mask it, and then
> user space need to react again.

Ok.

> >
> >  I saw no reason this can't be a spinlock, but perhaps I missed
> >  something.  This would allow us to avoid srcu, which is slightly more
> >  expensive than rcu.  Since pci 2.3 assigned devices are not a major use
> >  case, I'd like not to penalize the mainstream users for this.
>
> The lock has to be held across kvm_set_irq, which is the potentially
> expensive (O(n), n == number of VCPUs) operation.

What we should probably do is have broadcast interrupts deferred to a 
thread.  I agree it isn't pretty.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device
  2010-11-09 13:36       ` Avi Kivity
@ 2010-11-09 13:44         ` Jan Kiszka
  2010-11-09 13:46           ` Avi Kivity
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-09 13:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

Am 09.11.2010 14:36, Avi Kivity wrote:
> On 11/09/2010 03:29 PM, Jan Kiszka wrote:
>> Am 09.11.2010 13:35, Avi Kivity wrote:
>>>  On 11/08/2010 01:21 PM, Jan Kiszka wrote:
>>>>  The guest may change states that pci_reset_function does not touch. So
>>>>  we better save/restore the assigned device across guest usage.
>>>
>>>  Why do we care?  Shouldn't the next user reset the state to its taste?
>>
>> Maybe he should, but are we sure this actually happens? E.g.
>> pci_reset_function preserves the config state, thus does not remove the
>> traces of guest.
> 
> Oh yes, I read the code but it didn't register.  Of course this change 
> is quite necessary.
> 
> (I understood you to mean that the PCI 2.3 reset doesn't reset 
> everything, but that isn't what you said).

What the hardware makes out of the reset is even another story. No
guarantees I bet (isn't function-level reset an optional thing anyway?).

At least I can report that I managed to kick my Intel 82577LM into limbo
land by trying to load the wrong driver in a guest - host reset was
required afterward to reclaim its functionality.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device
  2010-11-09 13:44         ` Jan Kiszka
@ 2010-11-09 13:46           ` Avi Kivity
  2010-11-09 16:41             ` Don Dutile
  0 siblings, 1 reply; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 13:46 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/09/2010 03:44 PM, Jan Kiszka wrote:
> >
> >  Oh yes, I read the code but it didn't register.  Of course this change
> >  is quite necessary.
> >
> >  (I understood you to mean that the PCI 2.3 reset doesn't reset
> >  everything, but that isn't what you said).
>
> What the hardware makes out of the reset is even another story. No
> guarantees I bet (isn't function-level reset an optional thing anyway?).

It is.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
  2010-11-09 13:41       ` Avi Kivity
@ 2010-11-09 14:11         ` Jan Kiszka
  2010-11-09 14:20           ` Avi Kivity
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-09 14:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

Am 09.11.2010 14:41, Avi Kivity wrote:
> On 11/09/2010 03:35 PM, Jan Kiszka wrote:
>> Am 09.11.2010 14:27, Avi Kivity wrote:
>>>  On 11/08/2010 01:21 PM, Jan Kiszka wrote:
>>>>  PCI 2.3 allows to generically disable IRQ sources at device level. This
>>>>  enables us to share IRQs of such devices between on the host side when
>>>>  passing them to a guest. This feature is optional, user space has to
>>>>  request it explicitly. Moreover, user space can inform us about its view
>>>>  of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt
>>>>  and signaling it if the guest masked it via the PCI config space.
>>>>
>>>
>>>  It's a pity this cannot be done transparently.  We could detect multiple
>>>  devices sharing the line,
>>
>> Even that is not possible. Assigned or host devices may be activated
>> after we registered exclusively, pushing the breakage from VM start-up
>> to a different operation.
> 
> We could detect that and switch the interrupt mode.  Or we could always 
> to IRQF_SHARED and fake something in the immediate callback.
> 
>>>  but what about PCI_COMMAND_INTX_DISABLE?
>>>
>>>  Perhaps we can hook the kernel's handler for this bit?
>>
>> Some IRQ registration notifier that would allow us to reregister our
>> handler with IRQ sharing support? Maybe.
> 
> Adding an internal API if preferable to an external one (it may be a 
> pain to kvm-kmod users though).

Primary concern should be a clean and robust API. From that POV, I would
prefer an official hook with genirq maintainer blessing over fragile
detection heuristics in kvm. Given that VSIO should benefit from any
transparent pattern we develop here as well, it's probably worth to go
that path - if it is really preferred over manual control like this
patch proposes.

For kvm-kmod, we could simply enforce IRQ sharing measures
unconditionally. Not optimal from the performance POV, but people
concerned that much about performance should better use KVM over the
corresponding kernel anyway.

> 
>>>
>>>>  +
>>>>  +Capability: KVM_CAP_PCI_2_3
>>>>  +Architectures: x86
>>>>  +Type: vm ioctl
>>>>  +Parameters: struct kvm_assigned_pci_dev (in)
>>>>  +Returns: 0 on success, -1 on error
>>>>  +
>>>>  +Informs the kernel about the guest's view on the INTx mask. As long as the
>>>>  +guest masks the legacy INTx, the kernel will refrain from unmasking it at
>>>>  +hardware level and will not assert the guest's IRQ line. User space is still
>>>>  +responsible for applying this state to the assigned device's real config space.
>>>
>>>  What if userspace lies?
>>
>> User space problem. We will at worst receive one IRQ, mask it, and then
>> user space need to react again.
> 
> Ok.
> 
>>>
>>>  I saw no reason this can't be a spinlock, but perhaps I missed
>>>  something.  This would allow us to avoid srcu, which is slightly more
>>>  expensive than rcu.  Since pci 2.3 assigned devices are not a major use
>>>  case, I'd like not to penalize the mainstream users for this.
>>
>> The lock has to be held across kvm_set_irq, which is the potentially
>> expensive (O(n), n == number of VCPUs) operation.
> 
> What we should probably do is have broadcast interrupts deferred to a 
> thread.  I agree it isn't pretty.

Not sure we could defer it that easily (if at all). However, I think
such improvements should be done on top if this already complex change.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
  2010-11-09 14:11         ` Jan Kiszka
@ 2010-11-09 14:20           ` Avi Kivity
  0 siblings, 0 replies; 35+ messages in thread
From: Avi Kivity @ 2010-11-09 14:20 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/09/2010 04:11 PM, Jan Kiszka wrote:
> >
> >>>   but what about PCI_COMMAND_INTX_DISABLE?
> >>>
> >>>   Perhaps we can hook the kernel's handler for this bit?
> >>
> >>  Some IRQ registration notifier that would allow us to reregister our
> >>  handler with IRQ sharing support? Maybe.
> >
> >  Adding an internal API if preferable to an external one (it may be a
> >  pain to kvm-kmod users though).
>
> Primary concern should be a clean and robust API. From that POV, I would
> prefer an official hook with genirq maintainer blessing over fragile
> detection heuristics in kvm.

That is what I meant (internal to the kernel, not to kvm, vs the 
external kvm/user interface).

> Given that VSIO should benefit from any
> transparent pattern we develop here as well, it's probably worth to go
> that path - if it is really preferred over manual control like this
> patch proposes.

Yes.  The API is already sufficiently complicated that it is very hard 
for userspace to get right.

> For kvm-kmod, we could simply enforce IRQ sharing measures
> unconditionally. Not optimal from the performance POV, but people
> concerned that much about performance should better use KVM over the
> corresponding kernel anyway.

Or backport the needed patches.  kvm-kmod is now no longer needed for 
enterprise users with fixed distros, it's mostly useful for the embedded 
guys which have more flexibility.

> >>>
> >>>   I saw no reason this can't be a spinlock, but perhaps I missed
> >>>   something.  This would allow us to avoid srcu, which is slightly more
> >>>   expensive than rcu.  Since pci 2.3 assigned devices are not a major use
> >>>   case, I'd like not to penalize the mainstream users for this.
> >>
> >>  The lock has to be held across kvm_set_irq, which is the potentially
> >>  expensive (O(n), n == number of VCPUs) operation.
> >
> >  What we should probably do is have broadcast interrupts deferred to a
> >  thread.  I agree it isn't pretty.
>
> Not sure we could defer it that easily (if at all). However, I think
> such improvements should be done on top if this already complex change.

That means flipping locking to srcu and back to rcu.  What I'm proposing 
is to keep broadcast within the spinlock for now, and move it later, 
provided we agree that it's doable.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device
  2010-11-09 13:46           ` Avi Kivity
@ 2010-11-09 16:41             ` Don Dutile
  0 siblings, 0 replies; 35+ messages in thread
From: Don Dutile @ 2010-11-09 16:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jan Kiszka, Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

Avi Kivity wrote:
> On 11/09/2010 03:44 PM, Jan Kiszka wrote:
>> >
>> >  Oh yes, I read the code but it didn't register.  Of course this change
>> >  is quite necessary.
>> >
>> >  (I understood you to mean that the PCI 2.3 reset doesn't reset
>> >  everything, but that isn't what you said).
>>
>> What the hardware makes out of the reset is even another story. No
>> guarantees I bet (isn't function-level reset an optional thing anyway?).
> 
> It is.
> 
optional for PF's; mandatory for VFs.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release
  2010-11-09 10:58   ` Avi Kivity
  2010-11-09 11:20     ` Jan Kiszka
@ 2010-11-09 18:36     ` Alex Williamson
  2010-11-10  6:53       ` Avi Kivity
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2010-11-09 18:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, Marcelo Tosatti, kvm, Michael S. Tsirkin

On Tue, 2010-11-09 at 12:58 +0200, Avi Kivity wrote:
> On 11/08/2010 01:21 PM, Jan Kiszka wrote:
> > When we deassign a guest IRQ, clear the potentially asserted guest line.
> > There might be no chance for the guest to do this, specifically if we
> > switch from INTx to MSI mode.
> >
> > Signed-off-by: Jan Kiszka<jan.kiszka@siemens.com>
> > ---
> >   virt/kvm/assigned-dev.c |    3 +++
> >   1 files changed, 3 insertions(+), 0 deletions(-)
> >
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index 7c98928..ecc4419 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -141,6 +141,9 @@ static void deassign_guest_irq(struct kvm *kvm,
> >   	kvm_unregister_irq_ack_notifier(kvm,&assigned_dev->ack_notifier);
> >   	assigned_dev->ack_notifier.gsi = -1;
> >
> > +	kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> > +		    assigned_dev->guest_irq, 0);
> > +
> >   	if (assigned_dev->irq_source_id != -1)
> >   		kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
> >   	assigned_dev->irq_source_id = -1;
> 
> I guess this can't hurt.  Did you see it happen in practice?
> 
> Note: all this kvm_set_irq(..., [01]) is incorrect as it doesn't account 
> for polarity.  Currently the qemu-emulated chipset uses level high pci 
> interrupts, but that's not a given from kvm's point of view.
> 
> I think vfio fixes this by only routing msi interrupts via the kernel, 
> and routing level interrupts through userspace, which can adjust 
> polarity.  Alex/Michael?

The latest patches will route legacy interrupts via irqfd if available
too.  We do have the issue that KVM pulses interrupts injected this way,
but it seems to work nonetheless.  I was thinking about proposing a
level flag for set_irqfd to only set it, allowing the ack notifier to
deassert it.  Perhaps we also need a flag to toggle the polarity.

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release
  2010-11-09 18:36     ` Alex Williamson
@ 2010-11-10  6:53       ` Avi Kivity
  2010-11-26 10:36         ` Michael S. Tsirkin
  0 siblings, 1 reply; 35+ messages in thread
From: Avi Kivity @ 2010-11-10  6:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, Marcelo Tosatti, kvm, Michael S. Tsirkin

On 11/09/2010 08:36 PM, Alex Williamson wrote:
> >
> >  Note: all this kvm_set_irq(..., [01]) is incorrect as it doesn't account
> >  for polarity.  Currently the qemu-emulated chipset uses level high pci
> >  interrupts, but that's not a given from kvm's point of view.
> >
> >  I think vfio fixes this by only routing msi interrupts via the kernel,
> >  and routing level interrupts through userspace, which can adjust
> >  polarity.  Alex/Michael?
>
> The latest patches will route legacy interrupts via irqfd if available
> too.  We do have the issue that KVM pulses interrupts injected this way,
> but it seems to work nonetheless.

That doesn't sound too good.

> I was thinking about proposing a
> level flag for set_irqfd to only set it, allowing the ack notifier to
> deassert it.  Perhaps we also need a flag to toggle the polarity.

Or maybe irqfd is not up to this task.

I'd rather stick with the ordinary ioctls for this until proven they're 
too slow.  By proven, I mean a userspace irq forwarding implementation 
that doesn't suck like we it does now (can be just a prototype).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough
  2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
                   ` (8 preceding siblings ...)
  2010-11-08 11:21 ` [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
@ 2010-11-16 16:55 ` Marcelo Tosatti
  2010-11-16 18:26   ` Jan Kiszka
  9 siblings, 1 reply; 35+ messages in thread
From: Marcelo Tosatti @ 2010-11-16 16:55 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, kvm, Alex Williamson, Michael S. Tsirkin

On Mon, Nov 08, 2010 at 12:21:44PM +0100, Jan Kiszka wrote:
> Nine patches (yeah, it's getting more and more) to improve "classic"
> device assigment /wrt IRQs. Highlight is the last one that resolves the
> host IRQ sharing issue for all PCI 2.3 devices. Quite essential when
> passing non-MSI-ready devices like many USB host controllers.
> 
> As there were concerns regarding the overhead of IRQ masking via the PCI
> config space, I did some micro-benchmarks. Well, the concerns are valid:
> 
> disable_irq_nosync:           ~600 cycles
> pci_2_3_irq_check_and_mask:  ~6000 cycles (EHCI)
>                             ~22000 cycles (AR9287, with peaks >100000)
> 
> Specifically the varying impact of the device like in the Atheros case
> is worrying (this device is actually known to cause horrible latencies
> to the host, but who knows what other devices do). So I decided to go
> with PCI-2.3 masking as default off in the to-be-sent qemu-kvm patch.
> Maybe something to consider vor VFIO as well.

Looks fine to me. Michael, Alex? 

Also needs a rebase.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough
  2010-11-16 16:55 ` [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Marcelo Tosatti
@ 2010-11-16 18:26   ` Jan Kiszka
  2010-11-17  8:25     ` Avi Kivity
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kiszka @ 2010-11-16 18:26 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, Alex Williamson, Michael S. Tsirkin

Am 16.11.2010 17:55, Marcelo Tosatti wrote:
> On Mon, Nov 08, 2010 at 12:21:44PM +0100, Jan Kiszka wrote:
>> Nine patches (yeah, it's getting more and more) to improve "classic"
>> device assigment /wrt IRQs. Highlight is the last one that resolves the
>> host IRQ sharing issue for all PCI 2.3 devices. Quite essential when
>> passing non-MSI-ready devices like many USB host controllers.
>>
>> As there were concerns regarding the overhead of IRQ masking via the PCI
>> config space, I did some micro-benchmarks. Well, the concerns are valid:
>>
>> disable_irq_nosync:           ~600 cycles
>> pci_2_3_irq_check_and_mask:  ~6000 cycles (EHCI)
>>                             ~22000 cycles (AR9287, with peaks >100000)
>>
>> Specifically the varying impact of the device like in the Atheros case
>> is worrying (this device is actually known to cause horrible latencies
>> to the host, but who knows what other devices do). So I decided to go
>> with PCI-2.3 masking as default off in the to-be-sent qemu-kvm patch.
>> Maybe something to consider vor VFIO as well.
> 
> Looks fine to me. Michael, Alex? 
> 
> Also needs a rebase.

I think Avi wanted to explore possibilities to avoid host_pci_2_3=on|off
by switching between both modes on demand. I've some ideas, but still
need to look into design details. Given that this would have impact on
the ABI, I guess we better wait with merging 9/9.

Moreover, Avi preferred to skip the srcu conversion and run with a
simply lock across kvm_set_irq (just like we do now).

But I could offer to rebase and repost the independent rest of the
series tonight.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough
  2010-11-16 18:26   ` Jan Kiszka
@ 2010-11-17  8:25     ` Avi Kivity
  0 siblings, 0 replies; 35+ messages in thread
From: Avi Kivity @ 2010-11-17  8:25 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Marcelo Tosatti, kvm, Alex Williamson, Michael S. Tsirkin

On 11/16/2010 08:26 PM, Jan Kiszka wrote:
> Am 16.11.2010 17:55, Marcelo Tosatti wrote:
> >  On Mon, Nov 08, 2010 at 12:21:44PM +0100, Jan Kiszka wrote:
> >>  Nine patches (yeah, it's getting more and more) to improve "classic"
> >>  device assigment /wrt IRQs. Highlight is the last one that resolves the
> >>  host IRQ sharing issue for all PCI 2.3 devices. Quite essential when
> >>  passing non-MSI-ready devices like many USB host controllers.
> >>
> >>  As there were concerns regarding the overhead of IRQ masking via the PCI
> >>  config space, I did some micro-benchmarks. Well, the concerns are valid:
> >>
> >>  disable_irq_nosync:           ~600 cycles
> >>  pci_2_3_irq_check_and_mask:  ~6000 cycles (EHCI)
> >>                              ~22000 cycles (AR9287, with peaks>100000)
> >>
> >>  Specifically the varying impact of the device like in the Atheros case
> >>  is worrying (this device is actually known to cause horrible latencies
> >>  to the host, but who knows what other devices do). So I decided to go
> >>  with PCI-2.3 masking as default off in the to-be-sent qemu-kvm patch.
> >>  Maybe something to consider vor VFIO as well.
> >
> >  Looks fine to me. Michael, Alex?
> >
> >  Also needs a rebase.
>
> I think Avi wanted to explore possibilities to avoid host_pci_2_3=on|off
> by switching between both modes on demand. I've some ideas, but still
> need to look into design details. Given that this would have impact on
> the ABI, I guess we better wait with merging 9/9.

Indeed.

> Moreover, Avi preferred to skip the srcu conversion and run with a
> simply lock across kvm_set_irq (just like we do now).

Yes.  It was predicated on agreeing that deferring broadcast/multicast 
IPIs to a workqueue is possible.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release
  2010-11-10  6:53       ` Avi Kivity
@ 2010-11-26 10:36         ` Michael S. Tsirkin
  0 siblings, 0 replies; 35+ messages in thread
From: Michael S. Tsirkin @ 2010-11-26 10:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, Jan Kiszka, Marcelo Tosatti, kvm

On Wed, Nov 10, 2010 at 08:53:35AM +0200, Avi Kivity wrote:
> On 11/09/2010 08:36 PM, Alex Williamson wrote:
> >>
> >>  Note: all this kvm_set_irq(..., [01]) is incorrect as it doesn't account
> >>  for polarity.  Currently the qemu-emulated chipset uses level high pci
> >>  interrupts, but that's not a given from kvm's point of view.
> >>
> >>  I think vfio fixes this by only routing msi interrupts via the kernel,
> >>  and routing level interrupts through userspace, which can adjust
> >>  polarity.  Alex/Michael?
> >
> >The latest patches will route legacy interrupts via irqfd if available
> >too.  We do have the issue that KVM pulses interrupts injected this way,
> >but it seems to work nonetheless.
> 
> That doesn't sound too good.

It's more than luck that it works: there is a backend
device that keeps the interrupt asserted, if driver did not
handle the interrupt but the OS did ack it, we will unmask
the interrupt, get another one from the device
(or just notice status bit is set) and send another pulse.

Whether all this improves or hurts performance, or whether
some guest might notice the difference, I don't know.
So maybe pulsing the interrupt is actually a good thing?
Anyone has any theories?

What I think is broken, is
1. interrupt sharing, as irqfds use the userspace id instead of
   a per-device id.
2. emulated devices as there's no one to reassert the interrupt.


> >I was thinking about proposing a
> >level flag for set_irqfd to only set it, allowing the ack notifier to
> >deassert it.  Perhaps we also need a flag to toggle the polarity.
> 
> Or maybe irqfd is not up to this task.
> 
> I'd rather stick with the ordinary ioctls for this until proven
> they're too slow.  By proven, I mean a userspace irq forwarding
> implementation that doesn't suck like we it does now (can be just a
> prototype).
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2010-11-26 10:36 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-08 11:21 [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
2010-11-08 11:21 ` [PATCH v4 1/9] KVM: Fix srcu struct leakage Jan Kiszka
2010-11-08 17:00   ` Michael S. Tsirkin
2010-11-08 17:32     ` Jan Kiszka
2010-11-08 11:21 ` [PATCH v4 2/9] KVM: Switch IRQ subsystem to SRCU Jan Kiszka
2010-11-09 10:49   ` Avi Kivity
2010-11-09 11:21     ` Jan Kiszka
2010-11-08 11:21 ` [PATCH v4 3/9] KVM: Clear assigned guest IRQ on release Jan Kiszka
2010-11-09 10:58   ` Avi Kivity
2010-11-09 11:20     ` Jan Kiszka
2010-11-09 18:36     ` Alex Williamson
2010-11-10  6:53       ` Avi Kivity
2010-11-26 10:36         ` Michael S. Tsirkin
2010-11-08 11:21 ` [PATCH v4 4/9] KVM: Switch assigned device IRQ forwarding to threaded handler Jan Kiszka
2010-11-09 12:26   ` Avi Kivity
2010-11-09 12:36     ` Jan Kiszka
2010-11-08 11:21 ` [PATCH v4 5/9] KVM: Refactor IRQ names of assigned devices Jan Kiszka
2010-11-08 11:21 ` [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device Jan Kiszka
2010-11-09 12:35   ` Avi Kivity
2010-11-09 13:29     ` Jan Kiszka
2010-11-09 13:36       ` Avi Kivity
2010-11-09 13:44         ` Jan Kiszka
2010-11-09 13:46           ` Avi Kivity
2010-11-09 16:41             ` Don Dutile
2010-11-08 11:21 ` [PATCH v4 7/9] KVM: Clean up kvm_vm_ioctl_assigned_device Jan Kiszka
2010-11-08 11:21 ` [PATCH v4 8/9] KVM: Document device assigment API Jan Kiszka
2010-11-08 11:21 ` [PATCH v4 9/9] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
2010-11-09 13:27   ` Avi Kivity
2010-11-09 13:35     ` Jan Kiszka
2010-11-09 13:41       ` Avi Kivity
2010-11-09 14:11         ` Jan Kiszka
2010-11-09 14:20           ` Avi Kivity
2010-11-16 16:55 ` [PATCH v4 0/9] KVM: Improve IRQ assignment for device passthrough Marcelo Tosatti
2010-11-16 18:26   ` Jan Kiszka
2010-11-17  8:25     ` Avi Kivity

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.