* [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
@ 2010-10-20  8:26 Sheng Yang
  2010-10-20  8:26 ` [PATCH 1/8] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
                   ` (10 more replies)
  0 siblings, 11 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Michael S. Tsirkin, Sheng Yang

Here is v2.

Changelog:

v1->v2

The major change from v1 is that I've added in-kernel MSI-X mask emulation
support, as well as shortcuts for reading the MSI-X table.

I've taken Michael's advice to use mask/unmask directly, but I'm unsure about
exporting irq_to_desc() to modules...

I've also added flush_work() according to Marcelo's comments.

Sheng Yang (8):
  PCI: MSI: Move MSI-X entry definition to pci_regs.h
  irq: Export irq_to_desc() to modules
  KVM: x86: Enable ENABLE_CAP capability for x86
  KVM: Move struct kvm_io_device to kvm_host.h
  KVM: Add kvm_get_irq_routing_entry() func
  KVM: assigned dev: Preparation for mask support in userspace
  KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  KVM: Emulation MSI-X mask bits for assigned devices

 Documentation/kvm/api.txt       |   30 +++++-
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/x86.c              |   32 ++++++
 drivers/pci/msi.h               |    6 -
 include/linux/kvm.h             |   19 +++-
 include/linux/kvm_host.h        |   28 +++++
 include/linux/pci_regs.h        |    7 +
 kernel/irq/handle.c             |    2 +
 virt/kvm/assigned-dev.c         |  230 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/iodev.h                |   25 +----
 virt/kvm/irq_comm.c             |   20 ++++
 11 files changed, 367 insertions(+), 34 deletions(-)



* [PATCH 1/8] PCI: MSI: Move MSI-X entry definition to pci_regs.h
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
@ 2010-10-20  8:26 ` Sheng Yang
  2010-10-20 11:07   ` Matthew Wilcox
  2010-10-20  8:26 ` [PATCH 2/8] irq: Export irq_to_desc() to modules Sheng Yang
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti
  Cc: kvm, Michael S. Tsirkin, Sheng Yang, Jesse Barnes, linux-pci

It will be used by KVM later.

Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 drivers/pci/msi.h        |    6 ------
 include/linux/pci_regs.h |    7 +++++++
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/msi.h b/drivers/pci/msi.h
index de27c1c..28a3c52 100644
--- a/drivers/pci/msi.h
+++ b/drivers/pci/msi.h
@@ -6,12 +6,6 @@
 #ifndef MSI_H
 #define MSI_H
 
-#define PCI_MSIX_ENTRY_SIZE		16
-#define  PCI_MSIX_ENTRY_LOWER_ADDR	0
-#define  PCI_MSIX_ENTRY_UPPER_ADDR	4
-#define  PCI_MSIX_ENTRY_DATA		8
-#define  PCI_MSIX_ENTRY_VECTOR_CTRL	12
-
 #define msi_control_reg(base)		(base + PCI_MSI_FLAGS)
 #define msi_lower_address_reg(base)	(base + PCI_MSI_ADDRESS_LO)
 #define msi_upper_address_reg(base)	(base + PCI_MSI_ADDRESS_HI)
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 455b9cc..acfc224 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -307,6 +307,13 @@
 #define  PCI_MSIX_FLAGS_MASKALL	(1 << 14)
 #define PCI_MSIX_FLAGS_BIRMASK	(7 << 0)
 
+/* MSI-X entry's format */
+#define PCI_MSIX_ENTRY_SIZE		16
+#define  PCI_MSIX_ENTRY_LOWER_ADDR	0
+#define  PCI_MSIX_ENTRY_UPPER_ADDR	4
+#define  PCI_MSIX_ENTRY_DATA		8
+#define  PCI_MSIX_ENTRY_VECTOR_CTRL	12
+
 /* CompactPCI Hotswap Register */
 
 #define PCI_CHSWP_CSR		2	/* Control and Status Register */
-- 
1.7.0.1



* [PATCH 2/8] irq: Export irq_to_desc() to modules
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
  2010-10-20  8:26 ` [PATCH 1/8] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
@ 2010-10-20  8:26 ` Sheng Yang
  2010-10-20  8:26 ` [PATCH 3/8] KVM: x86: Enable ENABLE_CAP capability for x86 Sheng Yang
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti
  Cc: kvm, Michael S. Tsirkin, Sheng Yang, linux-kernel

KVM needs to execute mask/unmask directly on MSI/MSI-X devices. An alternative
would be to export mask_msi_irq(), but that lacks a check that the IRQ type
is really MSI.
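
For illustration (not part of the patch), a minimal sketch of how a module
such as KVM can use the exported irq_to_desc() to mask an MSI-backed IRQ
while refusing anything else.  It follows the 2.6.36-era irq_desc layout
(desc->msi_desc, desc->chip->mask/unmask) that patch 6 of this series also
relies on; module_mask_msi_irq() is a hypothetical helper name:

#include <linux/irq.h>
#include <linux/irqnr.h>

/* Mask or unmask an MSI/MSI-X backed IRQ; reject non-MSI interrupts. */
static int module_mask_msi_irq(unsigned int irq, bool mask)
{
	struct irq_desc *desc = irq_to_desc(irq);

	if (!desc || !desc->msi_desc)	/* not an MSI/MSI-X interrupt */
		return -EINVAL;

	if (mask)
		desc->chip->mask(irq);
	else
		desc->chip->unmask(irq);
	return 0;
}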

Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 kernel/irq/handle.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 27e5c69..1ea2a24 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -139,6 +139,7 @@ struct irq_desc *irq_to_desc(unsigned int irq)
 {
 	return radix_tree_lookup(&irq_desc_tree, irq);
 }
+EXPORT_SYMBOL_GPL(irq_to_desc);
 
 void replace_irq_desc(unsigned int irq, struct irq_desc *desc)
 {
@@ -276,6 +277,7 @@ struct irq_desc *irq_to_desc(unsigned int irq)
 {
 	return (irq < NR_IRQS) ? irq_desc + irq : NULL;
 }
+EXPORT_SYMBOL_GPL(irq_to_desc);
 
 struct irq_desc *irq_to_desc_alloc_node(unsigned int irq, int node)
 {
-- 
1.7.0.1



* [PATCH 3/8] KVM: x86: Enable ENABLE_CAP capability for x86
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
  2010-10-20  8:26 ` [PATCH 1/8] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
  2010-10-20  8:26 ` [PATCH 2/8] irq: Export irq_to_desc() to modules Sheng Yang
@ 2010-10-20  8:26 ` Sheng Yang
  2010-10-20  8:26 ` [PATCH 4/8] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Michael S. Tsirkin, Sheng Yang

It will be used later by the KVM_CAP_MSIX_MASK related interfaces.
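
As a hedged illustration of the interface this patch wires up (not taken
from the patch itself; enable_vcpu_cap() is a hypothetical helper),
userspace enables such a capability on a vcpu like this:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vcpu_fd: file descriptor returned by KVM_CREATE_VCPU. */
static int enable_vcpu_cap(int vcpu_fd, __u32 cap)
{
	struct kvm_enable_cap enable;

	memset(&enable, 0, sizeof(enable));	/* flags/args must be zero here */
	enable.cap = cap;

	return ioctl(vcpu_fd, KVM_ENABLE_CAP, &enable);	/* 0 on success */
}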

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 Documentation/kvm/api.txt |    8 +++++---
 arch/x86/kvm/x86.c        |   26 ++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index b336266..d82d637 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -817,7 +817,7 @@ documentation when it pops into existence).
 4.36 KVM_ENABLE_CAP
 
 Capability: KVM_CAP_ENABLE_CAP
-Architectures: ppc
+Architectures: ppc, x86
 Type: vcpu ioctl
 Parameters: struct kvm_enable_cap (in)
 Returns: 0 on success; -1 on error
@@ -828,8 +828,10 @@ can enable an extension, making it available to the guest.
 On systems that do not support this ioctl, it always fails. On systems that
 do support it, it only works for extensions that are supported for enablement.
 
-To check if a capability can be enabled, the KVM_CHECK_EXTENSION ioctl should
-be used.
+For PPC, to check if a capability can be enabled, the KVM_CHECK_EXTENSION
+ioctl should be used.
+
+For x86, only some specific capabilities need to be enabled before use.
 
 struct kvm_enable_cap {
        /* in */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f3f86b2..fc62546 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
+	case KVM_CAP_ENABLE_CAP:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -2707,6 +2708,23 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
 	return r;
 }
 
+static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
+				     struct kvm_enable_cap *cap)
+{
+	int r;
+
+	if (cap->flags)
+		return -EINVAL;
+
+	switch (cap->cap) {
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 			 unsigned int ioctl, unsigned long arg)
 {
@@ -2970,6 +2988,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = kvm_vcpu_ioctl_x86_set_xcrs(vcpu, u.xcrs);
 		break;
 	}
+	case KVM_ENABLE_CAP: {
+		struct kvm_enable_cap cap;
+		r = -EFAULT;
+		if (copy_from_user(&cap, argp, sizeof(cap)))
+			goto out;
+		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
+		break;
+	}
 	default:
 		r = -EINVAL;
 	}
-- 
1.7.0.1



* [PATCH 4/8] KVM: Move struct kvm_io_device to kvm_host.h
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
                   ` (2 preceding siblings ...)
  2010-10-20  8:26 ` [PATCH 3/8] KVM: x86: Enable ENABLE_CAP capability for x86 Sheng Yang
@ 2010-10-20  8:26 ` Sheng Yang
  2010-10-20  8:26 ` [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Michael S. Tsirkin, Sheng Yang

Then it can be used in struct kvm_assigned_dev_kernel later.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 include/linux/kvm_host.h |   23 +++++++++++++++++++++++
 virt/kvm/iodev.h         |   25 +------------------------
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b89d00..7f9e4b7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -74,6 +74,29 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			      struct kvm_io_device *dev);
 
+struct kvm_io_device;
+
+/**
+ * kvm_io_device_ops are called under kvm slots_lock.
+ * read and write handlers return 0 if the transaction has been handled,
+ * or non-zero to have it passed to the next device.
+ **/
+struct kvm_io_device_ops {
+	int (*read)(struct kvm_io_device *this,
+		    gpa_t addr,
+		    int len,
+		    void *val);
+	int (*write)(struct kvm_io_device *this,
+		     gpa_t addr,
+		     int len,
+		     const void *val);
+	void (*destructor)(struct kvm_io_device *this);
+};
+
+struct kvm_io_device {
+	const struct kvm_io_device_ops *ops;
+};
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/virt/kvm/iodev.h b/virt/kvm/iodev.h
index 12fd3ca..d1f5651 100644
--- a/virt/kvm/iodev.h
+++ b/virt/kvm/iodev.h
@@ -17,32 +17,9 @@
 #define __KVM_IODEV_H__
 
 #include <linux/kvm_types.h>
+#include <linux/kvm_host.h>
 #include <asm/errno.h>
 
-struct kvm_io_device;
-
-/**
- * kvm_io_device_ops are called under kvm slots_lock.
- * read and write handlers return 0 if the transaction has been handled,
- * or non-zero to have it passed to the next device.
- **/
-struct kvm_io_device_ops {
-	int (*read)(struct kvm_io_device *this,
-		    gpa_t addr,
-		    int len,
-		    void *val);
-	int (*write)(struct kvm_io_device *this,
-		     gpa_t addr,
-		     int len,
-		     const void *val);
-	void (*destructor)(struct kvm_io_device *this);
-};
-
-
-struct kvm_io_device {
-	const struct kvm_io_device_ops *ops;
-};
-
 static inline void kvm_iodevice_init(struct kvm_io_device *dev,
 				     const struct kvm_io_device_ops *ops)
 {
-- 
1.7.0.1



* [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
                   ` (3 preceding siblings ...)
  2010-10-20  8:26 ` [PATCH 4/8] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
@ 2010-10-20  8:26 ` Sheng Yang
  2010-10-20  8:53   ` Avi Kivity
  2010-10-20  8:26 ` [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace Sheng Yang
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Michael S. Tsirkin, Sheng Yang

We need to query the entry later.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 include/linux/kvm_host.h |    2 ++
 virt/kvm/irq_comm.c      |   20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7f9e4b7..30f83cd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -614,6 +614,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
 			const struct kvm_irq_routing_entry *entries,
 			unsigned nr,
 			unsigned flags);
+struct kvm_kernel_irq_routing_entry *kvm_get_irq_routing_entry(struct kvm *kvm,
+							int gsi);
 void kvm_free_irq_routing(struct kvm *kvm);
 
 #else
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 8edca91..80ced8c 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -421,6 +421,26 @@ out:
 	return r;
 }
 
+struct kvm_kernel_irq_routing_entry *kvm_get_irq_routing_entry(struct kvm *kvm,
+							       int gsi)
+{
+	int count = 0;
+	struct kvm_kernel_irq_routing_entry *ei = NULL;
+	struct kvm_irq_routing_table *irq_rt;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+	irq_rt = rcu_dereference(kvm->irq_routing);
+	if (gsi < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(ei, n, &irq_rt->map[gsi], link)
+			count++;
+	rcu_read_unlock();
+	if (count == 1)
+		return ei;
+
+	return NULL;
+}
+
 #define IOAPIC_ROUTING_ENTRY(irq) \
 	{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,	\
 	  .u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
-- 
1.7.0.1



* [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
                   ` (4 preceding siblings ...)
  2010-10-20  8:26 ` [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
@ 2010-10-20  8:26 ` Sheng Yang
  2010-10-20  9:30   ` Avi Kivity
  2010-10-22 14:53   ` Marcelo Tosatti
  2010-10-20  8:26 ` [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing Sheng Yang
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Michael S. Tsirkin, Sheng Yang

The feature won't be enabled until a later patch sets msix_flags_enabled. It
will be enabled along with the in-kernel mask support.
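
For reference, a minimal userspace sketch (not part of the patch;
set_msix_entry_masked() is a hypothetical helper) of how the new
KVM_MSIX_FLAG_MASK flag would be passed down through the existing
KVM_ASSIGN_SET_MSIX_ENTRY vm ioctl once the flag handling is enabled:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd: KVM VM fd; dev_id: assigned_dev_id; gsi/index: routing of the vector. */
static int set_msix_entry_masked(int vm_fd, __u32 dev_id, __u32 gsi,
				 __u16 index, int masked)
{
	struct kvm_assigned_msix_entry entry;

	memset(&entry, 0, sizeof(entry));
	entry.assigned_dev_id = dev_id;
	entry.gsi = gsi;
	entry.entry = index;
	entry.flags = masked ? KVM_MSIX_FLAG_MASK : 0;	/* new in this patch */

	return ioctl(vm_fd, KVM_ASSIGN_SET_MSIX_ENTRY, &entry);
}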

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 arch/x86/include/asm/kvm_host.h |    2 ++
 include/linux/kvm.h             |    6 +++++-
 include/linux/kvm_host.h        |    1 +
 virt/kvm/assigned-dev.c         |   39 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 47 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e209078..2bb69ba 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -456,6 +456,8 @@ struct kvm_arch {
 	/* fields used by HYPER-V emulation */
 	u64 hv_guest_os_id;
 	u64 hv_hypercall;
+
+	bool msix_flags_enabled;
 };
 
 struct kvm_vm_stat {
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 919ae53..a699ec9 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -787,11 +787,15 @@ struct kvm_assigned_msix_nr {
 };
 
 #define KVM_MAX_MSIX_PER_DEV		256
+
+#define KVM_MSIX_FLAG_MASK	1
+
 struct kvm_assigned_msix_entry {
 	__u32 assigned_dev_id;
 	__u32 gsi;
 	__u16 entry; /* The index of entry in the MSI-X table */
-	__u16 padding[3];
+	__u16 flags;
+	__u16 padding[2];
 };
 
 #endif /* __LINUX_KVM_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 30f83cd..81a6284 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -438,6 +438,7 @@ struct kvm_irq_ack_notifier {
 };
 
 #define KVM_ASSIGNED_MSIX_PENDING		0x1
+#define KVM_ASSIGNED_MSIX_MASK			0x2
 struct kvm_guest_msix_entry {
 	u32 vector;
 	u16 entry;
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7c98928..bf96ea7 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -666,11 +666,35 @@ msix_nr_out:
 	return r;
 }
 
+static void update_msix_mask(struct kvm_assigned_dev_kernel *assigned_dev,
+			     int index)
+{
+	int irq;
+	struct irq_desc *desc;
+
+	if (!assigned_dev->dev->msix_enabled ||
+	    !(assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
+		return;
+
+	irq = assigned_dev->host_msix_entries[index].vector;
+	BUG_ON(irq == 0);
+	desc = irq_to_desc(irq);
+	BUG_ON(!desc->msi_desc);
+
+	if (assigned_dev->guest_msix_entries[index].flags &
+			KVM_ASSIGNED_MSIX_MASK) {
+		desc->chip->mask(irq);
+		flush_work(&assigned_dev->interrupt_work);
+	} else
+		desc->chip->unmask(irq);
+}
+
 static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
 				       struct kvm_assigned_msix_entry *entry)
 {
 	int r = 0, i;
 	struct kvm_assigned_dev_kernel *adev;
+	bool entry_masked;
 
 	mutex_lock(&kvm->lock);
 
@@ -688,6 +712,21 @@ static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
 			adev->guest_msix_entries[i].entry = entry->entry;
 			adev->guest_msix_entries[i].vector = entry->gsi;
 			adev->host_msix_entries[i].entry = entry->entry;
+			if (!kvm->arch.msix_flags_enabled)
+				break;
+			entry_masked = adev->guest_msix_entries[i].flags &
+				KVM_ASSIGNED_MSIX_MASK;
+			if ((entry->flags & KVM_MSIX_FLAG_MASK) &&
+					!entry_masked) {
+				adev->guest_msix_entries[i].flags |=
+					KVM_ASSIGNED_MSIX_MASK;
+				update_msix_mask(adev, i);
+			} else if (!(entry->flags & KVM_MSIX_FLAG_MASK) &&
+					entry_masked) {
+				adev->guest_msix_entries[i].flags &=
+					~KVM_ASSIGNED_MSIX_MASK;
+				update_msix_mask(adev, i);
+			}
 			break;
 		}
 	if (i == adev->entries_nr) {
-- 
1.7.0.1



* [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
                   ` (5 preceding siblings ...)
  2010-10-20  8:26 ` [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace Sheng Yang
@ 2010-10-20  8:26 ` Sheng Yang
  2010-10-20  9:46   ` Avi Kivity
  2010-10-20 22:35   ` Michael S. Tsirkin
  2010-10-20  8:26 ` [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices Sheng Yang
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Michael S. Tsirkin, Sheng Yang

This works with KVM_CAP_DEVICE_MSIX_MASK, which we will enable in the
last patch.
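
For reference, the MSI-X table entry layout that the MMIO shortcut below
emulates, expressed as a hedged helper (msix_entry_dword() is hypothetical,
not part of the patch); it mirrors the (addr % PCI_MSIX_ENTRY_SIZE) / 4
indexing used in msix_mmio_read():

#include <linux/pci_regs.h>
#include <linux/kvm_types.h>	/* gpa_t */

/* For a 4-byte aligned access at guest address addr, pick the dword of the
 * emulated entry:
 *   0 -> msi.address_lo   (PCI_MSIX_ENTRY_LOWER_ADDR)
 *   1 -> msi.address_hi   (PCI_MSIX_ENTRY_UPPER_ADDR)
 *   2 -> msi.data         (PCI_MSIX_ENTRY_DATA)
 *   3 -> vector control   (PCI_MSIX_ENTRY_VECTOR_CTRL, bit 0 = mask bit)
 */
static inline unsigned int msix_entry_dword(gpa_t addr)
{
	return (addr % PCI_MSIX_ENTRY_SIZE) / 4;
}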

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 include/linux/kvm.h      |    7 +++
 include/linux/kvm_host.h |    2 +
 virt/kvm/assigned-dev.c  |  131 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 140 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index a699ec9..0a7bd34 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -798,4 +798,11 @@ struct kvm_assigned_msix_entry {
 	__u16 padding[2];
 };
 
+struct kvm_assigned_msix_mmio {
+	__u32 assigned_dev_id;
+	__u64 base_addr;
+	__u32 flags;
+	__u32 reserved[2];
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 81a6284..b67082f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -465,6 +465,8 @@ struct kvm_assigned_dev_kernel {
 	struct pci_dev *dev;
 	struct kvm *kvm;
 	spinlock_t assigned_dev_lock;
+	u64 msix_mmio_base;
+	struct kvm_io_device msix_mmio_dev;
 };
 
 struct kvm_irq_mask_notifier {
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index bf96ea7..5d2adc4 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -739,6 +739,137 @@ msix_entry_out:
 
 	return r;
 }
+
+static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
+			      gpa_t addr, int len, int *idx)
+{
+	int i;
+
+	if (!(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
+		return false;
+	BUG_ON(adev->msix_mmio_base == 0);
+	for (i = 0; i < adev->entries_nr; i++) {
+		u64 start, end;
+		start = adev->msix_mmio_base +
+			adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
+		end = start + PCI_MSIX_ENTRY_SIZE;
+		if (addr >= start && addr + len <= end) {
+			*idx = i;
+			return true;
+		}
+	}
+	return false;
+}
+
+static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
+			  void *val)
+{
+	struct kvm_assigned_dev_kernel *adev =
+			container_of(this, struct kvm_assigned_dev_kernel,
+				     msix_mmio_dev);
+	int idx, r = 0;
+	u32 entry[4];
+	struct kvm_kernel_irq_routing_entry *e;
+
+	mutex_lock(&adev->kvm->lock);
+	if (!msix_mmio_in_range(adev, addr, len, &idx)) {
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || len != 4) {
+		printk(KERN_WARNING
+			"KVM: Unaligned reading for device MSI-X MMIO! "
+			"addr 0x%llx, len %d\n", addr, len);
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+
+	e = kvm_get_irq_routing_entry(adev->kvm,
+			adev->guest_msix_entries[idx].vector);
+	if (!e || e->type != KVM_IRQ_ROUTING_MSI) {
+		printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
+			"addr 0x%llx, len %d\n", addr, len);
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	entry[0] = e->msi.address_lo;
+	entry[1] = e->msi.address_hi;
+	entry[2] = e->msi.data;
+	entry[3] = !!(adev->guest_msix_entries[idx].flags &
+			KVM_ASSIGNED_MSIX_MASK);
+	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4], len);
+
+out:
+	mutex_unlock(&adev->kvm->lock);
+	return r;
+}
+
+static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
+			   const void *val)
+{
+	struct kvm_assigned_dev_kernel *adev =
+			container_of(this, struct kvm_assigned_dev_kernel,
+				     msix_mmio_dev);
+	int idx, r = 0;
+	unsigned long new_val = *(unsigned long *)val;
+	bool entry_masked;
+
+	mutex_lock(&adev->kvm->lock);
+	if (!msix_mmio_in_range(adev, addr, len, &idx)) {
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || len != 4) {
+		printk(KERN_WARNING
+			"KVM: Unaligned writing for device MSI-X MMIO! "
+			"addr 0x%llx, len %d, val 0x%lx\n",
+			addr, len, new_val);
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	entry_masked = adev->guest_msix_entries[idx].flags &
+			KVM_ASSIGNED_MSIX_MASK;
+	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
+		/* Only allow entry modification when entry was masked */
+		if (!entry_masked) {
+			printk(KERN_WARNING
+				"KVM: guest try to write unmasked MSI-X entry. "
+				"addr 0x%llx, len %d, val 0x%lx\n",
+				addr, len, new_val);
+			r = 0;
+		} else
+			/* Leave it to QEmu */
+			r = -EOPNOTSUPP;
+		goto out;
+	}
+	if (new_val & ~1ul) {
+		printk(KERN_WARNING
+			"KVM: Bad writing for device MSI-X MMIO! "
+			"addr 0x%llx, len %d, val 0x%lx\n",
+			addr, len, new_val);
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	if (new_val == 1 && !entry_masked) {
+		adev->guest_msix_entries[idx].flags |=
+			KVM_ASSIGNED_MSIX_MASK;
+		update_msix_mask(adev, idx);
+	} else if (new_val == 0 && entry_masked) {
+		adev->guest_msix_entries[idx].flags &=
+			~KVM_ASSIGNED_MSIX_MASK;
+		update_msix_mask(adev, idx);
+	}
+out:
+	mutex_unlock(&adev->kvm->lock);
+
+	return r;
+}
+
+static const struct kvm_io_device_ops msix_mmio_ops = {
+	.read     = msix_mmio_read,
+	.write    = msix_mmio_write,
+};
+
 #endif
 
 long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
-- 
1.7.0.1



* [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
                   ` (6 preceding siblings ...)
  2010-10-20  8:26 ` [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing Sheng Yang
@ 2010-10-20  8:26 ` Sheng Yang
  2010-10-20  9:49   ` Avi Kivity
                     ` (2 more replies)
  2010-10-20  9:51 ` [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Avi Kivity
                   ` (2 subsequent siblings)
  10 siblings, 3 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:26 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Michael S. Tsirkin, Sheng Yang

This patch enables per-vector masking for assigned devices using MSI-X.
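
As a sketch of the intended usage (register_msix_mmio() is a hypothetical
helper; the capability, ioctl and struct names come from this series),
userspace would first enable KVM_CAP_DEVICE_MSIX_EXT on the vcpu (patch 3)
and then register the guest-physical address of the device's MSI-X table
with the new vm ioctl:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* table_gpa: guest-physical address of the assigned device's MSI-X table. */
static int register_msix_mmio(int vm_fd, int vcpu_fd, __u32 dev_id,
			      __u64 table_gpa)
{
	struct kvm_enable_cap enable;
	struct kvm_assigned_msix_mmio mmio;

	memset(&enable, 0, sizeof(enable));
	enable.cap = KVM_CAP_DEVICE_MSIX_EXT;
	if (ioctl(vcpu_fd, KVM_ENABLE_CAP, &enable) < 0)
		return -1;

	memset(&mmio, 0, sizeof(mmio));		/* flags/reserved must be 0 */
	mmio.assigned_dev_id = dev_id;
	mmio.base_addr = table_gpa;

	return ioctl(vm_fd, KVM_ASSIGN_REG_MSIX_MMIO, &mmio);
}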

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 Documentation/kvm/api.txt |   22 ++++++++++++++++
 arch/x86/kvm/x86.c        |    6 ++++
 include/linux/kvm.h       |    8 +++++-
 virt/kvm/assigned-dev.c   |   60 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 95 insertions(+), 1 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index d82d637..f324a50 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
 If any additional field gets added to this structure later on, a bit for that
 additional piece of information will be set in the flags bitmap.
 
+4.47 KVM_ASSIGN_REG_MSIX_MMIO
+
+Capability: KVM_CAP_DEVICE_MSIX_MASK
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_mmio (in)
+Returns: 0 on success, !0 on error
+
+struct kvm_assigned_msix_mmio {
+	/* Assigned device's ID */
+	__u32 assigned_dev_id;
+	/* MSI-X table MMIO address */
+	__u64 base_addr;
+	/* Must be 0 */
+	__u32 flags;
+	/* Must be 0, reserved for future use */
+	__u64 reserved;
+};
+
+This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
+mask bit in the kernel.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fc62546..ba07a2f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
 	case KVM_CAP_ENABLE_CAP:
+	case KVM_CAP_DEVICE_MSIX_EXT:
+	case KVM_CAP_DEVICE_MSIX_MASK:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	switch (cap->cap) {
+	case KVM_CAP_DEVICE_MSIX_EXT:
+		vcpu->kvm->arch.msix_flags_enabled = true;
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 0a7bd34..1494ed0 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
 #endif
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
+#ifdef __KVM_HAVE_MSIX
+#define KVM_CAP_DEVICE_MSIX_EXT 59
+#define KVM_CAP_DEVICE_MSIX_MASK 60
+#endif
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -671,6 +675,8 @@ struct kvm_clock_data {
 #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
 #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
 #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
+#define KVM_ASSIGN_REG_MSIX_MMIO  _IOW(KVMIO,  0x7d, \
+					struct kvm_assigned_msix_mmio)
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
@@ -802,7 +808,7 @@ struct kvm_assigned_msix_mmio {
 	__u32 assigned_dev_id;
 	__u64 base_addr;
 	__u32 flags;
-	__u32 reserved[2];
+	__u64 reserved;
 };
 
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 5d2adc4..9573194 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -17,6 +17,8 @@
 #include <linux/pci.h>
 #include <linux/interrupt.h>
 #include <linux/slab.h>
+#include <linux/irqnr.h>
+
 #include "irq.h"
 
 static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
@@ -169,6 +171,14 @@ static void deassign_host_irq(struct kvm *kvm,
 	 */
 	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
 		int i;
+#ifdef __KVM_HAVE_MSIX
+		if (assigned_dev->msix_mmio_base) {
+			mutex_lock(&kvm->slots_lock);
+			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
+					&assigned_dev->msix_mmio_dev);
+			mutex_unlock(&kvm->slots_lock);
+		}
+#endif
 		for (i = 0; i < assigned_dev->entries_nr; i++)
 			disable_irq_nosync(assigned_dev->
 					   host_msix_entries[i].vector);
@@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
 			goto err;
 	}
 
+	if (dev->msix_mmio_base) {
+		mutex_lock(&kvm->slots_lock);
+		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
+				&dev->msix_mmio_dev);
+		mutex_unlock(&kvm->slots_lock);
+		if (r)
+			goto err;
+	}
+
 	return 0;
 err:
 	for (i -= 1; i >= 0; i--)
@@ -870,6 +889,31 @@ static const struct kvm_io_device_ops msix_mmio_ops = {
 	.write    = msix_mmio_write,
 };
 
+static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
+				struct kvm_assigned_msix_mmio *msix_mmio)
+{
+	int r = 0;
+	struct kvm_assigned_dev_kernel *adev;
+
+	mutex_lock(&kvm->lock);
+	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+				      msix_mmio->assigned_dev_id);
+	if (!adev) {
+		r = -EINVAL;
+		goto out;
+	}
+	if (msix_mmio->base_addr == 0) {
+		r = -EINVAL;
+		goto out;
+	}
+	adev->msix_mmio_base = msix_mmio->base_addr;
+
+	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
+out:
+	mutex_unlock(&kvm->lock);
+
+	return r;
+}
 #endif
 
 long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
@@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 			goto out;
 		break;
 	}
+	case KVM_ASSIGN_REG_MSIX_MMIO: {
+		struct kvm_assigned_msix_mmio msix_mmio;
+
+		r = -EFAULT;
+		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
+			goto out;
+
+		r = -EINVAL;
+		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
+			goto out;
+
+		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
+		if (r)
+			goto out;
+		break;
+	}
 #endif
 	}
 out:
-- 
1.7.0.1



* Re: [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func
  2010-10-20  8:26 ` [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
@ 2010-10-20  8:53   ` Avi Kivity
  2010-10-20  8:58     ` Sheng Yang
  2010-10-20  9:13     ` Sheng Yang
  0 siblings, 2 replies; 66+ messages in thread
From: Avi Kivity @ 2010-10-20  8:53 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin

  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> We need to query the entry later.
>
>
> +struct kvm_kernel_irq_routing_entry *kvm_get_irq_routing_entry(struct kvm *kvm,
> +							       int gsi)
> +{
> +	int count = 0;
> +	struct kvm_kernel_irq_routing_entry *ei = NULL;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct hlist_node *n;
> +
> +	rcu_read_lock();
> +	irq_rt = rcu_dereference(kvm->irq_routing);
> +	if (gsi<  irq_rt->nr_rt_entries)
> +		hlist_for_each_entry(ei, n,&irq_rt->map[gsi], link)
> +			count++;
> +	rcu_read_unlock();
> +	if (count == 1)
> +		return ei;
> +
> +	return NULL;
> +}
> +

I believe this is incorrect rcu usage.  rcu_read_lock() prevents ei from 
being destroyed under us, but rcu_read_unlock() removes that protection, 
and a future dereference of ei may access freed memory.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func
  2010-10-20  8:53   ` Avi Kivity
@ 2010-10-20  8:58     ` Sheng Yang
  2010-10-20  9:13     ` Sheng Yang
  1 sibling, 0 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  8:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin

On Wednesday 20 October 2010 16:53:02 Avi Kivity wrote:
>   On 10/20/2010 10:26 AM, Sheng Yang wrote:
> > We need to query the entry later.
> > 
> > 
> > +struct kvm_kernel_irq_routing_entry *kvm_get_irq_routing_entry(struct
> > kvm *kvm, +							       int gsi)
> > +{
> > +	int count = 0;
> > +	struct kvm_kernel_irq_routing_entry *ei = NULL;
> > +	struct kvm_irq_routing_table *irq_rt;
> > +	struct hlist_node *n;
> > +
> > +	rcu_read_lock();
> > +	irq_rt = rcu_dereference(kvm->irq_routing);
> > +	if (gsi<  irq_rt->nr_rt_entries)
> > +		hlist_for_each_entry(ei, n,&irq_rt->map[gsi], link)
> > +			count++;
> > +	rcu_read_unlock();
> > +	if (count == 1)
> > +		return ei;
> > +
> > +	return NULL;
> > +}
> > +
> 
> I believe this is incorrect rcu usage.  rcu_read_lock() prevents ei from
> being destroyed under us, but rcu_read_unlock() removes that protection,
> and a future dereference of ei may access freed memory.

Yes... I'll update the patch by copying it to the caller's variable.

--
regards
Yang, Sheng


* [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func
  2010-10-20  8:53   ` Avi Kivity
  2010-10-20  8:58     ` Sheng Yang
@ 2010-10-20  9:13     ` Sheng Yang
  2010-10-20  9:17       ` Sheng Yang
  1 sibling, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  9:13 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: Michael S. Tsirkin, kvm, Sheng Yang

We need to query the entry later.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 include/linux/kvm_host.h |    2 ++
 virt/kvm/irq_comm.c      |   20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7f9e4b7..e2ecbac 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -614,6 +614,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
 			const struct kvm_irq_routing_entry *entries,
 			unsigned nr,
 			unsigned flags);
+int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
+		struct kvm_kernel_irq_routing_entry *entry);
 void kvm_free_irq_routing(struct kvm *kvm);
 
 #else
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 8edca91..870d083 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -421,6 +421,26 @@ out:
 	return r;
 }
 
+int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
+		struct kvm_kernel_irq_routing_entry *entry)
+{
+	int count = 0;
+	struct kvm_kernel_irq_routing_entry *ei = NULL;
+	struct kvm_irq_routing_table *irq_rt;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+	irq_rt = rcu_dereference(kvm->irq_routing);
+	if (gsi < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(ei, n, &irq_rt->map[gsi], link)
+			count++;
+	rcu_read_unlock();
+	if (count == 1)
+		memcpy(entry, ei, sizeof(*ei));
+
+	return count == 1;
+}
+
 #define IOAPIC_ROUTING_ENTRY(irq) \
 	{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,	\
 	  .u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
-- 
1.7.0.1



* [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func
  2010-10-20  9:13     ` Sheng Yang
@ 2010-10-20  9:17       ` Sheng Yang
  2010-10-20  9:32         ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-20  9:17 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti; +Cc: Michael S. Tsirkin, kvm, Sheng Yang

We need to query the entry later.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 include/linux/kvm_host.h |    2 ++
 virt/kvm/irq_comm.c      |   20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7f9e4b7..e2ecbac 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -614,6 +614,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
 			const struct kvm_irq_routing_entry *entries,
 			unsigned nr,
 			unsigned flags);
+int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
+		struct kvm_kernel_irq_routing_entry *entry);
 void kvm_free_irq_routing(struct kvm *kvm);
 
 #else
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 8edca91..870d083 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -421,6 +421,26 @@ out:
 	return r;
 }
 
+int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
+		struct kvm_kernel_irq_routing_entry *entry)
+{
+	int count = 0;
+	struct kvm_kernel_irq_routing_entry *ei = NULL;
+	struct kvm_irq_routing_table *irq_rt;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+	irq_rt = rcu_dereference(kvm->irq_routing);
+	if (gsi < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(ei, n, &irq_rt->map[gsi], link)
+			count++;
+	rcu_read_unlock();
+	if (count == 1)
+		memcpy(entry, ei, sizeof(*ei));
+
+	return (count != 1);
+}
+
 #define IOAPIC_ROUTING_ENTRY(irq) \
 	{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,	\
 	  .u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
-- 
1.7.0.1



* Re: [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace
  2010-10-20  8:26 ` [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace Sheng Yang
@ 2010-10-20  9:30   ` Avi Kivity
  2010-10-22 14:53   ` Marcelo Tosatti
  1 sibling, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-10-20  9:30 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin

  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> The feature won't be enabled until a later patch sets msix_flags_enabled. It
> will be enabled along with the in-kernel mask support.
>
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e209078..2bb69ba 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -456,6 +456,8 @@ struct kvm_arch {
>   	/* fields used by HYPER-V emulation */
>   	u64 hv_guest_os_id;
>   	u64 hv_hypercall;
> +
> +	bool msix_flags_enabled;
>   };

In theory this should be in generic code, under #ifdef 
KVM_APIC_ARCHITECTURE.  In practice x86 is fine since ia64 isn't 
actively developed.

> @@ -666,11 +666,35 @@ msix_nr_out:
>   	return r;
>   }
>
> +static void update_msix_mask(struct kvm_assigned_dev_kernel *assigned_dev,
> +			     int index)
> +{
> +	int irq;
> +	struct irq_desc *desc;
> +
> +	if (!assigned_dev->dev->msix_enabled ||
> +	    !(assigned_dev->irq_requested_type&  KVM_DEV_IRQ_HOST_MSIX))
> +		return;
> +
> +	irq = assigned_dev->host_msix_entries[index].vector;
> +	BUG_ON(irq == 0);
> +	desc = irq_to_desc(irq);
> +	BUG_ON(!desc->msi_desc);
> +
> +	if (assigned_dev->guest_msix_entries[index].flags&
> +			KVM_ASSIGNED_MSIX_MASK) {
> +		desc->chip->mask(irq);
> +		flush_work(&assigned_dev->interrupt_work);
> +	} else
> +		desc->chip->unmask(irq);
> +}
> +
>   static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
>   				       struct kvm_assigned_msix_entry *entry)
>   {
>   	int r = 0, i;
>   	struct kvm_assigned_dev_kernel *adev;
> +	bool entry_masked;
>
>   	mutex_lock(&kvm->lock);
>
> @@ -688,6 +712,21 @@ static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
>   			adev->guest_msix_entries[i].entry = entry->entry;
>   			adev->guest_msix_entries[i].vector = entry->gsi;
>   			adev->host_msix_entries[i].entry = entry->entry;
> +			if (!kvm->arch.msix_flags_enabled)
> +				break;
> +			entry_masked = adev->guest_msix_entries[i].flags&
> +				KVM_ASSIGNED_MSIX_MASK;
> +			if ((entry->flags&  KVM_MSIX_FLAG_MASK)&&
> +					!entry_masked) {
> +				adev->guest_msix_entries[i].flags |=
> +					KVM_ASSIGNED_MSIX_MASK;
> +				update_msix_mask(adev, i);
> +			} else if (!(entry->flags&  KVM_MSIX_FLAG_MASK)&&
> +					entry_masked) {
> +				adev->guest_msix_entries[i].flags&=
> +					~KVM_ASSIGNED_MSIX_MASK;
> +				update_msix_mask(adev, i);
> +			}
>   			break;
>   		}

I think you can fold these two functions together; that will get rid of the 
extra if and be a little more readable.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func
  2010-10-20  9:17       ` Sheng Yang
@ 2010-10-20  9:32         ` Avi Kivity
  0 siblings, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-10-20  9:32 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, Michael S. Tsirkin, kvm

  On 10/20/2010 11:17 AM, Sheng Yang wrote:
>   }
>
> +int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
> +		struct kvm_kernel_irq_routing_entry *entry)
> +{
> +	int count = 0;
> +	struct kvm_kernel_irq_routing_entry *ei = NULL;
> +	struct kvm_irq_routing_table *irq_rt;
> +	struct hlist_node *n;
> +
> +	rcu_read_lock();
> +	irq_rt = rcu_dereference(kvm->irq_routing);
> +	if (gsi<  irq_rt->nr_rt_entries)
> +		hlist_for_each_entry(ei, n,&irq_rt->map[gsi], link)
> +			count++;
> +	rcu_read_unlock();
> +	if (count == 1)
> +		memcpy(entry, ei, sizeof(*ei));

Need to be before the unlock.

> +
> +	return (count != 1);
> +}
> +

"*entry = *ei" is clearer and safer.  But is it correct? we might be 
using outdated data.
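
[For reference, a sketch of the variant described above -- copying the entry
while still inside the RCU read side and returning 0 on success.  This only
illustrates the review comment, it is not the patch that was finally applied,
and it does not address the staleness question raised here:]

int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
		struct kvm_kernel_irq_routing_entry *entry)
{
	int count = 0;
	struct kvm_kernel_irq_routing_entry *ei = NULL;
	struct kvm_irq_routing_table *irq_rt;
	struct hlist_node *n;

	rcu_read_lock();
	irq_rt = rcu_dereference(kvm->irq_routing);
	if (gsi < irq_rt->nr_rt_entries)
		hlist_for_each_entry(ei, n, &irq_rt->map[gsi], link)
			count++;
	if (count == 1)
		*entry = *ei;	/* copy before dropping the RCU read lock */
	rcu_read_unlock();

	return (count == 1) ? 0 : -EINVAL;
}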

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-20  8:26 ` [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing Sheng Yang
@ 2010-10-20  9:46   ` Avi Kivity
  2010-10-20 10:33     ` Michael S. Tsirkin
  2010-10-21  6:46     ` Sheng Yang
  2010-10-20 22:35   ` Michael S. Tsirkin
  1 sibling, 2 replies; 66+ messages in thread
From: Avi Kivity @ 2010-10-20  9:46 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin

  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> This works with KVM_CAP_DEVICE_MSIX_MASK, which we will enable in the
> last patch.
>
>
> +struct kvm_assigned_msix_mmio {
> +	__u32 assigned_dev_id;
> +	__u64 base_addr;

Different alignment and size on 32 and 64 bits.

Is base_addr a guest physical address?  Do we need a size or it it fixed?

> +	__u32 flags;
> +	__u32 reserved[2];
> +};
> +
>
> @@ -465,6 +465,8 @@ struct kvm_assigned_dev_kernel {
>   	struct pci_dev *dev;
>   	struct kvm *kvm;
>   	spinlock_t assigned_dev_lock;
> +	u64 msix_mmio_base;

gpa_t.

> +	struct kvm_io_device msix_mmio_dev;
>   };
>
>   struct kvm_irq_mask_notifier {
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index bf96ea7..5d2adc4 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -739,6 +739,137 @@ msix_entry_out:
>
>   	return r;
>   }
> +
> +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> +			      gpa_t addr, int len, int *idx)
> +{
> +	int i;
> +
> +	if (!(adev->irq_requested_type&  KVM_DEV_IRQ_HOST_MSIX))
> +		return false;

Just don't install the io_device in that case.

> +	BUG_ON(adev->msix_mmio_base == 0);
> +	for (i = 0; i<  adev->entries_nr; i++) {
> +		u64 start, end;
> +		start = adev->msix_mmio_base +
> +			adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
> +		end = start + PCI_MSIX_ENTRY_SIZE;
> +		if (addr>= start&&  addr + len<= end) {
> +			*idx = i;
> +			return true;
> +		}

What if it's a partial hit?  write part of an entry and part of another 
entry?

> +	}
> +	return false;
> +}
> +
> +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> +			  void *val)
> +{
> +	struct kvm_assigned_dev_kernel *adev =
> +			container_of(this, struct kvm_assigned_dev_kernel,
> +				     msix_mmio_dev);
> +	int idx, r = 0;
> +	u32 entry[4];
> +	struct kvm_kernel_irq_routing_entry *e;
> +
> +	mutex_lock(&adev->kvm->lock);
> +	if (!msix_mmio_in_range(adev, addr, len,&idx)) {
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if ((addr&  0x3) || len != 4) {
> +		printk(KERN_WARNING
> +			"KVM: Unaligned reading for device MSI-X MMIO! "
> +			"addr 0x%llx, len %d\n", addr, len);

Guest exploitable printk()

> +		r = -EOPNOTSUPP;

If the guest assigned the device to another guest, it allows the nested 
guest to kill the non-nested guest.  Need to exit in a graceful fashion.

> +		goto out;
> +	}
> +
> +	e = kvm_get_irq_routing_entry(adev->kvm,
> +			adev->guest_msix_entries[idx].vector);
> +	if (!e || e->type != KVM_IRQ_ROUTING_MSI) {
> +		printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> +			"addr 0x%llx, len %d\n", addr, len);
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	entry[0] = e->msi.address_lo;
> +	entry[1] = e->msi.address_hi;
> +	entry[2] = e->msi.data;
> +	entry[3] = !!(adev->guest_msix_entries[idx].flags&
> +			KVM_ASSIGNED_MSIX_MASK);
> +	memcpy(val,&entry[addr % PCI_MSIX_ENTRY_SIZE / 4], len);
> +
> +out:
> +	mutex_unlock(&adev->kvm->lock);
> +	return r;
> +}
> +
> +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> +			   const void *val)
> +{
> +	struct kvm_assigned_dev_kernel *adev =
> +			container_of(this, struct kvm_assigned_dev_kernel,
> +				     msix_mmio_dev);
> +	int idx, r = 0;
> +	unsigned long new_val = *(unsigned long *)val;
> +	bool entry_masked;
> +
> +	mutex_lock(&adev->kvm->lock);
> +	if (!msix_mmio_in_range(adev, addr, len,&idx)) {
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if ((addr&  0x3) || len != 4) {
> +		printk(KERN_WARNING
> +			"KVM: Unaligned writing for device MSI-X MMIO! "
> +			"addr 0x%llx, len %d, val 0x%lx\n",
> +			addr, len, new_val);
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	entry_masked = adev->guest_msix_entries[idx].flags&
> +			KVM_ASSIGNED_MSIX_MASK;
> +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> +		/* Only allow entry modification when entry was masked */
> +		if (!entry_masked) {
> +			printk(KERN_WARNING
> +				"KVM: guest try to write unmasked MSI-X entry. "
> +				"addr 0x%llx, len %d, val 0x%lx\n",
> +				addr, len, new_val);
> +			r = 0;

What does the spec say about this situation?

> +		} else
> +			/* Leave it to QEmu */

s/qemu/userspace/

> +			r = -EOPNOTSUPP;

What would userspace do in this situation?  I hope you documented 
precisely what the kernel handles and what it doesn't?

I prefer more kernel code in the kernel to having an interface which is 
hard to use correctly.

> +		goto out;
> +	}
> +	if (new_val&  ~1ul) {

Is there a #define for this bit?

> +		printk(KERN_WARNING
> +			"KVM: Bad writing for device MSI-X MMIO! "
> +			"addr 0x%llx, len %d, val 0x%lx\n",
> +			addr, len, new_val);
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if (new_val == 1&&  !entry_masked) {
> +		adev->guest_msix_entries[idx].flags |=
> +			KVM_ASSIGNED_MSIX_MASK;
> +		update_msix_mask(adev, idx);
> +	} else if (new_val == 0&&  entry_masked) {
> +		adev->guest_msix_entries[idx].flags&=
> +			~KVM_ASSIGNED_MSIX_MASK;
> +		update_msix_mask(adev, idx);
> +	}

Ah, I see you do reuse update_msix_mask().

> +out:
> +	mutex_unlock(&adev->kvm->lock);
> +
> +	return r;
> +}
> +
> +static const struct kvm_io_device_ops msix_mmio_ops = {
> +	.read     = msix_mmio_read,
> +	.write    = msix_mmio_write,
> +};
> +
>   #endif
>
>   long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,


-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-20  8:26 ` [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices Sheng Yang
@ 2010-10-20  9:49   ` Avi Kivity
  2010-10-20 22:24   ` Michael S. Tsirkin
  2010-10-21  8:30   ` Sheng Yang
  2 siblings, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-10-20  9:49 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin

  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> This patch enables per-vector masking for assigned devices using MSI-X.
>
>
> @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
>   If any additional field gets added to this structure later on, a bit for that
>   additional piece of information will be set in the flags bitmap.
>
> +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> +
> +Capability: KVM_CAP_DEVICE_MSIX_MASK
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_assigned_msix_mmio (in)
> +Returns: 0 on success, !0 on error
> +
> +struct kvm_assigned_msix_mmio {
> +	/* Assigned device's ID */
> +	__u32 assigned_dev_id;
> +	/* MSI-X table MMIO address */
> +	__u64 base_addr;
> +	/* Must be 0 */
> +	__u32 flags;
> +	/* Must be 0, reserved for future use */
> +	__u64 reserved;
> +};
> +
> +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
> +mask bit in the kernel.

Need to clarify what happens for non-mask bits.
> @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>   		return -EINVAL;
>
>   	switch (cap->cap) {
> +	case KVM_CAP_DEVICE_MSIX_EXT:
> +		vcpu->kvm->arch.msix_flags_enabled = true;
> +		r = 0;
> +		break;

It's strange to enable a per-vm capability with a vcpu ioctl.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
                   ` (7 preceding siblings ...)
  2010-10-20  8:26 ` [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices Sheng Yang
@ 2010-10-20  9:51 ` Avi Kivity
  2010-10-20 10:44   ` Michael S. Tsirkin
  2010-10-21  7:41   ` Sheng Yang
  2010-10-20 19:02 ` Marcelo Tosatti
  2010-10-20 22:20 ` Michael S. Tsirkin
  10 siblings, 2 replies; 66+ messages in thread
From: Avi Kivity @ 2010-10-20  9:51 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin, Alex Williamson

  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> Here is v2.
>
> Changelog:
>
> v1->v2
>
> The major change from v1 is that I've added in-kernel MSI-X mask emulation
> support, as well as shortcuts for reading the MSI-X table.
>
> I've taken Michael's advice to use mask/unmask directly, but I'm unsure about
> exporting irq_to_desc() to modules...
>
> I've also added flush_work() according to Marcelo's comments.
>

Any performance numbers?  What are the affected guests?  just RHEL 4, or 
any others?

Alex, Michael, how would you do this with vfio?

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-20  9:46   ` Avi Kivity
@ 2010-10-20 10:33     ` Michael S. Tsirkin
  2010-10-21  6:46     ` Sheng Yang
  1 sibling, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 10:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Sheng Yang, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 11:46:47AM +0200, Avi Kivity wrote:
> >+		/* Only allow entry modification when entry was masked */
> >+		if (!entry_masked) {
> >+			printk(KERN_WARNING
> >+				"KVM: guest try to write unmasked MSI-X entry. "
> >+				"addr 0x%llx, len %d, val 0x%lx\n",
> >+				addr, len, new_val);
> >+			r = 0;
> 
> What does the spec say about this situation?

That it's not allowed.  However, existing userspace changes entries that
kvm thinks are unmasked, because it does not know how to tell kvm about
mask bits, so we need to keep supporting this.


-- 
MST


* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20  9:51 ` [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Avi Kivity
@ 2010-10-20 10:44   ` Michael S. Tsirkin
  2010-10-20 10:59     ` Avi Kivity
  2010-10-20 14:47     ` Alex Williamson
  2010-10-21  7:41   ` Sheng Yang
  1 sibling, 2 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 10:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Sheng Yang, Marcelo Tosatti, kvm, Alex Williamson

On Wed, Oct 20, 2010 at 11:51:01AM +0200, Avi Kivity wrote:
>  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> >Here is v2.
> >
> >Changelog:
> >
> >v1->v2
> >
> >The major change from v1 is that I've added in-kernel MSI-X mask emulation
> >support, as well as shortcuts for reading the MSI-X table.
> >
> >I've taken Michael's advice to use mask/unmask directly, but I'm unsure about
> >exporting irq_to_desc() to modules...
> >
> >I've also added flush_work() according to Marcelo's comments.
> >
> 
> Any performance numbers?  What are the affected guests?  just RHEL
> 4, or any others?

Likely any old linux.

> Alex, Michael, how would you do this with vfio?

With current VFIO we would catch mask writes in qemu and
call a KVM ioctl. We would also need an ioctl to retrieve
pending bits long term.

I think that it is unfortunate that we need to do this in userspace
while rest of configuration is done in kernel.
I would be much happier with userspace simply forwarding
everything to VFIO, so emulation does not have to
be split. That would be a clean interface: just mmap
MSIX BAR and forget about it.

If instead of eventfd we had a file descriptor that can pass vector
information from vfio to kvm and back, that would fix it,
as we would not need to set us GSIs at all,
and not need for userspace to handle MSIX specially.


> -- 
> error compiling committee.c: too many arguments to function


* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 10:44   ` Michael S. Tsirkin
@ 2010-10-20 10:59     ` Avi Kivity
  2010-10-20 13:43       ` Michael S. Tsirkin
  2010-10-20 14:47     ` Alex Williamson
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-10-20 10:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Sheng Yang, Marcelo Tosatti, kvm, Alex Williamson

  On 10/20/2010 12:44 PM, Michael S. Tsirkin wrote:
> >
> >  Any performance numbers?  What are the affected guests?  just RHEL
> >  4, or any others?
>
> Likely any old linux.

I meant guests that people are likely to virtualize and expect high 
performance from.

What about RHEL 3?  Does it support msi?  How about RHEL 5 - has it 
fixed this problem?

> >  Alex, Michael, how would you do this with vfio?
>
> With current VFIO we would catch mask writes in qemu and
> call a KVM ioctl.

Doing what?  Updating the irq routing to include/exclude the interrupt?  
Disconnecting the irqfd?

Note you could disconnect the irqfd from either vfio or kvm.

> We would also need an ioctl to retrieve
> pending bits long term.

Suppose you disconnect the irqfd.   Isn't the value of the eventfd 
equivalent to the pending bit?

> I think that it is unfortunate that we need to do this in userspace
> while rest of configuration is done in kernel.
> I would be much happier with userspace simply forwarding
> everything to VFIO, so emulation does not have to
> be split. That would be a clean interface: just mmap
> MSIX BAR and forget about it.

Agree.

> If instead of eventfd we had a file descriptor that can pass vector
> information from vfio to kvm and back, that would fix it,
> as we would not need to set us GSIs at all,
> and not need for userspace to handle MSIX specially.
>

But if we emulate the entire msix bar in vfio, that's not needed, right?

How far away is vfio?  If it's merged soon, we might avoid making 
changes to the old assigned device infrastructure and instead update 
vfio.  On the other hand, changes to the old infrastructure are much 
more amenable to backporting for long term support distro kernels, so we 
may need to actively develop both for a while.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH 1/8] PCI: MSI: Move MSI-X entry definition to pci_regs.h
  2010-10-20  8:26 ` [PATCH 1/8] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
@ 2010-10-20 11:07   ` Matthew Wilcox
  0 siblings, 0 replies; 66+ messages in thread
From: Matthew Wilcox @ 2010-10-20 11:07 UTC (permalink / raw)
  To: Sheng Yang
  Cc: Avi Kivity, Marcelo Tosatti, kvm, Michael S. Tsirkin,
	Jesse Barnes, linux-pci

On Wed, Oct 20, 2010 at 04:26:25PM +0800, Sheng Yang wrote:
> It will be used by KVM later.
> 
> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>

Thanks for doing this.  It should have been done long ago :-)

Reviewed-by: Matthew Wilcox <willy@linux.intel.com>

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 10:59     ` Avi Kivity
@ 2010-10-20 13:43       ` Michael S. Tsirkin
  2010-10-20 14:58         ` Alex Williamson
  2010-10-20 15:17         ` Avi Kivity
  0 siblings, 2 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 13:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Sheng Yang, Marcelo Tosatti, kvm, Alex Williamson

On Wed, Oct 20, 2010 at 12:59:42PM +0200, Avi Kivity wrote:
>  On 10/20/2010 12:44 PM, Michael S. Tsirkin wrote:
> >>
> >>  Any performance numbers?  What are the affected guests?  just RHEL
> >>  4, or any others?
> >
> >Likely any old linux.
> 
> I meant that people are likely to virtualize and expect high
> performance from.
> 
> What about RHEL 3?  Does it support msi?

Yes. I think it's the same.

>  How about RHEL 5 - has it
> fixed this problem?

Yes, it has reduced the problem. Instead of masking immediately it
records the masked state and masks the first time it gets an interrupt.
Most devices mask and unmask immediately so the chance of this happening
is small.
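
To make that concrete, here is a minimal sketch of the deferred-mask
pattern as I read it; this is my own reconstruction of the behaviour
described above, not the actual RHEL 5 code, and hw_mask_vector() is a
made-up helper:

static bool vec_guest_masked;	/* what the driver asked for */
static bool vec_hw_masked;	/* what was actually written to the device */

static void msix_mask_vector_lazy(void)
{
	vec_guest_masked = true;	/* record only, no MMIO write yet */
}

static irqreturn_t msix_irq(int irq, void *arg)
{
	if (vec_guest_masked && !vec_hw_masked) {
		hw_mask_vector(irq);	/* hypothetical helper: the real
					 * vector control write happens
					 * only now */
		vec_hw_masked = true;
		return IRQ_HANDLED;
	}
	/* ... normal handling ... */
	return IRQ_HANDLED;
}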

> >>  Alex, Michael, how would you do this with vfio?
> >
> >With current VFIO we would catch mask writes in qemu and
> >call a KVM ioctl.
> 
> Doing what?  Updating the irq routing to include/exclude the
> interrupt?  Disconnecting the irqfd?

No. I mean call the new mask ioctl.

> Note you could disconnect the irqfd from either vfio or kvm.

This is what current code does, implementing mask in userspace.
But it is on the RCU write side, so it is a slow-path operation.

> >We would also need an ioctl to retrieve
> >pending bits long term.
> 
> Suppose you disconnect the irqfd.   Isn't the value of the eventfd
> equivalent to the pending bit?

Yes. This is what current code does.

> >I think that it is unfortunate that we need to do this in userspace
> >while rest of configuration is done in kernel.
> >I would be much happier with userspace simply forwarding
> >everything to VFIO, so emulation does not have to
> >be split. That would be a clean interface: just mmap
> >MSIX BAR and forget about it.
> 
> Agree.
> 
> >If instead of eventfd we had a file descriptor that can pass vector
> >information from vfio to kvm and back, that would fix it,
> >as we would not need to set us GSIs at all,
> >and not need for userspace to handle MSIX specially.
> >
> 
> But if we emulate the entire msix bar in vfio, that's not needed, right?

Yes, I think it is still needed.  How does kvm know which interrupt to inject?
Either vfio needs to pass that info to qemu and qemu would pass it
to kvm, or vfio would have some way to pass that info to kvm
directly.

> How far away is vfio?  If it's merged soon, we might avoid making
> changes to the old assigned device infrastructure and instead update
> vfio.

Hard to be sure, hopefully 2.6.38 material.

Some issues off the top of my head are
- readonly/virtualized table correctness
	hopefully will start converging now that
	we are switching to standard registers from pci_regs.h
- work out some capability negotiation mechanism so userspace/kernel
  can detect bug fixes/missing features and recover or fail gracefully
- with multiple assigned devices in a guest:
  I don't think we have figured out how they share an iommu context
- maybe: reset handling (flr support) - need to look at it


> On the other hand, changes to the old infrastructure are much
> more amenable to backporting for long term support distro kernels,
> so we may need to actively develop both for a while.

Right.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 14:47     ` Alex Williamson
@ 2010-10-20 14:46       ` Michael S. Tsirkin
  2010-10-20 15:07         ` Alex Williamson
  2010-10-20 15:23       ` Avi Kivity
  1 sibling, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 14:46 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 08:47:14AM -0600, Alex Williamson wrote:
> On Wed, 2010-10-20 at 12:44 +0200, Michael S. Tsirkin wrote:
> > On Wed, Oct 20, 2010 at 11:51:01AM +0200, Avi Kivity wrote:
> > >  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> > > >Here is v2.
> > > >
> > > >Changelog:
> > > >
> > > >v1->v2
> > > >
> > > >The major change from v1 is I've added the in-kernel MSI-X mask emulation
> > > >support, as well as adding shortcuts for reading MSI-X table.
> > > >
> > > >I've taken Michael's advice to use mask/unmask directly, but unsure about
> > > >exporting irq_to_desc() for module...
> > > >
> > > >Also add flush_work() according to Marcelo's comments.
> > > >
> > > 
> > > Any performance numbers?  What are the affected guests?  just RHEL
> > > 4, or any others?
> > 
> > Likely any old linux.
> > 
> > > Alex, Michael, how would you do this with vfio?
> > 
> > With current VFIO we would catch mask writes in qemu and
> > call a KVM ioctl. We would also need an ioctl to retrieve
> > pending bits long term.
> 
> Ugh, no.  VFIO us currently independent of KVM.  I'd like to keep it
> that way.  We'll need to optimize interrupt injection and eoi via KVM,
> but it should only be a performance optimization, not a functional
> requirement.

So ideally masking would be optimizable too.

> It would probably make sense to request a mask/unmask ioctl in VFIO for
> MSI-X, then perhaps the pending bits would only support read/write (no
> mmap), so we could avoid an ioctl there.

Why not mask/unmask with a write?

> > I think that it is unfortunate that we need to do this in userspace
> > while rest of configuration is done in kernel.
> > I would be much happier with userspace simply forwarding
> > everything to VFIO, so emulation does not have to
> > be split. That would be a clean interface: just mmap
> > MSIX BAR and forget about it.
> > 
> > If instead of eventfd we had a file descriptor that can pass vector
> > information from vfio to kvm and back, that would fix it,
> > as we would not need to set us GSIs at all,
> > and not need for userspace to handle MSIX specially.
> 
> Sounds good except that I'd like to get away from forcing device
> assignment interfaces into KVM.
> 
> Alex

How to split up the modules is an implementation detail though.

-- 
MST

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 10:44   ` Michael S. Tsirkin
  2010-10-20 10:59     ` Avi Kivity
@ 2010-10-20 14:47     ` Alex Williamson
  2010-10-20 14:46       ` Michael S. Tsirkin
  2010-10-20 15:23       ` Avi Kivity
  1 sibling, 2 replies; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 14:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 12:44 +0200, Michael S. Tsirkin wrote:
> On Wed, Oct 20, 2010 at 11:51:01AM +0200, Avi Kivity wrote:
> >  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> > >Here is v2.
> > >
> > >Changelog:
> > >
> > >v1->v2
> > >
> > >The major change from v1 is I've added the in-kernel MSI-X mask emulation
> > >support, as well as adding shortcuts for reading MSI-X table.
> > >
> > >I've taken Michael's advice to use mask/unmask directly, but unsure about
> > >exporting irq_to_desc() for module...
> > >
> > >Also add flush_work() according to Marcelo's comments.
> > >
> > 
> > Any performance numbers?  What are the affected guests?  just RHEL
> > 4, or any others?
> 
> Likely any old linux.
> 
> > Alex, Michael, how would you do this with vfio?
> 
> With current VFIO we would catch mask writes in qemu and
> call a KVM ioctl. We would also need an ioctl to retrieve
> pending bits long term.

Ugh, no.  VFIO is currently independent of KVM.  I'd like to keep it
that way.  We'll need to optimize interrupt injection and eoi via KVM,
but it should only be a performance optimization, not a functional
requirement.

It would probably make sense to request a mask/unmask ioctl in VFIO for
MSI-X, then perhaps the pending bits would only support read/write (no
mmap), so we could avoid an ioctl there.

> I think that it is unfortunate that we need to do this in userspace
> while rest of configuration is done in kernel.
> I would be much happier with userspace simply forwarding
> everything to VFIO, so emulation does not have to
> be split. That would be a clean interface: just mmap
> MSIX BAR and forget about it.
> 
> If instead of eventfd we had a file descriptor that can pass vector
> information from vfio to kvm and back, that would fix it,
> as we would not need to set us GSIs at all,
> and not need for userspace to handle MSIX specially.

Sounds good except that I'd like to get away from forcing device
assignment interfaces into KVM.

Alex


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 14:58         ` Alex Williamson
@ 2010-10-20 14:58           ` Michael S. Tsirkin
  2010-10-20 15:12             ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 14:58 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 08:58:38AM -0600, Alex Williamson wrote:
> On Wed, 2010-10-20 at 15:43 +0200, Michael S. Tsirkin wrote:
> > On Wed, Oct 20, 2010 at 12:59:42PM +0200, Avi Kivity wrote:
> > > How far away is vfio?  If it's merged soon, we might avoid making
> > > changes to the old assigned device infrastructure and instead update
> > > vfio.
> > 
> > Hard to be sure, hopefully 2.6.38 material.
> > 
> > Some issues off the top of my head are
> > - readonly/virtualized table correctness
> > 	hopefully will start converging now that
> > 	we are switching to standard registers from pci_regs.h
> > - work out some capability negotiation mechanism so userspace/kernel
> >   can detect bug fixes/missing features and recover or fail gracefully
> > - with multiple assigned devices in a guest:
> >   I don't think we have figured out how do they share an iommu context
> 
> A single UIOMMU fd can be used for multiple devices.  The code I wrote
> supports this, but it's really only meant to support a uiommufd passed
> via the command line for libvirt usage.  Since libvirt doesn't yet
> support vfio, it's never been tested.

I think there was some issue where programming was
done through the vfio fd and got lost if you hot-unplug the
device which programmed it.  Maybe I'm wrong ...

> > - maybe: reset handling (flr support) - need to look a it
> 
> I'd add that for the qemu vfio driver, we need to work out the KVM
> interrupt optimizations, otherwise we suffer extra latency vs current
> code.

You see this in some benchmark?

> > 
> > > On the other hand, changes to the old infrastructure are much
> > > more amenable to backporting for long term support distro kernels,
> > > so we may need to actively develop both for a while.
> 
> Yep, I think we'll need to continue development and probably maintain
> them both for a while.
> 
> Alex

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 13:43       ` Michael S. Tsirkin
@ 2010-10-20 14:58         ` Alex Williamson
  2010-10-20 14:58           ` Michael S. Tsirkin
  2010-10-20 15:17         ` Avi Kivity
  1 sibling, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 14:58 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 15:43 +0200, Michael S. Tsirkin wrote:
> On Wed, Oct 20, 2010 at 12:59:42PM +0200, Avi Kivity wrote:
> > How far away is vfio?  If it's merged soon, we might avoid making
> > changes to the old assigned device infrastructure and instead update
> > vfio.
> 
> Hard to be sure, hopefully 2.6.38 material.
> 
> Some issues off the top of my head are
> - readonly/virtualized table correctness
> 	hopefully will start converging now that
> 	we are switching to standard registers from pci_regs.h
> - work out some capability negotiation mechanism so userspace/kernel
>   can detect bug fixes/missing features and recover or fail gracefully
> - with multiple assigned devices in a guest:
>   I don't think we have figured out how do they share an iommu context

A single UIOMMU fd can be used for multiple devices.  The code I wrote
supports this, but it's really only meant to support a uiommufd passed
via the command line for libvirt usage.  Since libvirt doesn't yet
support vfio, it's never been tested.

> - maybe: reset handling (flr support) - need to look a it

I'd add that for the qemu vfio driver, we need to work out the KVM
interrupt optimizations, otherwise we suffer extra latency vs current
code.

> 
> > On the other hand, changes to the old infrastructure are much
> > more amenable to backporting for long term support distro kernels,
> > so we may need to actively develop both for a while.

Yep, I think we'll need to continue development and probably maintain
them both for a while.

Alex


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 14:46       ` Michael S. Tsirkin
@ 2010-10-20 15:07         ` Alex Williamson
  2010-10-20 15:13           ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 15:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 16:46 +0200, Michael S. Tsirkin wrote:
> On Wed, Oct 20, 2010 at 08:47:14AM -0600, Alex Williamson wrote:
> > On Wed, 2010-10-20 at 12:44 +0200, Michael S. Tsirkin wrote:
> > > On Wed, Oct 20, 2010 at 11:51:01AM +0200, Avi Kivity wrote:
> > > >  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> > > > >Here is v2.
> > > > >
> > > > >Changelog:
> > > > >
> > > > >v1->v2
> > > > >
> > > > >The major change from v1 is I've added the in-kernel MSI-X mask emulation
> > > > >support, as well as adding shortcuts for reading MSI-X table.
> > > > >
> > > > >I've taken Michael's advice to use mask/unmask directly, but unsure about
> > > > >exporting irq_to_desc() for module...
> > > > >
> > > > >Also add flush_work() according to Marcelo's comments.
> > > > >
> > > > 
> > > > Any performance numbers?  What are the affected guests?  just RHEL
> > > > 4, or any others?
> > > 
> > > Likely any old linux.
> > > 
> > > > Alex, Michael, how would you do this with vfio?
> > > 
> > > With current VFIO we would catch mask writes in qemu and
> > > call a KVM ioctl. We would also need an ioctl to retrieve
> > > pending bits long term.
> > 
> > Ugh, no.  VFIO us currently independent of KVM.  I'd like to keep it
> > that way.  We'll need to optimize interrupt injection and eoi via KVM,
> > but it should only be a performance optimization, not a functional
> > requirement.
> 
> So ideally masking would be optimizeable too.

What does KVM add to the masking?  VFIO owns the interrupt handler,
which can decide whether the interrupt is masked and set a pending bit in
an emulated PBA, or, if unmasked, send it to qemu via eventfd.  For KVM,
I think we'd just augment that last bit to relay the interrupt to KVM
for direct guest injection.
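
For illustration, a minimal sketch of that handler logic as I picture it;
this is an assumption on my part, not the actual VFIO code, and the
per-vector structure and field names are made up:

struct vfio_msix_vector {
	int index;
	unsigned long *virt_masked;	/* guest's view of the mask bits */
	unsigned long *virt_pba;	/* emulated pending bit array */
	struct eventfd_ctx *trigger;	/* eventfd qemu (or later KVM) waits on */
};

static irqreturn_t vfio_msix_handler(int irq, void *arg)
{
	struct vfio_msix_vector *vec = arg;

	if (test_bit(vec->index, vec->virt_masked)) {
		/* guest has this vector masked: only latch the emulated PBA bit */
		set_bit(vec->index, vec->virt_pba);
		return IRQ_HANDLED;
	}
	/* unmasked: signal the eventfd towards userspace */
	eventfd_signal(vec->trigger, 1);
	return IRQ_HANDLED;
}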

> > It would probably make sense to request a mask/unmask ioctl in VFIO for
> > MSI-X, then perhaps the pending bits would only support read/write (no
> > mmap), so we could avoid an ioctl there.
> 
> Why not mask/unmask with a write?

That would be possible too, only trouble is then we have QEMU
intercepting and interpreting the write as well as VFIO intercepting and
interpreting the write.  If VFIO is only masking off the mask bit,
that'd be pretty trivial though.

Alex




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 14:58           ` Michael S. Tsirkin
@ 2010-10-20 15:12             ` Alex Williamson
  0 siblings, 0 replies; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 15:12 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 16:58 +0200, Michael S. Tsirkin wrote:
> On Wed, Oct 20, 2010 at 08:58:38AM -0600, Alex Williamson wrote:
> > On Wed, 2010-10-20 at 15:43 +0200, Michael S. Tsirkin wrote:
> > > On Wed, Oct 20, 2010 at 12:59:42PM +0200, Avi Kivity wrote:
> > > > How far away is vfio?  If it's merged soon, we might avoid making
> > > > changes to the old assigned device infrastructure and instead update
> > > > vfio.
> > > 
> > > Hard to be sure, hopefully 2.6.38 material.
> > > 
> > > Some issues off the top of my head are
> > > - readonly/virtualized table correctness
> > > 	hopefully will start converging now that
> > > 	we are switching to standard registers from pci_regs.h
> > > - work out some capability negotiation mechanism so userspace/kernel
> > >   can detect bug fixes/missing features and recover or fail gracefully
> > > - with multiple assigned devices in a guest:
> > >   I don't think we have figured out how do they share an iommu context
> > 
> > A single UIOMMU fd can be used for multiple devices.  The code I wrote
> > supports this, but it's really only meant to support a uiommufd passed
> > via the command line for libvirt usage.  Since libvirt doesn't yet
> > support vfio, it's never been tested.
> 
> I think there was some issue that programming was
> done through vfio fd and got lost if you try to hot unplug
> device which programmed it. Maybe I'm wrong ...

Hasn't been tested, so there are probably bugs.  Sounds fixable though.
Sharing an IOMMU context is at least part of the design.

> > > - maybe: reset handling (flr support) - need to look a it
> > 
> > I'd add that for the qemu vfio driver, we need to work out the KVM
> > interrupt optimizations, otherwise we suffer extra latency vs current
> > code.
> 
> You see this in some benchmark?

I haven't done any formal benchmarks on VFIO, but I recall that while
both can achieve line rate on a 1G link, VFIO currently requires more
CPU to do so.  Bouncing interrupts through QEMU seems like an obvious
candidate for that.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:07         ` Alex Williamson
@ 2010-10-20 15:13           ` Michael S. Tsirkin
  2010-10-20 20:13             ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 15:13 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 09:07:12AM -0600, Alex Williamson wrote:
> On Wed, 2010-10-20 at 16:46 +0200, Michael S. Tsirkin wrote:
> > On Wed, Oct 20, 2010 at 08:47:14AM -0600, Alex Williamson wrote:
> > > On Wed, 2010-10-20 at 12:44 +0200, Michael S. Tsirkin wrote:
> > > > On Wed, Oct 20, 2010 at 11:51:01AM +0200, Avi Kivity wrote:
> > > > >  On 10/20/2010 10:26 AM, Sheng Yang wrote:
> > > > > >Here is v2.
> > > > > >
> > > > > >Changelog:
> > > > > >
> > > > > >v1->v2
> > > > > >
> > > > > >The major change from v1 is I've added the in-kernel MSI-X mask emulation
> > > > > >support, as well as adding shortcuts for reading MSI-X table.
> > > > > >
> > > > > >I've taken Michael's advice to use mask/unmask directly, but unsure about
> > > > > >exporting irq_to_desc() for module...
> > > > > >
> > > > > >Also add flush_work() according to Marcelo's comments.
> > > > > >
> > > > > 
> > > > > Any performance numbers?  What are the affected guests?  just RHEL
> > > > > 4, or any others?
> > > > 
> > > > Likely any old linux.
> > > > 
> > > > > Alex, Michael, how would you do this with vfio?
> > > > 
> > > > With current VFIO we would catch mask writes in qemu and
> > > > call a KVM ioctl. We would also need an ioctl to retrieve
> > > > pending bits long term.
> > > 
> > > Ugh, no.  VFIO us currently independent of KVM.  I'd like to keep it
> > > that way.  We'll need to optimize interrupt injection and eoi via KVM,
> > > but it should only be a performance optimization, not a functional
> > > requirement.
> > 
> > So ideally masking would be optimizeable too.
> 
> What does KVM add to the masking?  VFIO owns the interrupt handler,
> which can decide if the interrupt is masked and set a pending bit in an
> emulated PBA, or if unmasked it sends it to qemu via eventfd.  For KVM,
> I think we'd just augment that last bit to relay the interrupt to KVM
> for direct guest injection.

Right.  Masking in KVM is good for old-style assignment and for vhost-net.

> > > It would probably make sense to request a mask/unmask ioctl in VFIO for
> > > MSI-X, then perhaps the pending bits would only support read/write (no
> > > mmap), so we could avoid an ioctl there.
> > 
> > Why not mask/unmask with a write?
> 
> That would be possible too, only trouble is then we have QEMU
> intercepting and interpreting the write as well as VFIO intercepting and
> interpreting the write.
> If VFIO is only masking off the mask bit,
> that'd be pretty trivial though.
> 
> Alex

I just mean write() instead of an ioctl()

-- 
MST

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 13:43       ` Michael S. Tsirkin
  2010-10-20 14:58         ` Alex Williamson
@ 2010-10-20 15:17         ` Avi Kivity
  2010-10-20 15:22           ` Alex Williamson
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-10-20 15:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Sheng Yang, Marcelo Tosatti, kvm, Alex Williamson

  On 10/20/2010 03:43 PM, Michael S. Tsirkin wrote:
> >  >If instead of eventfd we had a file descriptor that can pass vector
> >  >information from vfio to kvm and back, that would fix it,
> >  >as we would not need to set us GSIs at all,
> >  >and not need for userspace to handle MSIX specially.
> >  >
> >
> >  But if we emulate the entire msix bar in vfio, that's not needed, right?
>
> Yes, I think it is. How does kvm know which interrupt to inject?
> Either vfio needs to pass that info to qemu and qemu would pass it
> to kvm, or vfio would have some way to pass that info to kvm
> directly.

Wait.  We can't emulate the BAR in vfio, we have to emulate it in kvm 
where we emulate the write instruction.  We then need to tell vfio, 
perhaps via userspace, that masking state has changed.

Seems very intrusive.

We can perhaps use the new semaphore capability in eventfd to pass 
mask/unmask information from kvm to userspace.  We'd usually hook it to 
the corresponding vfio facility.
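
For reference, the semaphore mode I mean is the EFD_SEMAPHORE flag, where
each read() consumes a single count instead of the whole value; a trivial
userspace sketch (function names are just for illustration):

#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>

int make_mask_counter(void)
{
	/* each read() decrements the counter by one */
	return eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
}

/* returns 1 if a count was consumed, 0 if nothing was pending */
int consume_one(int efd)
{
	uint64_t one;

	return read(efd, &one, sizeof(one)) == sizeof(one);
}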

> >  How far away is vfio?  If it's merged soon, we might avoid making
> >  changes to the old assigned device infrastructure and instead update
> >  vfio.
>
> Hard to be sure, hopefully 2.6.38 material.

My question was how to implement this with vfio.  But your list is 
interesting as well.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:17         ` Avi Kivity
@ 2010-10-20 15:22           ` Alex Williamson
  2010-10-20 15:26             ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 15:22 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 17:17 +0200, Avi Kivity wrote:
> On 10/20/2010 03:43 PM, Michael S. Tsirkin wrote:
> > >  >If instead of eventfd we had a file descriptor that can pass vector
> > >  >information from vfio to kvm and back, that would fix it,
> > >  >as we would not need to set us GSIs at all,
> > >  >and not need for userspace to handle MSIX specially.
> > >  >
> > >
> > >  But if we emulate the entire msix bar in vfio, that's not needed, right?
> >
> > Yes, I think it is. How does kvm know which interrupt to inject?
> > Either vfio needs to pass that info to qemu and qemu would pass it
> > to kvm, or vfio would have some way to pass that info to kvm
> > directly.
> 
> Wait.  We can't emulate the BAR in vfio, we have to emulate it in kvm 
> where we emulate the write instruction.  We then need to tell vfio, 
> perhaps via userspace, that masking state has changed.
> 
> Seems very intrusive.

We wouldn't direct map the vector table or pending bits, so we could
trap and emulate in qemu, which could then call into reads/writes in
vfio.

Alex

> We can perhaps use the new semaphore capability in eventfd to pass 
> mask/unmask information from kvm to userspace.  We'd usually hook it to 
> the corresponding vfio facility.
> 
> > >  How far away is vfio?  If it's merged soon, we might avoid making
> > >  changes to the old assigned device infrastructure and instead update
> > >  vfio.
> >
> > Hard to be sure, hopefully 2.6.38 material.
> 
> My question was how to implement this with vfio.  But your list is 
> interesting as well.
> 
> 




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 14:47     ` Alex Williamson
  2010-10-20 14:46       ` Michael S. Tsirkin
@ 2010-10-20 15:23       ` Avi Kivity
  2010-10-20 15:38         ` Alex Williamson
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-10-20 15:23 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Michael S. Tsirkin, Sheng Yang, Marcelo Tosatti, kvm

  On 10/20/2010 04:47 PM, Alex Williamson wrote:
> >
> >  With current VFIO we would catch mask writes in qemu and
> >  call a KVM ioctl. We would also need an ioctl to retrieve
> >  pending bits long term.
>
> Ugh, no.  VFIO us currently independent of KVM.  I'd like to keep it
> that way.

Me, too.  Perhaps even more than you.

> We'll need to optimize interrupt injection and eoi via KVM,
> but it should only be a performance optimization, not a functional
> requirement.

For level-triggered interrupts only, yes?  MSI EOI does not involve any 
device or interrupt controller visible action?

> It would probably make sense to request a mask/unmask ioctl in VFIO for
> MSI-X, then perhaps the pending bits would only support read/write (no
> mmap), so we could avoid an ioctl there.

I would much like to see in-band information (which mask/unmask is for 
older Linux) done via eventfds so userspace is not involved.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:22           ` Alex Williamson
@ 2010-10-20 15:26             ` Avi Kivity
  2010-10-20 15:38               ` Alex Williamson
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-10-20 15:26 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Michael S. Tsirkin, Sheng Yang, Marcelo Tosatti, kvm

  On 10/20/2010 05:22 PM, Alex Williamson wrote:
> On Wed, 2010-10-20 at 17:17 +0200, Avi Kivity wrote:
> >  On 10/20/2010 03:43 PM, Michael S. Tsirkin wrote:
> >  >  >   >If instead of eventfd we had a file descriptor that can pass vector
> >  >  >   >information from vfio to kvm and back, that would fix it,
> >  >  >   >as we would not need to set us GSIs at all,
> >  >  >   >and not need for userspace to handle MSIX specially.
> >  >  >   >
> >  >  >
> >  >  >   But if we emulate the entire msix bar in vfio, that's not needed, right?
> >  >
> >  >  Yes, I think it is. How does kvm know which interrupt to inject?
> >  >  Either vfio needs to pass that info to qemu and qemu would pass it
> >  >  to kvm, or vfio would have some way to pass that info to kvm
> >  >  directly.
> >
> >  Wait.  We can't emulate the BAR in vfio, we have to emulate it in kvm
> >  where we emulate the write instruction.  We then need to tell vfio,
> >  perhaps via userspace, that masking state has changed.
> >
> >  Seems very intrusive.
>
> We wouldn't direct map the vector table or pending bits, so we could
> trap and emulate in qemu, which could then call into reads/writes in
> vfio.
>

That's 100% unintrusive for kvm, but that's what we do today, which is 
deemed too slow.

Another option is to fake an interrupt remapping device and do a direct 
map.  Will those older guests recognize and use it?  I imagine not.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:26             ` Avi Kivity
@ 2010-10-20 15:38               ` Alex Williamson
  0 siblings, 0 replies; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 15:38 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 17:26 +0200, Avi Kivity wrote:
> On 10/20/2010 05:22 PM, Alex Williamson wrote:
> > On Wed, 2010-10-20 at 17:17 +0200, Avi Kivity wrote:
> > >  On 10/20/2010 03:43 PM, Michael S. Tsirkin wrote:
> > >  >  >   >If instead of eventfd we had a file descriptor that can pass vector
> > >  >  >   >information from vfio to kvm and back, that would fix it,
> > >  >  >   >as we would not need to set us GSIs at all,
> > >  >  >   >and not need for userspace to handle MSIX specially.
> > >  >  >   >
> > >  >  >
> > >  >  >   But if we emulate the entire msix bar in vfio, that's not needed, right?
> > >  >
> > >  >  Yes, I think it is. How does kvm know which interrupt to inject?
> > >  >  Either vfio needs to pass that info to qemu and qemu would pass it
> > >  >  to kvm, or vfio would have some way to pass that info to kvm
> > >  >  directly.
> > >
> > >  Wait.  We can't emulate the BAR in vfio, we have to emulate it in kvm
> > >  where we emulate the write instruction.  We then need to tell vfio,
> > >  perhaps via userspace, that masking state has changed.
> > >
> > >  Seems very intrusive.
> >
> > We wouldn't direct map the vector table or pending bits, so we could
> > trap and emulate in qemu, which could then call into reads/writes in
> > vfio.
> >
> 
> That's 100% unintrusive for kvm, but that's what we do today, which is 
> deemed too slow.
> 
> Another option is to fake an interrupt remapping device and do a direct 
> map.  Will those older guests recognize and use it?  I imagine not.

Nope, I think it's a non-starter to expect anything but the very latest
guests to have support for that.

Alex


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:23       ` Avi Kivity
@ 2010-10-20 15:38         ` Alex Williamson
  2010-10-20 15:54           ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 15:38 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Michael S. Tsirkin, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 17:23 +0200, Avi Kivity wrote:
> On 10/20/2010 04:47 PM, Alex Williamson wrote:
> > >
> > >  With current VFIO we would catch mask writes in qemu and
> > >  call a KVM ioctl. We would also need an ioctl to retrieve
> > >  pending bits long term.
> >
> > Ugh, no.  VFIO us currently independent of KVM.  I'd like to keep it
> > that way.
> 
> Me, too.  Perhaps even more than you.
> 
> > We'll need to optimize interrupt injection and eoi via KVM,
> > but it should only be a performance optimization, not a functional
> > requirement.
> 
> For level-triggered interrupts only, yes?  MSI EOI does not involve any 
> device or interrupt controller visible action?

Right, the EOI is for legacy interrupts only.  Perhaps we don't care
enough about performance of those to route it through KVM so long as we
can co-exist with the KVM APIC.  MSIs are currently still bouncing from
VFIO to QEMU to the guest, which seems inefficient.

> > It would probably make sense to request a mask/unmask ioctl in VFIO for
> > MSI-X, then perhaps the pending bits would only support read/write (no
> > mmap), so we could avoid an ioctl there.
> 
> I would much like to see in-band information (which mask/unmask is for 
> older Linux) done via eventfds so userspace is not involved.

Hmm, I'm not sure how to do that yet.

Alex




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:38         ` Alex Williamson
@ 2010-10-20 15:54           ` Avi Kivity
  2010-10-20 15:59             ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-10-20 15:54 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Michael S. Tsirkin, Sheng Yang, Marcelo Tosatti, kvm

  On 10/20/2010 05:38 PM, Alex Williamson wrote:
> >  >  We'll need to optimize interrupt injection and eoi via KVM,
> >  >  but it should only be a performance optimization, not a functional
> >  >  requirement.
> >
> >  For level-triggered interrupts only, yes?  MSI EOI does not involve any
> >  device or interrupt controller visible action?
>
> Right, the EOI is for legacy interrupts only.  Perhaps we don't care
> enough about performance of those to route it through KVM so long as we
> can co-exist with the KVM APIC.

We will need a way to get the EOI out to userspace, then.

Perhaps that will give us motivation to split the ioapic from the kernel 
(though I suppose fear of regressions will stop us).

Anyway EOI notifiers are also useful for timekeeping.

>    MSIs are currently still bouncing from
> VFIO to QEMU to the guest, which seems inefficient.

Why is that?  Can't you use irqfd like vhost-net?

> >  >  It would probably make sense to request a mask/unmask ioctl in VFIO for
> >  >  MSI-X, then perhaps the pending bits would only support read/write (no
> >  >  mmap), so we could avoid an ioctl there.
> >
> >  I would much like to see in-band information (which mask/unmask is for
> >  older Linux) done via eventfds so userspace is not involved.
>
> Hmm, I'm not sure how to do that yet.

One ugly way is to use two eventfds, one counting mask events, one 
counting unmask events.  The difference is the value of the masked bit.

Another option is to use 
http://permalink.gmane.org/gmane.linux.kernel.commits.head/188038 which 
allows kvm to maintain a counter using eventfd.  The vfio state machine 
then looks like:

interrupt:
   if counter == 0 mask interrupts, set pending bit, wait for counter to 
become 1
   if counter == 1 forward it
counter becomes 1: unmask, forward interrupt if pending
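
A rough sketch of that state machine, assuming the counter is the unmask
count kvm maintains via the semaphore-style eventfd (all names here are
invented for illustration):

struct vec_state {
	atomic_t counter;		/* unmask count mirrored from the eventfd */
	bool pending;
	struct eventfd_ctx *trigger;	/* forwards the interrupt to kvm/qemu */
};

static irqreturn_t vfio_msix_irq(int irq, void *arg)
{
	struct vec_state *v = arg;

	if (atomic_read(&v->counter) == 0) {
		/* counter == 0: mask at the irq level, mark pending */
		disable_irq_nosync(irq);
		v->pending = true;
	} else {
		/* counter == 1: forward it */
		eventfd_signal(v->trigger, 1);
	}
	return IRQ_HANDLED;
}

/* called when the counter becomes 1 again */
static void vec_unmasked(struct vec_state *v, int irq)
{
	enable_irq(irq);
	if (v->pending) {
		v->pending = false;
		eventfd_signal(v->trigger, 1);	/* forward pending interrupt */
	}
}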

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:54           ` Avi Kivity
@ 2010-10-20 15:59             ` Michael S. Tsirkin
  2010-10-20 16:13               ` Avi Kivity
  2010-10-20 18:31               ` Alex Williamson
  0 siblings, 2 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 15:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, Sheng Yang, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 05:54:20PM +0200, Avi Kivity wrote:
>  On 10/20/2010 05:38 PM, Alex Williamson wrote:
> >>  >  We'll need to optimize interrupt injection and eoi via KVM,
> >>  >  but it should only be a performance optimization, not a functional
> >>  >  requirement.
> >>
> >>  For level-triggered interrupts only, yes?  MSI EOI does not involve any
> >>  device or interrupt controller visible action?
> >
> >Right, the EOI is for legacy interrupts only.  Perhaps we don't care
> >enough about performance of those to route it through KVM so long as we
> >can co-exist with the KVM APIC.
> 
> We will need a way to get the EOI out to userspace, then.
> 
> Perhaps that will give us motivation to split the ioapic from the
> kernel (though I suppose fear of regressions will stop us).
> 
> Anyway EOI notifiers are also useful for timekeeping.
> 
> >   MSIs are currently still bouncing from
> >VFIO to QEMU to the guest, which seems inefficient.
> 
> Why is that?  Can't you use irqfd like vhost-net?
> 
> >>  >  It would probably make sense to request a mask/unmask ioctl in VFIO for
> >>  >  MSI-X, then perhaps the pending bits would only support read/write (no
> >>  >  mmap), so we could avoid an ioctl there.
> >>
> >>  I would much like to see in-band information (which mask/unmask is for
> >>  older Linux) done via eventfds so userspace is not involved.
> >
> >Hmm, I'm not sure how to do that yet.
> 
> One ugly way is to use two eventfds, one counting mask events, one
> counting unmask events.  The difference is the value if the masked
> bit.
> 
> Another option is to use
> http://permalink.gmane.org/gmane.linux.kernel.commits.head/188038
> which allows kvm to maintain a counter using eventfd.  The vfio
> state machine then looks like:
> 
> interrupt:
>   if counter == 0 mask interrupts, set pending bit, wait for counter
> to become 1
>   if counter == 1 forward it
> counter becomes 1: unmask, forward interrupt if pending

One issue is that this page has a ton of other info.
KVM would have to keep all that in kernel...

-- 
MST

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:59             ` Michael S. Tsirkin
@ 2010-10-20 16:13               ` Avi Kivity
  2010-10-20 17:11                 ` Michael S. Tsirkin
  2010-10-20 18:31               ` Alex Williamson
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-10-20 16:13 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Alex Williamson, Sheng Yang, Marcelo Tosatti, kvm

  On 10/20/2010 05:59 PM, Michael S. Tsirkin wrote:
> >
> >  One ugly way is to use two eventfds, one counting mask events, one
> >  counting unmask events.  The difference is the value if the masked
> >  bit.
> >
> >  Another option is to use
> >  http://permalink.gmane.org/gmane.linux.kernel.commits.head/188038
> >  which allows kvm to maintain a counter using eventfd.  The vfio
> >  state machine then looks like:
> >
> >  interrupt:
> >    if counter == 0 mask interrupts, set pending bit, wait for counter
> >  to become 1
> >    if counter == 1 forward it
> >  counter becomes 1: unmask, forward interrupt if pending
>
> One issue is that this page has a ton of other info.
> KVM would have to keep all that in kernel...

Or we'd make kvm an "accelerator" and pass through other accesses to 
userspace.  I don't like it because it's a blurry programming model.  At 
the very least it has to be documented very well.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 16:13               ` Avi Kivity
@ 2010-10-20 17:11                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 17:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, Sheng Yang, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 06:13:53PM +0200, Avi Kivity wrote:
>  On 10/20/2010 05:59 PM, Michael S. Tsirkin wrote:
> >>
> >>  One ugly way is to use two eventfds, one counting mask events, one
> >>  counting unmask events.  The difference is the value if the masked
> >>  bit.
> >>
> >>  Another option is to use
> >>  http://permalink.gmane.org/gmane.linux.kernel.commits.head/188038
> >>  which allows kvm to maintain a counter using eventfd.  The vfio
> >>  state machine then looks like:
> >>
> >>  interrupt:
> >>    if counter == 0 mask interrupts, set pending bit, wait for counter
> >>  to become 1
> >>    if counter == 1 forward it
> >>  counter becomes 1: unmask, forward interrupt if pending
> >
> >One issue is that this page has a ton of other info.
> >KVM would have to keep all that in kernel...
> 
> Or we'd make kvm an "accelerator" and pass through other accesses to
> userspace.  I don't like it because it's a blurry programming model.

Not sure I understand.

> At the very least it has to be documented very well.
> 
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:59             ` Michael S. Tsirkin
  2010-10-20 16:13               ` Avi Kivity
@ 2010-10-20 18:31               ` Alex Williamson
  1 sibling, 0 replies; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 18:31 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 17:59 +0200, Michael S. Tsirkin wrote:
> On Wed, Oct 20, 2010 at 05:54:20PM +0200, Avi Kivity wrote:
> >  On 10/20/2010 05:38 PM, Alex Williamson wrote:
> > >>  >  We'll need to optimize interrupt injection and eoi via KVM,
> > >>  >  but it should only be a performance optimization, not a functional
> > >>  >  requirement.
> > >>
> > >>  For level-triggered interrupts only, yes?  MSI EOI does not involve any
> > >>  device or interrupt controller visible action?
> > >
> > >Right, the EOI is for legacy interrupts only.  Perhaps we don't care
> > >enough about performance of those to route it through KVM so long as we
> > >can co-exist with the KVM APIC.
> > 
> > We will need a way to get the EOI out to userspace, then.
> > 
> > Perhaps that will give us motivation to split the ioapic from the
> > kernel (though I suppose fear of regressions will stop us).
> > 
> > Anyway EOI notifiers are also useful for timekeeping.
> > 
> > >   MSIs are currently still bouncing from
> > >VFIO to QEMU to the guest, which seems inefficient.
> > 
> > Why is that?  Can't you use irqfd like vhost-net?

Hope so, just haven't done it yet.

> > >>  >  It would probably make sense to request a mask/unmask ioctl in VFIO for
> > >>  >  MSI-X, then perhaps the pending bits would only support read/write (no
> > >>  >  mmap), so we could avoid an ioctl there.
> > >>
> > >>  I would much like to see in-band information (which mask/unmask is for
> > >>  older Linux) done via eventfds so userspace is not involved.
> > >
> > >Hmm, I'm not sure how to do that yet.
> > 
> > One ugly way is to use two eventfds, one counting mask events, one
> > counting unmask events.  The difference is the value if the masked
> > bit.
> > 
> > Another option is to use
> > http://permalink.gmane.org/gmane.linux.kernel.commits.head/188038
> > which allows kvm to maintain a counter using eventfd.  The vfio
> > state machine then looks like:
> > 
> > interrupt:
> >   if counter == 0 mask interrupts, set pending bit, wait for counter
> > to become 1
> >   if counter == 1 forward it
> > counter becomes 1: unmask, forward interrupt if pending
> 
> One issue is that this page has a ton of other info.
> KVM would have to keep all that in kernel...

Yeah, I'm a little confused about who is setting and who is receiving
the eventfd.  KVM is going to trap that page, filter out the mask bit,
do the eventfd thing for the msix vector table, and forward the rest to
qemu?  I'd like to see some benchmarks to know if that's really
worthwhile.

Alex


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
                   ` (8 preceding siblings ...)
  2010-10-20  9:51 ` [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Avi Kivity
@ 2010-10-20 19:02 ` Marcelo Tosatti
  2010-10-21  7:10   ` Sheng Yang
  2010-10-20 22:20 ` Michael S. Tsirkin
  10 siblings, 1 reply; 66+ messages in thread
From: Marcelo Tosatti @ 2010-10-20 19:02 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, kvm, Michael S. Tsirkin

On Wed, Oct 20, 2010 at 04:26:24PM +0800, Sheng Yang wrote:
> Here is v2.
> 
> Changelog:
> 
> v1->v2
> 
> The major change from v1 is I've added the in-kernel MSI-X mask emulation
> support, as well as adding shortcuts for reading MSI-X table.
> 
> I've taken Michael's advice to use mask/unmask directly, but unsure about
> exporting irq_to_desc() for module...
> 
> Also add flush_work() according to Marcelo's comments.
> 
> Sheng Yang (8):
>   PCI: MSI: Move MSI-X entry definition to pci_regs.h
>   irq: Export irq_to_desc() to modules
>   KVM: x86: Enable ENABLE_CAP capability for x86
>   KVM: Move struct kvm_io_device to kvm_host.h
>   KVM: Add kvm_get_irq_routing_entry() func
>   KVM: assigned dev: Preparation for mask support in userspace
>   KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
>   KVM: Emulation MSI-X mask bits for assigned devices

Why is the current scheme, without msix per-vector mask support,
functional at all?  Luck?



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 15:13           ` Michael S. Tsirkin
@ 2010-10-20 20:13             ` Alex Williamson
  2010-10-20 22:06               ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2010-10-20 20:13 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, 2010-10-20 at 17:13 +0200, Michael S. Tsirkin wrote:
> On Wed, Oct 20, 2010 at 09:07:12AM -0600, Alex Williamson wrote:
> > On Wed, 2010-10-20 at 16:46 +0200, Michael S. Tsirkin wrote:
> > > On Wed, Oct 20, 2010 at 08:47:14AM -0600, Alex Williamson wrote:
> > > > It would probably make sense to request a mask/unmask ioctl in VFIO for
> > > > MSI-X, then perhaps the pending bits would only support read/write (no
> > > > mmap), so we could avoid an ioctl there.
> > > 
> > > Why not mask/unmask with a write?
> > 
> > That would be possible too, only trouble is then we have QEMU
> > intercepting and interpreting the write as well as VFIO intercepting and
> > interpreting the write.
> > If VFIO is only masking off the mask bit,
> > that'd be pretty trivial though.
> 
> I just mean write() instead of an ioctl()

Hmm, looking back through my vfio driver, I actually have some code that
passes guest writes of the vector control field down to vfio.  With
interrupt remapping support, vfio should pass these through to the device,
masking the interrupt at the source, so we don't even need pending bit
emulation.  Then we just need to make sure we only filter out the vector
table for read/write of the page it lives on so we can support the PBA
being on the same page.
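
Something like the check below is roughly what I have in mind for that
filtering; only a sketch, assuming qemu already knows the table offset
and number of entries from the MSI-X capability (not code from my
driver):

/* true if the access overlaps the vector table and must be emulated;
 * anything else on the page, e.g. a PBA sharing it, can be forwarded
 * to vfio as a plain read/write */
static bool msix_access_hits_table(uint64_t table_offset, int nr_entries,
				   uint64_t addr, int len)
{
	uint64_t table_end = table_offset + nr_entries * 16; /* entry size */

	return addr < table_end && addr + len > table_offset;
}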

Alex




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 20:13             ` Alex Williamson
@ 2010-10-20 22:06               ` Michael S. Tsirkin
  0 siblings, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 22:06 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Avi Kivity, Sheng Yang, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 02:13:01PM -0600, Alex Williamson wrote:
> On Wed, 2010-10-20 at 17:13 +0200, Michael S. Tsirkin wrote:
> > On Wed, Oct 20, 2010 at 09:07:12AM -0600, Alex Williamson wrote:
> > > On Wed, 2010-10-20 at 16:46 +0200, Michael S. Tsirkin wrote:
> > > > On Wed, Oct 20, 2010 at 08:47:14AM -0600, Alex Williamson wrote:
> > > > > It would probably make sense to request a mask/unmask ioctl in VFIO for
> > > > > MSI-X, then perhaps the pending bits would only support read/write (no
> > > > > mmap), so we could avoid an ioctl there.
> > > > 
> > > > Why not mask/unmask with a write?
> > > 
> > > That would be possible too, only trouble is then we have QEMU
> > > intercepting and interpreting the write as well as VFIO intercepting and
> > > interpreting the write.
> > > If VFIO is only masking off the mask bit,
> > > that'd be pretty trivial though.
> > 
> > I just mean write() instead of an ioctl()
> 
> Hmm, looking back through my vfio driver, I actually have some code that
> passes guest writes of the vector control field down to vfio.  With
> interrupt remapping support, vfio should pass these to the device
> masking the interrupt at the source so we don't even need pending bit
> emulation.

No, that would conflict with the kernel using the mask bits.
If we do this, direct access is very wrong; we must use kernel APIs for this.

>  Then we just need to make sure we only filter out the vector
> table for read/write of the page it lives on so we can support the PBA
> being on the same page.
> 
> Alex
> 
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
                   ` (9 preceding siblings ...)
  2010-10-20 19:02 ` Marcelo Tosatti
@ 2010-10-20 22:20 ` Michael S. Tsirkin
  10 siblings, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 22:20 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 04:26:24PM +0800, Sheng Yang wrote:
> Here is v2.
> 
> Changelog:
> 
> v1->v2
> 
> The major change from v1 is I've added the in-kernel MSI-X mask emulation
> support, as well as adding shortcuts for reading MSI-X table.

The major thing everyone here is asking is whether this in-kernel
emulation can be shown to improve performance as measured by some
guest/benchmark.  Do you know?

-- 
MST

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-20  8:26 ` [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices Sheng Yang
  2010-10-20  9:49   ` Avi Kivity
@ 2010-10-20 22:24   ` Michael S. Tsirkin
  2010-10-21  8:30   ` Sheng Yang
  2 siblings, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 22:24 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 04:26:32PM +0800, Sheng Yang wrote:
> This patch enables per-vector masking for assigned devices using MSI-X.
> 
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> ---
>  Documentation/kvm/api.txt |   22 ++++++++++++++++
>  arch/x86/kvm/x86.c        |    6 ++++
>  include/linux/kvm.h       |    8 +++++-
>  virt/kvm/assigned-dev.c   |   60 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 95 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> index d82d637..f324a50 100644
> --- a/Documentation/kvm/api.txt
> +++ b/Documentation/kvm/api.txt
> @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
>  If any additional field gets added to this structure later on, a bit for that
>  additional piece of information will be set in the flags bitmap.
>  
> +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> +
> +Capability: KVM_CAP_DEVICE_MSIX_MASK
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_assigned_msix_mmio (in)
> +Returns: 0 on success, !0 on error
> +
> +struct kvm_assigned_msix_mmio {
> +	/* Assigned device's ID */
> +	__u32 assigned_dev_id;
> +	/* MSI-X table MMIO address */
> +	__u64 base_addr;
> +	/* Must be 0 */
> +	__u32 flags;
> +	/* Must be 0, reserved for future use */
> +	__u64 reserved;
> +};
> +
> +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
> +mask bit in the kernel.
> +
>  5. The kvm_run structure
>  
>  Application code obtains a pointer to the kvm_run structure by
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fc62546..ba07a2f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
>  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
>  	case KVM_CAP_XSAVE:
>  	case KVM_CAP_ENABLE_CAP:
> +	case KVM_CAP_DEVICE_MSIX_EXT:
> +	case KVM_CAP_DEVICE_MSIX_MASK:
>  		r = 1;
>  		break;
>  	case KVM_CAP_COALESCED_MMIO:
> @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>  		return -EINVAL;
>  
>  	switch (cap->cap) {
> +	case KVM_CAP_DEVICE_MSIX_EXT:
> +		vcpu->kvm->arch.msix_flags_enabled = true;
> +		r = 0;
> +		break;
>  	default:
>  		r = -EINVAL;
>  		break;
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 0a7bd34..1494ed0 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
>  #endif
>  #define KVM_CAP_PPC_GET_PVINFO 57
>  #define KVM_CAP_PPC_IRQ_LEVEL 58
> +#ifdef __KVM_HAVE_MSIX
> +#define KVM_CAP_DEVICE_MSIX_EXT 59
> +#define KVM_CAP_DEVICE_MSIX_MASK 60
> +#endif
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -671,6 +675,8 @@ struct kvm_clock_data {
>  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
>  #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
>  #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> +#define KVM_ASSIGN_REG_MSIX_MMIO  _IOW(KVMIO,  0x7d, \
> +					struct kvm_assigned_msix_mmio)
>  /* Available with KVM_CAP_PIT_STATE2 */
>  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
>  #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
> @@ -802,7 +808,7 @@ struct kvm_assigned_msix_mmio {
>  	__u32 assigned_dev_id;
>  	__u64 base_addr;
>  	__u32 flags;
> -	__u32 reserved[2];
> +	__u64 reserved;
>  };
>  
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 5d2adc4..9573194 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -17,6 +17,8 @@
>  #include <linux/pci.h>
>  #include <linux/interrupt.h>
>  #include <linux/slab.h>
> +#include <linux/irqnr.h>
> +
>  #include "irq.h"
>  
>  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
> @@ -169,6 +171,14 @@ static void deassign_host_irq(struct kvm *kvm,
>  	 */
>  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
>  		int i;
> +#ifdef __KVM_HAVE_MSIX
> +		if (assigned_dev->msix_mmio_base) {

This special-casing of 0 address is not a great idea IMHO.
Let's just use a flag.

> +			mutex_lock(&kvm->slots_lock);
> +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> +					&assigned_dev->msix_mmio_dev);
> +			mutex_unlock(&kvm->slots_lock);
> +		}
> +#endif
>  		for (i = 0; i < assigned_dev->entries_nr; i++)
>  			disable_irq_nosync(assigned_dev->
>  					   host_msix_entries[i].vector);
> @@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
>  			goto err;
>  	}
>  
> +	if (dev->msix_mmio_base) {
> +		mutex_lock(&kvm->slots_lock);
> +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> +				&dev->msix_mmio_dev);
> +		mutex_unlock(&kvm->slots_lock);
> +		if (r)
> +			goto err;
> +	}
> +
>  	return 0;
>  err:
>  	for (i -= 1; i >= 0; i--)
> @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops msix_mmio_ops = {
>  	.write    = msix_mmio_write,
>  };
>  
> +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> +				struct kvm_assigned_msix_mmio *msix_mmio)
> +{
> +	int r = 0;
> +	struct kvm_assigned_dev_kernel *adev;
> +
> +	mutex_lock(&kvm->lock);
> +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> +				      msix_mmio->assigned_dev_id);
> +	if (!adev) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +	if (msix_mmio->base_addr == 0) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +	adev->msix_mmio_base = msix_mmio->base_addr;
> +
> +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> +out:
> +	mutex_unlock(&kvm->lock);
> +
> +	return r;
> +}
>  #endif
>  
>  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
>  			goto out;
>  		break;
>  	}
> +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> +		struct kvm_assigned_msix_mmio msix_mmio;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> +			goto out;
> +
> +		r = -EINVAL;
> +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> +			goto out;
> +
> +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> +		if (r)
> +			goto out;
> +		break;
> +	}
>  #endif
>  	}
>  out:
> -- 
> 1.7.0.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-20  8:26 ` [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing Sheng Yang
  2010-10-20  9:46   ` Avi Kivity
@ 2010-10-20 22:35   ` Michael S. Tsirkin
  2010-10-21  7:44     ` Sheng Yang
  1 sibling, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-20 22:35 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Wed, Oct 20, 2010 at 04:26:31PM +0800, Sheng Yang wrote:
> It would be work with KVM_CAP_DEVICE_MSIX_MASK, which we would enable in the
> last patch.
> 
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>

Merge this with patch 8 - it does not make sense to add a bunch
of users of the field msix_mmio_base but init it in the next patch.

> ---
>  include/linux/kvm.h      |    7 +++
>  include/linux/kvm_host.h |    2 +
>  virt/kvm/assigned-dev.c  |  131 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 140 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index a699ec9..0a7bd34 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -798,4 +798,11 @@ struct kvm_assigned_msix_entry {
>  	__u16 padding[2];
>  };
>  
> +struct kvm_assigned_msix_mmio {
> +	__u32 assigned_dev_id;

I think avi commented - there's padding here.

> +	__u64 base_addr;
> +	__u32 flags;
> +	__u32 reserved[2];
> +};
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 81a6284..b67082f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -465,6 +465,8 @@ struct kvm_assigned_dev_kernel {
>  	struct pci_dev *dev;
>  	struct kvm *kvm;
>  	spinlock_t assigned_dev_lock;
> +	u64 msix_mmio_base;
> +	struct kvm_io_device msix_mmio_dev;
>  };
>  
>  struct kvm_irq_mask_notifier {
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index bf96ea7..5d2adc4 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -739,6 +739,137 @@ msix_entry_out:
>  
>  	return r;
>  }
> +
> +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> +			      gpa_t addr, int len, int *idx)
> +{
> +	int i;
> +
> +	if (!(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> +		return false;
> +	BUG_ON(adev->msix_mmio_base == 0);
> +	for (i = 0; i < adev->entries_nr; i++) {
> +		u64 start, end;
> +		start = adev->msix_mmio_base +
> +			adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
> +		end = start + PCI_MSIX_ENTRY_SIZE;
> +		if (addr >= start && addr + len <= end) {
> +			*idx = i;
> +			return true;
> +		}
> +	}

We really should not need guest_msix_entries at all:
if we are emulating MSIX in kernel anyway, let us just
emulate it there. Doing half setup from qemu
and half from kvm will just create problems.

If you do it all in kernel, you will simply need a single
range check to see whether this is mask write.
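
To illustrate the point, a minimal sketch of such a single range check could look
like this (msix_table_base and nr_entries stand in for whatever fields the kernel
would keep; only the PCI_MSIX_ENTRY_* constants come from the posted patches):

static bool msix_is_mask_write(u64 msix_table_base, u32 nr_entries,
			       gpa_t addr, int len)
{
	u64 offset;

	/* outside the guest's MSI-X table altogether */
	if (addr < msix_table_base ||
	    addr + len > msix_table_base +
			 (u64)nr_entries * PCI_MSIX_ENTRY_SIZE)
		return false;

	offset = addr - msix_table_base;
	/* a DWORD-aligned access to some entry's vector control word */
	return len == 4 &&
	       offset % PCI_MSIX_ENTRY_SIZE == PCI_MSIX_ENTRY_VECTOR_CTRL;
}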


> +	return false;
> +}
> +
> +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> +			  void *val)
> +{
> +	struct kvm_assigned_dev_kernel *adev =
> +			container_of(this, struct kvm_assigned_dev_kernel,
> +				     msix_mmio_dev);
> +	int idx, r = 0;
> +	u32 entry[4];
> +	struct kvm_kernel_irq_routing_entry *e;
> +
> +	mutex_lock(&adev->kvm->lock);
> +	if (!msix_mmio_in_range(adev, addr, len, &idx)) {
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if ((addr & 0x3) || len != 4) {
> +		printk(KERN_WARNING
> +			"KVM: Unaligned reading for device MSI-X MMIO! "
> +			"addr 0x%llx, len %d\n", addr, len);
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
> +	e = kvm_get_irq_routing_entry(adev->kvm,
> +			adev->guest_msix_entries[idx].vector);
> +	if (!e || e->type != KVM_IRQ_ROUTING_MSI) {
> +		printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> +			"addr 0x%llx, len %d\n", addr, len);
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	entry[0] = e->msi.address_lo;
> +	entry[1] = e->msi.address_hi;
> +	entry[2] = e->msi.data;
> +	entry[3] = !!(adev->guest_msix_entries[idx].flags &
> +			KVM_ASSIGNED_MSIX_MASK);
> +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4], len);
> +
> +out:
> +	mutex_unlock(&adev->kvm->lock);
> +	return r;
> +}
> +
> +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> +			   const void *val)
> +{
> +	struct kvm_assigned_dev_kernel *adev =
> +			container_of(this, struct kvm_assigned_dev_kernel,
> +				     msix_mmio_dev);
> +	int idx, r = 0;
> +	unsigned long new_val = *(unsigned long *)val;
> +	bool entry_masked;
> +
> +	mutex_lock(&adev->kvm->lock);
> +	if (!msix_mmio_in_range(adev, addr, len, &idx)) {
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if ((addr & 0x3) || len != 4) {
> +		printk(KERN_WARNING
> +			"KVM: Unaligned writing for device MSI-X MMIO! "
> +			"addr 0x%llx, len %d, val 0x%lx\n",
> +			addr, len, new_val);
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	entry_masked = adev->guest_msix_entries[idx].flags &
> +			KVM_ASSIGNED_MSIX_MASK;
> +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> +		/* Only allow entry modification when entry was masked */
> +		if (!entry_masked) {
> +			printk(KERN_WARNING
> +				"KVM: guest try to write unmasked MSI-X entry. "
> +				"addr 0x%llx, len %d, val 0x%lx\n",
> +				addr, len, new_val);
> +			r = 0;
> +		} else
> +			/* Leave it to QEmu */
> +			r = -EOPNOTSUPP;

So half the emulation is here half is there...
Let's just put it all in kernel and be done with it?


> +		goto out;
> +	}
> +	if (new_val & ~1ul) {
> +		printk(KERN_WARNING
> +			"KVM: Bad writing for device MSI-X MMIO! "
> +			"addr 0x%llx, len %d, val 0x%lx\n",
> +			addr, len, new_val);
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if (new_val == 1 && !entry_masked) {
> +		adev->guest_msix_entries[idx].flags |=
> +			KVM_ASSIGNED_MSIX_MASK;
> +		update_msix_mask(adev, idx);
> +	} else if (new_val == 0 && entry_masked) {
> +		adev->guest_msix_entries[idx].flags &=
> +			~KVM_ASSIGNED_MSIX_MASK;
> +		update_msix_mask(adev, idx);
> +	}
> +out:
> +	mutex_unlock(&adev->kvm->lock);
> +
> +	return r;
> +}
> +
> +static const struct kvm_io_device_ops msix_mmio_ops = {
> +	.read     = msix_mmio_read,
> +	.write    = msix_mmio_write,
> +};
> +
>  #endif
>  
>  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> -- 
> 1.7.0.1

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-20  9:46   ` Avi Kivity
  2010-10-20 10:33     ` Michael S. Tsirkin
@ 2010-10-21  6:46     ` Sheng Yang
  2010-10-21  9:27       ` Avi Kivity
  1 sibling, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-21  6:46 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin

On Wednesday 20 October 2010 17:46:47 Avi Kivity wrote:
>   On 10/20/2010 10:26 AM, Sheng Yang wrote:
> > It would be work with KVM_CAP_DEVICE_MSIX_MASK, which we would enable in
> > the last patch.
> > 
> > 
> > +struct kvm_assigned_msix_mmio {
> > +	__u32 assigned_dev_id;
> > +	__u64 base_addr;
> 
> Different alignment and size on 32 and 64 bits.
> 
> Is base_addr a guest physical address?  Do we need a size or it it fixed?

Yes, it is. The size is implied by what the guest sets up through
KVM_ASSIGN_SET_MSIX_ENTRY (i.e. by the number of entries).

> 
> > +	__u32 flags;
> > +	__u32 reserved[2];
> > +};
> > +
> > 
> > @@ -465,6 +465,8 @@ struct kvm_assigned_dev_kernel {
> > 
> >   	struct pci_dev *dev;
> >   	struct kvm *kvm;
> >   	spinlock_t assigned_dev_lock;
> > 
> > +	u64 msix_mmio_base;
> 
> gpa_t.
> 
> > +	struct kvm_io_device msix_mmio_dev;
> > 
> >   };
> >   
> >   struct kvm_irq_mask_notifier {
> > 
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index bf96ea7..5d2adc4 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > 
> > @@ -739,6 +739,137 @@ msix_entry_out:
> >   	return r;
> >   
> >   }
> > 
> > +
> > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > +			      gpa_t addr, int len, int *idx)
> > +{
> > +	int i;
> > +
> > +	if (!(adev->irq_requested_type&  KVM_DEV_IRQ_HOST_MSIX))
> > +		return false;
> 
> Just don't install the io_device in that case.

Yeah, I meant to remove this line, because entries_nr is 0 in that case.
> 
> > +	BUG_ON(adev->msix_mmio_base == 0);
> > +	for (i = 0; i<  adev->entries_nr; i++) {
> > +		u64 start, end;
> > +		start = adev->msix_mmio_base +
> > +			adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
> > +		end = start + PCI_MSIX_ENTRY_SIZE;
> > +		if (addr>= start&&  addr + len<= end) {
> > +			*idx = i;
> > +			return true;
> > +		}
> 
> What if it's a partial hit?  write part of an entry and part of another
> entry?

It can't be. The spec says accesses must be DWORD aligned, and I enforce that
check later.
> 
> > +	}
> > +	return false;
> > +}
> > +
> > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int
> > len, +			  void *val)
> > +{
> > +	struct kvm_assigned_dev_kernel *adev =
> > +			container_of(this, struct kvm_assigned_dev_kernel,
> > +				     msix_mmio_dev);
> > +	int idx, r = 0;
> > +	u32 entry[4];
> > +	struct kvm_kernel_irq_routing_entry *e;
> > +
> > +	mutex_lock(&adev->kvm->lock);
> > +	if (!msix_mmio_in_range(adev, addr, len,&idx)) {
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if ((addr&  0x3) || len != 4) {
> > +		printk(KERN_WARNING
> > +			"KVM: Unaligned reading for device MSI-X MMIO! "
> > +			"addr 0x%llx, len %d\n", addr, len);
> 
> Guest exploitable printk()
> 
> > +		r = -EOPNOTSUPP;
> 
> If the guest assigned the device to another guest, it allows the nested
> guest to kill the non-nested guest.  Need to exit in a graceful fashion.

I don't understand... It wouldn't kill anything; it would just return to QEmu/userspace.
> 
> > +		goto out;
> > +	}
> > +
> > +	e = kvm_get_irq_routing_entry(adev->kvm,
> > +			adev->guest_msix_entries[idx].vector);
> > +	if (!e || e->type != KVM_IRQ_ROUTING_MSI) {
> > +		printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > +			"addr 0x%llx, len %d\n", addr, len);
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	entry[0] = e->msi.address_lo;
> > +	entry[1] = e->msi.address_hi;
> > +	entry[2] = e->msi.data;
> > +	entry[3] = !!(adev->guest_msix_entries[idx].flags&
> > +			KVM_ASSIGNED_MSIX_MASK);
> > +	memcpy(val,&entry[addr % PCI_MSIX_ENTRY_SIZE / 4], len);
> > +
> > +out:
> > +	mutex_unlock(&adev->kvm->lock);
> > +	return r;
> > +}
> > +
> > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int
> > len, +			   const void *val)
> > +{
> > +	struct kvm_assigned_dev_kernel *adev =
> > +			container_of(this, struct kvm_assigned_dev_kernel,
> > +				     msix_mmio_dev);
> > +	int idx, r = 0;
> > +	unsigned long new_val = *(unsigned long *)val;
> > +	bool entry_masked;
> > +
> > +	mutex_lock(&adev->kvm->lock);
> > +	if (!msix_mmio_in_range(adev, addr, len,&idx)) {
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if ((addr&  0x3) || len != 4) {
> > +		printk(KERN_WARNING
> > +			"KVM: Unaligned writing for device MSI-X MMIO! "
> > +			"addr 0x%llx, len %d, val 0x%lx\n",
> > +			addr, len, new_val);
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	entry_masked = adev->guest_msix_entries[idx].flags&
> > +			KVM_ASSIGNED_MSIX_MASK;
> > +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > +		/* Only allow entry modification when entry was masked */
> > +		if (!entry_masked) {
> > +			printk(KERN_WARNING
> > +				"KVM: guest try to write unmasked MSI-X entry. "
> > +				"addr 0x%llx, len %d, val 0x%lx\n",
> > +				addr, len, new_val);
> > +			r = 0;
> 
> What does the spec says about this situation?

As Michael pointed out, the spec does indeed say the result is "undefined".

> 
> > +		} else
> > +			/* Leave it to QEmu */
> 
> s/qemu/userspace/
> 
> > +			r = -EOPNOTSUPP;
> 
> What would userspace do in this situation?  I hope you documented
> precisely what the kernel handles and what it doesn't?
> 
> I prefer more kernel code in the kernel to having an interface which is
> hard to use correctly.
> 
> > +		goto out;
> > +	}
> > +	if (new_val&  ~1ul) {
> 
> Is there a #define for this bit?

Sorry, I didn't find one. mask_msi_irq() also uses the literal 1... Maybe we can
add a define.

--
regards
Yang, Sheng

> 
> > +		printk(KERN_WARNING
> > +			"KVM: Bad writing for device MSI-X MMIO! "
> > +			"addr 0x%llx, len %d, val 0x%lx\n",
> > +			addr, len, new_val);
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if (new_val == 1&&  !entry_masked) {
> > +		adev->guest_msix_entries[idx].flags |=
> > +			KVM_ASSIGNED_MSIX_MASK;
> > +		update_msix_mask(adev, idx);
> > +	} else if (new_val == 0&&  entry_masked) {
> > +		adev->guest_msix_entries[idx].flags&=
> > +			~KVM_ASSIGNED_MSIX_MASK;
> > +		update_msix_mask(adev, idx);
> > +	}
> 
> Ah, I see you do reuse update_msix_mask().
> 
> > +out:
> > +	mutex_unlock(&adev->kvm->lock);
> > +
> > +	return r;
> > +}
> > +
> > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > +	.read     = msix_mmio_read,
> > +	.write    = msix_mmio_write,
> > +};
> > +
> > 
> >   #endif
> >   
> >   long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20 19:02 ` Marcelo Tosatti
@ 2010-10-21  7:10   ` Sheng Yang
  2010-10-21  8:21     ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-21  7:10 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, Michael S. Tsirkin

On Thursday 21 October 2010 03:02:24 Marcelo Tosatti wrote:
> On Wed, Oct 20, 2010 at 04:26:24PM +0800, Sheng Yang wrote:
> > Here is v2.
> > 
> > Changelog:
> > 
> > v1->v2
> > 
> > The major change from v1 is I've added the in-kernel MSI-X mask emulation
> > support, as well as adding shortcuts for reading MSI-X table.
> > 
> > I've taken Michael's advice to use mask/unmask directly, but unsure about
> > exporting irq_to_desc() for module...
> > 
> > Also add flush_work() according to Marcelo's comments.
> > 
> > Sheng Yang (8):
> >   PCI: MSI: Move MSI-X entry definition to pci_regs.h
> >   irq: Export irq_to_desc() to modules
> >   KVM: x86: Enable ENABLE_CAP capability for x86
> >   KVM: Move struct kvm_io_device to kvm_host.h
> >   KVM: Add kvm_get_irq_routing_entry() func
> >   KVM: assigned dev: Preparation for mask support in userspace
> >   KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
> >   KVM: Emulation MSI-X mask bits for assigned devices
> 
> Why does the current scheme, without msix per-vector mask support, is
> functional at all? Luck?

Well, I believe we have been lucky... We simply ignored the operation in the past.

I raised this issue when Michael began working on MSI-X support in QEmu long ago,
but then I was busy with other things. Now that Eddie wants to add MSI-X in-kernel
acceleration, we have come back to it...

And about the flush_work() you commented on: I still think that even for a native
device, an interrupt may be delivered shortly after the OS writes the mask bit, if
the device has already sent the message out on the bus (I am just guessing, I
haven't observed it)... The spec doesn't say that completing the mask bit write
also delivers every message already on the bus. So I think running the work a
little late should be fine as well.

--
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-20  9:51 ` [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Avi Kivity
  2010-10-20 10:44   ` Michael S. Tsirkin
@ 2010-10-21  7:41   ` Sheng Yang
  1 sibling, 0 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-21  7:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin, Alex Williamson

On Wednesday 20 October 2010 17:51:01 Avi Kivity wrote:
>   On 10/20/2010 10:26 AM, Sheng Yang wrote:
> > Here is v2.
> > 
> > Changelog:
> > 
> > v1->v2
> > 
> > The major change from v1 is I've added the in-kernel MSI-X mask emulation
> > support, as well as adding shortcuts for reading MSI-X table.
> > 
> > I've taken Michael's advice to use mask/unmask directly, but unsure about
> > exporting irq_to_desc() for module...
> > 
> > Also add flush_work() according to Marcelo's comments.
> 
> Any performance numbers?  What are the affected guests?  just RHEL 4, or
> any others?

At least the current RHEL5 series is affected. I ran a simple benchmark on a
RHEL5u5 guest with 512M memory and 1 CPU. The device was a 10G NIC with SR-IOV;
one VF was assigned to the guest to communicate with the PF in the host. Three
iperf threads in the guest pushed CPU utilization to 100%. Under that load, the
QEmu method's bandwidth was about 20% lower than the in-kernel one (~7.5G vs ~9G),
at an interrupt rate of about 20k/sec.

The reason is that the 2.6.18 kernel masks MSI in the MSI chip's ack(), which
causes a significant number of mask bit operations when the interrupt rate is high.

We have also reproduced the issue in some large-scale benchmarks on guests with
newer kernels such as 2.6.30 on Xen, under very high interrupt rates, due to an
interrupt rate limiting mechanism in the kernel.
> 
> Alex, Michael, how would you do this with vfio?

--
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-20 22:35   ` Michael S. Tsirkin
@ 2010-10-21  7:44     ` Sheng Yang
  0 siblings, 0 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-21  7:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Thursday 21 October 2010 06:35:11 Michael S. Tsirkin wrote:
> On Wed, Oct 20, 2010 at 04:26:31PM +0800, Sheng Yang wrote:
> > It would be work with KVM_CAP_DEVICE_MSIX_MASK, which we would enable in
> > the last patch.
> > 
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> 
> Merge this with patch 8 - it does not make sense to add a bunch
> of users of the field msix_mmio_base but init it in the next patch.

I just meant to make review easier; it seems I failed. :)
> 
> > ---
> > 
> >  include/linux/kvm.h      |    7 +++
> >  include/linux/kvm_host.h |    2 +
> >  virt/kvm/assigned-dev.c  |  131
> >  ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 
140
> >  insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index a699ec9..0a7bd34 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -798,4 +798,11 @@ struct kvm_assigned_msix_entry {
> > 
> >  	__u16 padding[2];
> >  
> >  };
> > 
> > +struct kvm_assigned_msix_mmio {
> > +	__u32 assigned_dev_id;
> 
> I think avi commented - there's padding here.
> 
> > +	__u64 base_addr;
> > +	__u32 flags;
> > +	__u32 reserved[2];
> > +};
> > +
> > 
> >  #endif /* __LINUX_KVM_H */
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 81a6284..b67082f 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -465,6 +465,8 @@ struct kvm_assigned_dev_kernel {
> > 
> >  	struct pci_dev *dev;
> >  	struct kvm *kvm;
> >  	spinlock_t assigned_dev_lock;
> > 
> > +	u64 msix_mmio_base;
> > +	struct kvm_io_device msix_mmio_dev;
> > 
> >  };
> >  
> >  struct kvm_irq_mask_notifier {
> > 
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index bf96ea7..5d2adc4 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > 
> > @@ -739,6 +739,137 @@ msix_entry_out:
> >  	return r;
> >  
> >  }
> > 
> > +
> > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > +			      gpa_t addr, int len, int *idx)
> > +{
> > +	int i;
> > +
> > +	if (!(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > +		return false;
> > +	BUG_ON(adev->msix_mmio_base == 0);
> > +	for (i = 0; i < adev->entries_nr; i++) {
> > +		u64 start, end;
> > +		start = adev->msix_mmio_base +
> > +			adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
> > +		end = start + PCI_MSIX_ENTRY_SIZE;
> > +		if (addr >= start && addr + len <= end) {
> > +			*idx = i;
> > +			return true;
> > +		}
> > +	}
> 
> We really should not need guest_msix_entries at all:
> if we are emulating MSIX in kernel anyway, let us just
> emulate it there. Doing half setup from qemu
> and half from kvm will just create problems.
> 
> If you do it all in kernel, you will simply need a single
> range check to see whether this is mask write.

I will explain it in a separate mail. Please comment there as well.

--
regards
Yang, Sheng

> 
> > +	return false;
> > +}
> > +
> > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int
> > len, +			  void *val)
> > +{
> > +	struct kvm_assigned_dev_kernel *adev =
> > +			container_of(this, struct kvm_assigned_dev_kernel,
> > +				     msix_mmio_dev);
> > +	int idx, r = 0;
> > +	u32 entry[4];
> > +	struct kvm_kernel_irq_routing_entry *e;
> > +
> > +	mutex_lock(&adev->kvm->lock);
> > +	if (!msix_mmio_in_range(adev, addr, len, &idx)) {
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if ((addr & 0x3) || len != 4) {
> > +		printk(KERN_WARNING
> > +			"KVM: Unaligned reading for device MSI-X MMIO! "
> > +			"addr 0x%llx, len %d\n", addr, len);
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +
> > +	e = kvm_get_irq_routing_entry(adev->kvm,
> > +			adev->guest_msix_entries[idx].vector);
> > +	if (!e || e->type != KVM_IRQ_ROUTING_MSI) {
> > +		printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > +			"addr 0x%llx, len %d\n", addr, len);
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	entry[0] = e->msi.address_lo;
> > +	entry[1] = e->msi.address_hi;
> > +	entry[2] = e->msi.data;
> > +	entry[3] = !!(adev->guest_msix_entries[idx].flags &
> > +			KVM_ASSIGNED_MSIX_MASK);
> > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4], len);
> > +
> > +out:
> > +	mutex_unlock(&adev->kvm->lock);
> > +	return r;
> > +}
> > +
> > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int
> > len, +			   const void *val)
> > +{
> > +	struct kvm_assigned_dev_kernel *adev =
> > +			container_of(this, struct kvm_assigned_dev_kernel,
> > +				     msix_mmio_dev);
> > +	int idx, r = 0;
> > +	unsigned long new_val = *(unsigned long *)val;
> > +	bool entry_masked;
> > +
> > +	mutex_lock(&adev->kvm->lock);
> > +	if (!msix_mmio_in_range(adev, addr, len, &idx)) {
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if ((addr & 0x3) || len != 4) {
> > +		printk(KERN_WARNING
> > +			"KVM: Unaligned writing for device MSI-X MMIO! "
> > +			"addr 0x%llx, len %d, val 0x%lx\n",
> > +			addr, len, new_val);
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	entry_masked = adev->guest_msix_entries[idx].flags &
> > +			KVM_ASSIGNED_MSIX_MASK;
> > +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > +		/* Only allow entry modification when entry was masked */
> > +		if (!entry_masked) {
> > +			printk(KERN_WARNING
> > +				"KVM: guest try to write unmasked MSI-X entry. "
> > +				"addr 0x%llx, len %d, val 0x%lx\n",
> > +				addr, len, new_val);
> > +			r = 0;
> > +		} else
> > +			/* Leave it to QEmu */
> > +			r = -EOPNOTSUPP;
> 
> So half the emulation is here half is there...
> Let's just put it all in kernel and be done with it?
> 
> > +		goto out;
> > +	}
> > +	if (new_val & ~1ul) {
> > +		printk(KERN_WARNING
> > +			"KVM: Bad writing for device MSI-X MMIO! "
> > +			"addr 0x%llx, len %d, val 0x%lx\n",
> > +			addr, len, new_val);
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if (new_val == 1 && !entry_masked) {
> > +		adev->guest_msix_entries[idx].flags |=
> > +			KVM_ASSIGNED_MSIX_MASK;
> > +		update_msix_mask(adev, idx);
> > +	} else if (new_val == 0 && entry_masked) {
> > +		adev->guest_msix_entries[idx].flags &=
> > +			~KVM_ASSIGNED_MSIX_MASK;
> > +		update_msix_mask(adev, idx);
> > +	}
> > +out:
> > +	mutex_unlock(&adev->kvm->lock);
> > +
> > +	return r;
> > +}
> > +
> > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > +	.read     = msix_mmio_read,
> > +	.write    = msix_mmio_write,
> > +};
> > +
> > 
> >  #endif
> >  
> >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/8][v2] MSI-X mask emulation support for assigned device
  2010-10-21  7:10   ` Sheng Yang
@ 2010-10-21  8:21     ` Michael S. Tsirkin
  0 siblings, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-21  8:21 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, Avi Kivity, kvm

On Thu, Oct 21, 2010 at 03:10:19PM +0800, Sheng Yang wrote:
> On Thursday 21 October 2010 03:02:24 Marcelo Tosatti wrote:
> > On Wed, Oct 20, 2010 at 04:26:24PM +0800, Sheng Yang wrote:
> > > Here is v2.
> > > 
> > > Changelog:
> > > 
> > > v1->v2
> > > 
> > > The major change from v1 is I've added the in-kernel MSI-X mask emulation
> > > support, as well as adding shortcuts for reading MSI-X table.
> > > 
> > > I've taken Michael's advice to use mask/unmask directly, but unsure about
> > > exporting irq_to_desc() for module...
> > > 
> > > Also add flush_work() according to Marcelo's comments.
> > > 
> > > Sheng Yang (8):
> > >   PCI: MSI: Move MSI-X entry definition to pci_regs.h
> > >   irq: Export irq_to_desc() to modules
> > >   KVM: x86: Enable ENABLE_CAP capability for x86
> > >   KVM: Move struct kvm_io_device to kvm_host.h
> > >   KVM: Add kvm_get_irq_routing_entry() func
> > >   KVM: assigned dev: Preparation for mask support in userspace
> > >   KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
> > >   KVM: Emulation MSI-X mask bits for assigned devices
> > 
> > Why does the current scheme, without msix per-vector mask support, is
> > functional at all? Luck?
> 
> Well, I believe we are lucky... We just ignored the operation in the past.
> 
> I had raised this issue when Michael begin to work on MSI-X support in QEmu long 
> ago, but then I was busy on some other things. Until now when Eddie want to add 
> MSI-X in-kernel acceleration, we back to it...
> 
> And about the "flush_work()" you commented, I still think even for the native 
> device, it's possible that the short time after OS write to the mask bit, the 
> interrupt may be delivered, if the device already send the message out on the 
> bus(I just guess, haven't observed)... The spec didn't say that the finish of 
> writing mask bit behavior would also get all message on the bus delivered. So I 
> think leave the work a little late would also be fine.

Yes ... but e.g. bus read afterwards would have to flush the
write up the bus...

> --
> regards
> Yang, Sheng

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-20  8:26 ` [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices Sheng Yang
  2010-10-20  9:49   ` Avi Kivity
  2010-10-20 22:24   ` Michael S. Tsirkin
@ 2010-10-21  8:30   ` Sheng Yang
  2010-10-21  8:39     ` Michael S. Tsirkin
  2 siblings, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-21  8:30 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin

On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> This patch enable per-vector mask for assigned devices using MSI-X.

The basic idea of how the responsibilities split between the kernel and QEmu is:

1. Because QEmu owns the irq routing table, changes to the table still go to
QEmu, as we do in msix_mmio_write().

2. Everything else can be done in the kernel, for performance. Here that covers
reads (the entry reconstructed from the routing table, plus the mask bit state
of enabled MSI-X entries) and writes to the mask bit of enabled MSI-X entries.
Originally only the mask bit was handled in the kernel, but we later found that
the Linux kernel reads the MSI-X MMIO right after every write to the mask bit,
in order to flush the write. So we added reads of the MSI data/addr as well.

3. Accesses to the mask bit of disabled entries go to QEmu, because they may
result in disabling/enabling MSI-X. This is explained below.

4. Only QEmu has knowledge of the PCI configuration space, so it is QEmu that
decides whether to enable/disable MSI-X for the device (a userspace sketch
follows this list).
5. There is a distinction between enabled and disabled entries of the MSI-X
table. The entries we have passed to pci_enable_msix() (not necessarily in
sequence) are enabled; the others are disabled. When the device's MSI-X is
enabled and the guest wants to enable a disabled entry, we go back to QEmu,
because that vector does not exist in the routing table yet. Also,
pci_enable_msix() does not let us enable vectors one by one, only all at once,
so we have to disable MSI-X first and then re-enable it with a new entry set
that contains the vector the guest wants to use. This only happens while the
device is being initialized. After that, the kernel knows about the enabled
entries and can handle their mask bits.
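
As a rough userspace-side sketch of points 1 and 4 above: QEmu keeps the routing
table and the MSI-X enable decision, and only registers the guest-physical table
base with the kernel once, using the KVM_ASSIGN_REG_MSIX_MMIO ioctl and struct
from this series. vm_fd and dev_id are assumed to come from the earlier
device-assignment setup:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int register_msix_mmio(int vm_fd, __u32 dev_id, __u64 table_gpa)
{
	struct kvm_assigned_msix_mmio mmio;

	memset(&mmio, 0, sizeof(mmio));
	mmio.assigned_dev_id = dev_id;
	mmio.base_addr = table_gpa;	/* guest-physical MSI-X table base */
	/* flags and reserved must stay 0, or the kernel returns -EINVAL */

	return ioctl(vm_fd, KVM_ASSIGN_REG_MSIX_MMIO, &mmio);
}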

I have also considered handling all MMIO operations in the kernel and changing
the irq routing directly in the kernel. But as long as the irq routing is owned
by QEmu, I think it is better to leave it to QEmu...

Note that the mask/unmask bit must be handled entirely in one place, either in
the kernel or in userspace. If the kernel handles an enabled vector's mask bit
directly, it gets out of sync with QEmu's records. That does not matter as long
as QEmu never consults those records, and the only place QEmu would want to
consult an enabled entry's mask bit is when the guest writes MSI addr/data: such
a write should be discarded if the entry is unmasked. That check is already done
by the kernel in this patchset, so we are fine here.

If we want to access an enabled entry's mask bit in the future, we can access
the device's MMIO directly. That is why I followed Michael's advice to use
mask/unmask directly.

I hope this makes the patches clearer. I meant to add comments for this
changeset, but missed doing so.

--
regards
Yang, Sheng

> 
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> ---
>  Documentation/kvm/api.txt |   22 ++++++++++++++++
>  arch/x86/kvm/x86.c        |    6 ++++
>  include/linux/kvm.h       |    8 +++++-
>  virt/kvm/assigned-dev.c   |   60
> +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 95
> insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> index d82d637..f324a50 100644
> --- a/Documentation/kvm/api.txt
> +++ b/Documentation/kvm/api.txt
> @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
>  If any additional field gets added to this structure later on, a bit for
> that additional piece of information will be set in the flags bitmap.
> 
> +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> +
> +Capability: KVM_CAP_DEVICE_MSIX_MASK
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_assigned_msix_mmio (in)
> +Returns: 0 on success, !0 on error
> +
> +struct kvm_assigned_msix_mmio {
> +	/* Assigned device's ID */
> +	__u32 assigned_dev_id;
> +	/* MSI-X table MMIO address */
> +	__u64 base_addr;
> +	/* Must be 0 */
> +	__u32 flags;
> +	/* Must be 0, reserved for future use */
> +	__u64 reserved;
> +};
> +
> +This ioctl would enable in-kernel MSI-X emulation, which would handle
> MSI-X +mask bit in the kernel.
> +
>  5. The kvm_run structure
> 
>  Application code obtains a pointer to the kvm_run structure by
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fc62546..ba07a2f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
>  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
>  	case KVM_CAP_XSAVE:
>  	case KVM_CAP_ENABLE_CAP:
> +	case KVM_CAP_DEVICE_MSIX_EXT:
> +	case KVM_CAP_DEVICE_MSIX_MASK:
>  		r = 1;
>  		break;
>  	case KVM_CAP_COALESCED_MMIO:
> @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu
> *vcpu, return -EINVAL;
> 
>  	switch (cap->cap) {
> +	case KVM_CAP_DEVICE_MSIX_EXT:
> +		vcpu->kvm->arch.msix_flags_enabled = true;
> +		r = 0;
> +		break;
>  	default:
>  		r = -EINVAL;
>  		break;
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 0a7bd34..1494ed0 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
>  #endif
>  #define KVM_CAP_PPC_GET_PVINFO 57
>  #define KVM_CAP_PPC_IRQ_LEVEL 58
> +#ifdef __KVM_HAVE_MSIX
> +#define KVM_CAP_DEVICE_MSIX_EXT 59
> +#define KVM_CAP_DEVICE_MSIX_MASK 60
> +#endif
> 
>  #ifdef KVM_CAP_IRQ_ROUTING
> 
> @@ -671,6 +675,8 @@ struct kvm_clock_data {
>  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b,
> struct kvm_clock_data) #define KVM_GET_CLOCK             _IOR(KVMIO, 
> 0x7c, struct kvm_clock_data) +#define KVM_ASSIGN_REG_MSIX_MMIO 
> _IOW(KVMIO,  0x7d, \
> +					struct kvm_assigned_msix_mmio)
>  /* Available with KVM_CAP_PIT_STATE2 */
>  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct kvm_assigned_msix_mmio {
>  	__u32 assigned_dev_id;
>  	__u64 base_addr;
>  	__u32 flags;
> -	__u32 reserved[2];
> +	__u64 reserved;
>  };
> 
>  #endif /* __LINUX_KVM_H */
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 5d2adc4..9573194 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -17,6 +17,8 @@
>  #include <linux/pci.h>
>  #include <linux/interrupt.h>
>  #include <linux/slab.h>
> +#include <linux/irqnr.h>
> +
>  #include "irq.h"
> 
>  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> list_head *head, @@ -169,6 +171,14 @@ static void deassign_host_irq(struct
> kvm *kvm, */
>  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
>  		int i;
> +#ifdef __KVM_HAVE_MSIX
> +		if (assigned_dev->msix_mmio_base) {
> +			mutex_lock(&kvm->slots_lock);
> +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> +					&assigned_dev->msix_mmio_dev);
> +			mutex_unlock(&kvm->slots_lock);
> +		}
> +#endif
>  		for (i = 0; i < assigned_dev->entries_nr; i++)
>  			disable_irq_nosync(assigned_dev->
>  					   host_msix_entries[i].vector);
> @@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct kvm
> *kvm, goto err;
>  	}
> 
> +	if (dev->msix_mmio_base) {
> +		mutex_lock(&kvm->slots_lock);
> +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> +				&dev->msix_mmio_dev);
> +		mutex_unlock(&kvm->slots_lock);
> +		if (r)
> +			goto err;
> +	}
> +
>  	return 0;
>  err:
>  	for (i -= 1; i >= 0; i--)
> @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops msix_mmio_ops =
> { .write    = msix_mmio_write,
>  };
> 
> +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> +				struct kvm_assigned_msix_mmio *msix_mmio)
> +{
> +	int r = 0;
> +	struct kvm_assigned_dev_kernel *adev;
> +
> +	mutex_lock(&kvm->lock);
> +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> +				      msix_mmio->assigned_dev_id);
> +	if (!adev) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +	if (msix_mmio->base_addr == 0) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +	adev->msix_mmio_base = msix_mmio->base_addr;
> +
> +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> +out:
> +	mutex_unlock(&kvm->lock);
> +
> +	return r;
> +}
>  #endif
> 
>  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm,
> unsigned ioctl, goto out;
>  		break;
>  	}
> +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> +		struct kvm_assigned_msix_mmio msix_mmio;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> +			goto out;
> +
> +		r = -EINVAL;
> +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> +			goto out;
> +
> +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> +		if (r)
> +			goto out;
> +		break;
> +	}
>  #endif
>  	}
>  out:

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-21  8:30   ` Sheng Yang
@ 2010-10-21  8:39     ` Michael S. Tsirkin
  2010-10-22  4:42       ` Sheng Yang
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-21  8:39 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote:
> On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> > This patch enable per-vector mask for assigned devices using MSI-X.
> 
> The basic idea of kernel and QEmu's responsibilities are:
> 
> 1. Because QEmu owned the irq routing table, so the change of table should still 
> go to the QEmu, like we did in msix_mmio_write().
> 
> 2. And the others things can be done in kernel, for performance. Here we covered 
> the reading(converted entry from routing table and mask bit state of enabled MSI-X 
> entries), and writing the mask bit for enabled MSI-X entries. Originally we only 
> has mask bit handled in kernel, but later we found that Linux kernel would read 
> MSI-X mmio just after every writing to mask bit, in order to flush the writing. So 
> we add reading MSI data/addr as well.
> 
> 3. Disabled entries's mask bit accessing would go to QEmu, because it may result 
> in disable/enable MSI-X. Explained later.
> 
> 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu to 
> decide enable/disable MSI-X for device.
> .

Config space yes, but it's a simple global yes/no after all.

> 5. There is an distinction between enabled entry and disabled entry of MSI-X 
> table.

That's my point. There's no such thing as 'enabled entries'
in the spec. There are only masked and unmasked entries.

The current interface deals with gsi numbers, so qemu had to work around
this. The hack used there is removing the gsi for a masked vector which has 0
address and data.  It works because this is what linux and windows
guests happen to do, but it is out of spec: the vector/data values for a
masked entry have no meaning.

Since you are building a new interface, you can design it without these
constraints...

> The entries we had used for pci_enable_msix()(not necessary in sequence 
> number) are already enabled, the others are disabled. When device's MSI-X is 
> enabled and guest want to enable an disabled entry, we would go back to QEmu  
> because this vector didn't exist in the routing table. Also due to 
> pci_enable_msix() in kernel didn't allow us to enable vectors one by one, but all 
> at once. So we have to disable MSI-X first, then enable it with new entries, which 
> contained the new vector guest want to use. This situation is only happen when 
> device is being initialized. After that, kernel can know and handle the mask bit 
> of the enabled entry.
> 
> I've also considered handle all MMIO operation in kernel, and changing irq routing 
> in kernel directly. But as long as irq routing is owned by QEmu, I think it's 
> better to leave to it...

Yes, this is my suggestion, except we don't need no routing :)
To inject MSI you just need address/data pairs.
Look at kvm_set_msi: given address/data you can just inject
the interrupt. No need for table lookups.
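
In other words, something along these lines would be enough. This is a sketch
only: the helper name is made up, and it assumes kvm_set_msi() (or an equivalent)
is callable from this file; the struct layout is the one already used by the
routing code:

static void inject_msix_vector(struct kvm *kvm, int irq_source_id,
			       u32 addr_lo, u32 addr_hi, u32 data)
{
	struct kvm_kernel_irq_routing_entry e = {
		.type = KVM_IRQ_ROUTING_MSI,
		.msi = {
			.address_lo = addr_lo,
			.address_hi = addr_hi,
			.data       = data,
		},
	};

	/* decode the address/data pair and inject, no table lookup needed */
	kvm_set_msi(&e, kvm, irq_source_id, 1);
}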

> Notice the mask/unmask bits must be handled together, either in kernel or in 
> userspace. Because if kernel has handled enabled vector's mask bit directly, it 
> would be unsync with QEmu's records. It doesn't matter when QEmu don't access the 
> related record. And the only place QEmu want to consult it's enabled entries' mask 
> bit state is writing to MSI addr/data. The writing should be discarded if the 
> entry is unmasked. This checking has already been done by kernel in this patchset, 
> so we are fine here.
> 
> If we want to access the enabled entries' mask bit in the future, we can directly 
> access device's MMIO.

We really must implement this for correctness, btw. If you do not pass
reads to the device, messages intended for the masked entry
might still be in flight.
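
For example, the read side could forward the vector control dword to the real
device rather than return cached state. This is a sketch only: msix_table_va (a
host mapping of the device's MSI-X table) does not exist in the posted patches
and is assumed here:

static u32 msix_read_vector_ctrl(struct kvm_assigned_dev_kernel *adev,
				 int host_entry)
{
	void __iomem *entry = adev->msix_table_va +
			      host_entry * PCI_MSIX_ENTRY_SIZE;

	/* a real device read flushes any message writes still in flight */
	return readl(entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
}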

> That's the reason why I have followed Michael's advice to use
> mask/unmask directly.
> Hope this would make the patches more clear. I meant to add comments for this 
> changeset, but miss it later.
> 
> --
> regards
> Yang, Sheng
> 
> > 
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > ---
> >  Documentation/kvm/api.txt |   22 ++++++++++++++++
> >  arch/x86/kvm/x86.c        |    6 ++++
> >  include/linux/kvm.h       |    8 +++++-
> >  virt/kvm/assigned-dev.c   |   60
> > +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 95
> > insertions(+), 1 deletions(-)
> > 
> > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > index d82d637..f324a50 100644
> > --- a/Documentation/kvm/api.txt
> > +++ b/Documentation/kvm/api.txt
> > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> >  If any additional field gets added to this structure later on, a bit for
> > that additional piece of information will be set in the flags bitmap.
> > 
> > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > +
> > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_assigned_msix_mmio (in)
> > +Returns: 0 on success, !0 on error
> > +
> > +struct kvm_assigned_msix_mmio {
> > +	/* Assigned device's ID */
> > +	__u32 assigned_dev_id;
> > +	/* MSI-X table MMIO address */
> > +	__u64 base_addr;
> > +	/* Must be 0 */
> > +	__u32 flags;
> > +	/* Must be 0, reserved for future use */
> > +	__u64 reserved;
> > +};
> > +
> > +This ioctl would enable in-kernel MSI-X emulation, which would handle
> > MSI-X +mask bit in the kernel.
> > +
> >  5. The kvm_run structure
> > 
> >  Application code obtains a pointer to the kvm_run structure by
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index fc62546..ba07a2f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> >  	case KVM_CAP_XSAVE:
> >  	case KVM_CAP_ENABLE_CAP:
> > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > +	case KVM_CAP_DEVICE_MSIX_MASK:
> >  		r = 1;
> >  		break;
> >  	case KVM_CAP_COALESCED_MMIO:
> > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu
> > *vcpu, return -EINVAL;
> > 
> >  	switch (cap->cap) {
> > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > +		vcpu->kvm->arch.msix_flags_enabled = true;
> > +		r = 0;
> > +		break;
> >  	default:
> >  		r = -EINVAL;
> >  		break;
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 0a7bd34..1494ed0 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
> >  #endif
> >  #define KVM_CAP_PPC_GET_PVINFO 57
> >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > +#ifdef __KVM_HAVE_MSIX
> > +#define KVM_CAP_DEVICE_MSIX_EXT 59
> > +#define KVM_CAP_DEVICE_MSIX_MASK 60
> > +#endif
> > 
> >  #ifdef KVM_CAP_IRQ_ROUTING
> > 
> > @@ -671,6 +675,8 @@ struct kvm_clock_data {
> >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b,
> > struct kvm_clock_data) #define KVM_GET_CLOCK             _IOR(KVMIO, 
> > 0x7c, struct kvm_clock_data) +#define KVM_ASSIGN_REG_MSIX_MMIO 
> > _IOW(KVMIO,  0x7d, \
> > +					struct kvm_assigned_msix_mmio)
> >  /* Available with KVM_CAP_PIT_STATE2 */
> >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> > struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct kvm_assigned_msix_mmio {
> >  	__u32 assigned_dev_id;
> >  	__u64 base_addr;
> >  	__u32 flags;
> > -	__u32 reserved[2];
> > +	__u64 reserved;
> >  };
> > 
> >  #endif /* __LINUX_KVM_H */
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index 5d2adc4..9573194 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -17,6 +17,8 @@
> >  #include <linux/pci.h>
> >  #include <linux/interrupt.h>
> >  #include <linux/slab.h>
> > +#include <linux/irqnr.h>
> > +
> >  #include "irq.h"
> > 
> >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> > list_head *head, @@ -169,6 +171,14 @@ static void deassign_host_irq(struct
> > kvm *kvm, */
> >  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> >  		int i;
> > +#ifdef __KVM_HAVE_MSIX
> > +		if (assigned_dev->msix_mmio_base) {
> > +			mutex_lock(&kvm->slots_lock);
> > +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > +					&assigned_dev->msix_mmio_dev);
> > +			mutex_unlock(&kvm->slots_lock);
> > +		}
> > +#endif
> >  		for (i = 0; i < assigned_dev->entries_nr; i++)
> >  			disable_irq_nosync(assigned_dev->
> >  					   host_msix_entries[i].vector);
> > @@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct kvm
> > *kvm, goto err;
> >  	}
> > 
> > +	if (dev->msix_mmio_base) {
> > +		mutex_lock(&kvm->slots_lock);
> > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > +				&dev->msix_mmio_dev);
> > +		mutex_unlock(&kvm->slots_lock);
> > +		if (r)
> > +			goto err;
> > +	}
> > +
> >  	return 0;
> >  err:
> >  	for (i -= 1; i >= 0; i--)
> > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops msix_mmio_ops =
> > { .write    = msix_mmio_write,
> >  };
> > 
> > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > +{
> > +	int r = 0;
> > +	struct kvm_assigned_dev_kernel *adev;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > +				      msix_mmio->assigned_dev_id);
> > +	if (!adev) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +	if (msix_mmio->base_addr == 0) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > +
> > +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > +out:
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	return r;
> > +}
> >  #endif
> > 
> >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm,
> > unsigned ioctl, goto out;
> >  		break;
> >  	}
> > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > +		struct kvm_assigned_msix_mmio msix_mmio;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > +			goto out;
> > +
> > +		r = -EINVAL;
> > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > +		if (r)
> > +			goto out;
> > +		break;
> > +	}
> >  #endif
> >  	}
> >  out:

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-21  9:27       ` Avi Kivity
@ 2010-10-21  9:24         ` Michael S. Tsirkin
  2010-10-21  9:47           ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-21  9:24 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Sheng Yang, Marcelo Tosatti, kvm

On Thu, Oct 21, 2010 at 11:27:30AM +0200, Avi Kivity wrote:
>  On 10/21/2010 08:46 AM, Sheng Yang wrote:
> >>  >  +		r = -EOPNOTSUPP;
> >>
> >>  If the guest assigned the device to another guest, it allows the nested
> >>  guest to kill the non-nested guest.  Need to exit in a graceful fashion.
> >
> >Don't understand... It wouldn't result in kill but return to QEmu/userspace.
> 
> What would qemu do on EOPNOTSUPP?  It has no way of knowing that
> this was triggered by an unsupported msix access.  What can it do?
> 
> Best to just ignore the write.
> 
> If you're worried about debugging, we can have a trace_kvm_discard()
> tracepoint that logs the address and a type enum field that explains
> why an access was discarded.

The issue is that the same page is used for mask and entry programming.

> >>  >  +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> >>  >  +		/* Only allow entry modification when entry was masked */
> >>  >  +		if (!entry_masked) {
> >>  >  +			printk(KERN_WARNING
> >>  >  +				"KVM: guest try to write unmasked MSI-X entry. "
> >>  >  +				"addr 0x%llx, len %d, val 0x%lx\n",
> >>  >  +				addr, len, new_val);
> >>  >  +			r = 0;
> >>
> >>  What does the spec says about this situation?
> >
> >As Michael pointed out. The spec said the result is "undefined" indeed.
> 
> Ok.  Then we should silently discard the write instead of allowing
> the guest to flood host dmesg.
> 
> >>
> >>  >  +		goto out;
> >>  >  +	}
> >>  >  +	if (new_val&   ~1ul) {
> >>
> >>  Is there a #define for this bit?
> >
> >Sorry I didn't find it. mask_msi_irq() also use the number 1... Maybe we can add
> >one.
> 
> Yes please.
> 
> 
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-21  6:46     ` Sheng Yang
@ 2010-10-21  9:27       ` Avi Kivity
  2010-10-21  9:24         ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-10-21  9:27 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, kvm, Michael S. Tsirkin

  On 10/21/2010 08:46 AM, Sheng Yang wrote:
> >  >  +		r = -EOPNOTSUPP;
> >
> >  If the guest assigned the device to another guest, it allows the nested
> >  guest to kill the non-nested guest.  Need to exit in a graceful fashion.
>
> Don't understand... It wouldn't result in kill but return to QEmu/userspace.

What would qemu do on EOPNOTSUPP?  It has no way of knowing that this 
was triggered by an unsupported msix access.  What can it do?

Best to just ignore the write.

If you're worried about debugging, we can have a trace_kvm_discard() 
tracepoint that logs the address and a type enum field that explains why 
an access was discarded.
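
Such a tracepoint could look roughly like the sketch below (the event name and
fields are assumptions, not something in this series; the boilerplate follows
the usual trace header layout):

#undef TRACE_SYSTEM
#define TRACE_SYSTEM kvm

#if !defined(_TRACE_KVM_MSIX_DISCARD_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_KVM_MSIX_DISCARD_H

#include <linux/tracepoint.h>

TRACE_EVENT(kvm_msix_discard,
	TP_PROTO(u64 addr, int len, u64 val),
	TP_ARGS(addr, len, val),

	TP_STRUCT__entry(
		__field(u64, addr)
		__field(int, len)
		__field(u64, val)
	),

	TP_fast_assign(
		__entry->addr = addr;
		__entry->len  = len;
		__entry->val  = val;
	),

	TP_printk("discarded MSI-X access addr 0x%llx len %d val 0x%llx",
		  __entry->addr, __entry->len, __entry->val)
);

#endif /* _TRACE_KVM_MSIX_DISCARD_H */

#include <trace/define_trace.h>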

> >  >  +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> >  >  +		/* Only allow entry modification when entry was masked */
> >  >  +		if (!entry_masked) {
> >  >  +			printk(KERN_WARNING
> >  >  +				"KVM: guest try to write unmasked MSI-X entry. "
> >  >  +				"addr 0x%llx, len %d, val 0x%lx\n",
> >  >  +				addr, len, new_val);
> >  >  +			r = 0;
> >
> >  What does the spec says about this situation?
>
> As Michael pointed out. The spec said the result is "undefined" indeed.

Ok.  Then we should silently discard the write instead of allowing the 
guest to flood host dmesg.

> >
> >  >  +		goto out;
> >  >  +	}
> >  >  +	if (new_val&   ~1ul) {
> >
> >  Is there a #define for this bit?
>
> Sorry I didn't find it. mask_msi_irq() also use the number 1... Maybe we can add
> one.

Yes please.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-21  9:24         ` Michael S. Tsirkin
@ 2010-10-21  9:47           ` Avi Kivity
  2010-10-21 10:51             ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-10-21  9:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Sheng Yang, Marcelo Tosatti, kvm

  On 10/21/2010 11:24 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 21, 2010 at 11:27:30AM +0200, Avi Kivity wrote:
> >   On 10/21/2010 08:46 AM, Sheng Yang wrote:
> >  >>   >   +		r = -EOPNOTSUPP;
> >  >>
> >  >>   If the guest assigned the device to another guest, it allows the nested
> >  >>   guest to kill the non-nested guest.  Need to exit in a graceful fashion.
> >  >
> >  >Don't understand... It wouldn't result in kill but return to QEmu/userspace.
> >
> >  What would qemu do on EOPNOTSUPP?  It has no way of knowing that
> >  this was triggered by an unsupported msix access.  What can it do?
> >
> >  Best to just ignore the write.
> >
> >  If you're worried about debugging, we can have a trace_kvm_discard()
> >  tracepoint that logs the address and a type enum field that explains
> >  why an access was discarded.
>
> The issue is that the same page is used for mask and entry programming.

Yeah.  For that use the normal mmio exit_reason.  I was referring to 
misaligned writes.

I'm not happy with partial emulation, but I'm not happy either with yet 
another interface to communicate the decoded MSI BAR writes to 
userspace.  Shall we just reprogram the irq routing table?


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing
  2010-10-21  9:47           ` Avi Kivity
@ 2010-10-21 10:51             ` Michael S. Tsirkin
  0 siblings, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-21 10:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Sheng Yang, Marcelo Tosatti, kvm

On Thu, Oct 21, 2010 at 11:47:18AM +0200, Avi Kivity wrote:
>  On 10/21/2010 11:24 AM, Michael S. Tsirkin wrote:
> >On Thu, Oct 21, 2010 at 11:27:30AM +0200, Avi Kivity wrote:
> >>   On 10/21/2010 08:46 AM, Sheng Yang wrote:
> >>  >>   >   +		r = -EOPNOTSUPP;
> >>  >>
> >>  >>   If the guest assigned the device to another guest, it allows the nested
> >>  >>   guest to kill the non-nested guest.  Need to exit in a graceful fashion.
> >>  >
> >>  >Don't understand... It wouldn't result in kill but return to QEmu/userspace.
> >>
> >>  What would qemu do on EOPNOTSUPP?  It has no way of knowing that
> >>  this was triggered by an unsupported msix access.  What can it do?
> >>
> >>  Best to just ignore the write.
> >>
> >>  If you're worried about debugging, we can have a trace_kvm_discard()
> >>  tracepoint that logs the address and a type enum field that explains
> >>  why an access was discarded.
> >
> >The issue is that the same page is used for mask and entry programming.
> 
> Yeah.  For that use the normal mmio exit_reason.  I was referring to
> misaligned writes.

Yes, I think we can just drop them if we like. Might be a good idea to
stick a trace point there for debugging.

> I'm not happy with partial emulation, but I'm not happy either with
> yet another interface to communicate the decoded MSI BAR writes to
> userspace.  Shall we just reprogram the irq routing table?

Well, we don't need to touch the routing table at all.
We have the MSI mask, address and data in the kernel.
That is enough to interrupt the guest.

For vhost-net, we would need an interface that maps an irqfd
to a vector # (not a gsi), or alternatively maps a gsi to
a device/vector-number pair. This is easy to implement.

For VFIO, we have a problem, especially if we try to work without interrupt
remapping, as we can't just program all entries for all devices.  So the
solution could be some combination of requiring interrupt remapping, looking
at addresses, and looking at the pending bit.

New vectors are added/removed rarely. So maybe we can get away with:
1. an interface to read the MSIX table from userspace (good for debugging anyway)
2. an eventfd to signal on MSIX table writes
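
Purely as an illustration of the shape this could take (none of these names or
fields exist anywhere; they are made up here for discussion only):

/* hypothetical uapi additions, for discussion only */
struct kvm_msix_table_read {
	__u32 assigned_dev_id;
	__u32 entry;		/* MSI-X table entry index to read */
	__u32 dword[4];		/* addr_lo, addr_hi, data, vector control */
};

struct kvm_msix_table_watch {
	__u32 assigned_dev_id;
	__s32 fd;		/* eventfd signalled on MSI-X table writes */
};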


> 
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-21  8:39     ` Michael S. Tsirkin
@ 2010-10-22  4:42       ` Sheng Yang
  2010-10-22 10:17         ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-22  4:42 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Thursday 21 October 2010 16:39:07 Michael S. Tsirkin wrote:
> On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote:
> > On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> > > This patch enable per-vector mask for assigned devices using MSI-X.
> > 
> > The basic idea of kernel and QEmu's responsibilities are:
> > 
> > 1. Because QEmu owned the irq routing table, so the change of table
> > should still go to the QEmu, like we did in msix_mmio_write().
> > 
> > 2. And the others things can be done in kernel, for performance. Here we
> > covered the reading(converted entry from routing table and mask bit
> > state of enabled MSI-X entries), and writing the mask bit for enabled
> > MSI-X entries. Originally we only has mask bit handled in kernel, but
> > later we found that Linux kernel would read MSI-X mmio just after every
> > writing to mask bit, in order to flush the writing. So we add reading
> > MSI data/addr as well.
> > 
> > 3. Disabled entries's mask bit accessing would go to QEmu, because it may
> > result in disable/enable MSI-X. Explained later.
> > 
> > 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu to
> > decide enable/disable MSI-X for device.
> > .
> 
> Config space yes, but it's a simple global yes/no after all.
> 
> > 5. There is an distinction between enabled entry and disabled entry of
> > MSI-X table.
> 
> That's my point. There's no such thing as 'enabled entries'
> in the spec. There are only masked and unmasked entries.
>
> Current interface deals with gsi numbers so qemu had to work around
> this. The hack used there is removing gsi for masked vector which has 0
> address and data.  It works because this is what linux and windows
> guests happen to do, but it is out of spec: vector/data value for a
> masked entry have no meaning.

Well, I just realized something unnatural about the entries with 0 data/address.
So I checked the spec again, and found that the mask bit should be set after
reset. Once that is fixed, I think unmasked entries with 0 address/data shouldn't
appear anymore.
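
For the emulation side this just means initializing the vector control DWORD
of every entry with the mask bit set on (emulated) reset - a minimal sketch,
assuming msix_table[] is wherever we keep the emulated entries:

	/* per spec, every entry comes out of reset masked; addr/data have no
	 * meaning until the guest programs and unmasks the entry */
	for (i = 0; i < nr_entries; i++)
		msix_table[i].ctrl |= 1;	/* bit 0: per-vector mask */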
 
> Since you are building a new interface, can design it without
> constraints...

A constraint is pci_enable_msix(). We have to use it to allocate an irq for each
entry, as well as to program the entry in the real hardware. pci_enable_msix() is
an all-or-nothing choice: we can't add newly enabled entries after
pci_enable_msix(), and the kernel API only lets us enable/disable/mask/unmask an
IRQ, not an individual entry in the MSI-X table. We also still have to allocate a
new IRQ for each new entry. So when the guest unmasks a "disabled entry", we have
to disable and re-enable MSI-X in order to use the new entry. That's why the
"enabled/disabled entry" concept exists.

So even if the guest only unmasked one entry, it's a completely different piece of
work for KVM underneath. This logic won't change no matter where the MMIO handling
lives. And in fact I don't like having this kind of trickery in the kernel...
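
To illustrate the "unmask a disabled entry" path described above, this is
roughly what KVM has to do under the current API (the two helpers are assumed,
not real functions, and error handling is omitted):

	/* tear everything down: free the host irqs, then pci_disable_msix() */
	deassign_all_host_msix(adev);			/* assumed helper */
	/* redo the whole set, now including the newly unmasked entry */
	adev->entries_nr++;
	r = pci_enable_msix(adev->dev, adev->host_msix_entries, adev->entries_nr);
	if (!r)
		request_all_host_msix_irqs(adev);	/* assumed helper */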

> > The entries we had used for pci_enable_msix()(not necessary in sequence
> > number) are already enabled, the others are disabled. When device's MSI-X
> > is enabled and guest want to enable an disabled entry, we would go back
> > to QEmu because this vector didn't exist in the routing table. Also due
> > to pci_enable_msix() in kernel didn't allow us to enable vectors one by
> > one, but all at once. So we have to disable MSI-X first, then enable it
> > with new entries, which contained the new vector guest want to use. This
> > situation is only happen when device is being initialized. After that,
> > kernel can know and handle the mask bit of the enabled entry.
> > 
> > I've also considered handle all MMIO operation in kernel, and changing
> > irq routing in kernel directly. But as long as irq routing is owned by
> > QEmu, I think it's better to leave to it...
> 
> Yes, this is my suggestion, except we don't need no routing :)
> To inject MSI you just need address/data pairs.
> Look at kvm_set_msi: given address/data you can just inject
> the interrupt. No need for table lookups.

You still need to look up the data/address pair in the guest MSI-X table. The
routing table used here is just a replacement for that table, because we can
reconstruct the entry from the routing table. So there are two choices: use the
routing table, or create a new MSI-X table.

Still, the key question is who owns the routing/MSI-X table. If the kernel owns
it, it would be straightforward to intercept all the MMIO in the kernel; but if
it's QEmu, we still need to go back to QEmu for it.
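
For reference, this is roughly what injection from a raw address/data pair
looks like (see kvm_set_msi() in virt/kvm/irq_comm.c); whether it can be
called directly from the MMIO handler context is an assumption of this sketch:

	struct kvm_kernel_irq_routing_entry e = {};

	e.msi.address_lo = addr_lo;
	e.msi.address_hi = addr_hi;
	e.msi.data       = data;
	kvm_set_msi(&e, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1);

So the lookup is really only needed to find the address/data pair itself,
wherever that table ends up living.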
 
> > Notice the mask/unmask bits must be handled together, either in kernel or
> > in userspace. Because if kernel has handled enabled vector's mask bit
> > directly, it would be unsync with QEmu's records. It doesn't matter when
> > QEmu don't access the related record. And the only place QEmu want to
> > consult it's enabled entries' mask bit state is writing to MSI
> > addr/data. The writing should be discarded if the entry is unmasked.
> > This checking has already been done by kernel in this patchset, so we
> > are fine here.
> > 
> > If we want to access the enabled entries' mask bit in the future, we can
> > directly access device's MMIO.
> 
> We really must implement this for correctness, btw. If you do not pass
> reads to the device, messages intended for the masked entry
> might still be in flight.

Oh, yes, the kernel would also need to mask the entry on the device itself. I will
take this into consideration.
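
A minimal sketch of that, using the offsets from patch 1 of this series and
assuming 'base' is the ioremapped MSI-X table of the assigned device:

	void __iomem *ctrl = base + entry * PCI_MSIX_ENTRY_SIZE +
			     PCI_MSIX_ENTRY_VECTOR_CTRL;

	writel(readl(ctrl) | 1, ctrl);	/* bit 0: per-vector mask */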

--
regards
Yang, Sheng

> 
> > That's the reason why I have followed Michael's advice to use
> > mask/unmask directly.
> > Hope this would make the patches more clear. I meant to add comments for
> > this changeset, but miss it later.
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > ---
> > > 
> > >  Documentation/kvm/api.txt |   22 ++++++++++++++++
> > >  arch/x86/kvm/x86.c        |    6 ++++
> > >  include/linux/kvm.h       |    8 +++++-
> > >  virt/kvm/assigned-dev.c   |   60
> > > 
> > > +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 
95
> > > insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > > index d82d637..f324a50 100644
> > > --- a/Documentation/kvm/api.txt
> > > +++ b/Documentation/kvm/api.txt
> > > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> > > 
> > >  If any additional field gets added to this structure later on, a bit
> > >  for
> > > 
> > > that additional piece of information will be set in the flags bitmap.
> > > 
> > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > > +
> > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_assigned_msix_mmio (in)
> > > +Returns: 0 on success, !0 on error
> > > +
> > > +struct kvm_assigned_msix_mmio {
> > > +	/* Assigned device's ID */
> > > +	__u32 assigned_dev_id;
> > > +	/* MSI-X table MMIO address */
> > > +	__u64 base_addr;
> > > +	/* Must be 0 */
> > > +	__u32 flags;
> > > +	/* Must be 0, reserved for future use */
> > > +	__u64 reserved;
> > > +};
> > > +
> > > +This ioctl would enable in-kernel MSI-X emulation, which would handle
> > > MSI-X +mask bit in the kernel.
> > > +
> > > 
> > >  5. The kvm_run structure
> > >  
> > >  Application code obtains a pointer to the kvm_run structure by
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index fc62546..ba07a2f 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > 
> > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > >  	case KVM_CAP_XSAVE:
> > > 
> > >  	case KVM_CAP_ENABLE_CAP:
> > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > 
> > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > >  		r = 1;
> > >  		break;
> > >  	
> > >  	case KVM_CAP_COALESCED_MMIO:
> > > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct
> > > kvm_vcpu *vcpu, return -EINVAL;
> > > 
> > >  	switch (cap->cap) {
> > > 
> > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > +		vcpu->kvm->arch.msix_flags_enabled = true;
> > > +		r = 0;
> > > +		break;
> > > 
> > >  	default:
> > >  		r = -EINVAL;
> > >  		break;
> > > 
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index 0a7bd34..1494ed0 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
> > > 
> > >  #endif
> > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > 
> > > +#ifdef __KVM_HAVE_MSIX
> > > +#define KVM_CAP_DEVICE_MSIX_EXT 59
> > > +#define KVM_CAP_DEVICE_MSIX_MASK 60
> > > +#endif
> > > 
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > 
> > > @@ -671,6 +675,8 @@ struct kvm_clock_data {
> > > 
> > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > > 
> > > kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO, 
> > > 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK            
> > > _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define
> > > KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO,  0x7d, \
> > > +					struct kvm_assigned_msix_mmio)
> > > 
> > >  /* Available with KVM_CAP_PIT_STATE2 */
> > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > > 
> > > kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> > > struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct
> > > kvm_assigned_msix_mmio {
> > > 
> > >  	__u32 assigned_dev_id;
> > >  	__u64 base_addr;
> > >  	__u32 flags;
> > > 
> > > -	__u32 reserved[2];
> > > +	__u64 reserved;
> > > 
> > >  };
> > >  
> > >  #endif /* __LINUX_KVM_H */
> > > 
> > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > index 5d2adc4..9573194 100644
> > > --- a/virt/kvm/assigned-dev.c
> > > +++ b/virt/kvm/assigned-dev.c
> > > @@ -17,6 +17,8 @@
> > > 
> > >  #include <linux/pci.h>
> > >  #include <linux/interrupt.h>
> > >  #include <linux/slab.h>
> > > 
> > > +#include <linux/irqnr.h>
> > > +
> > > 
> > >  #include "irq.h"
> > >  
> > >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> > > 
> > > list_head *head, @@ -169,6 +171,14 @@ static void
> > > deassign_host_irq(struct kvm *kvm, */
> > > 
> > >  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> > >  	
> > >  		int i;
> > > 
> > > +#ifdef __KVM_HAVE_MSIX
> > > +		if (assigned_dev->msix_mmio_base) {
> > > +			mutex_lock(&kvm->slots_lock);
> > > +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > +					&assigned_dev->msix_mmio_dev);
> > > +			mutex_unlock(&kvm->slots_lock);
> > > +		}
> > > +#endif
> > > 
> > >  		for (i = 0; i < assigned_dev->entries_nr; i++)
> > >  		
> > >  			disable_irq_nosync(assigned_dev->
> > >  			
> > >  					   host_msix_entries[i].vector);
> > > 
> > > @@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct
> > > kvm *kvm, goto err;
> > > 
> > >  	}
> > > 
> > > +	if (dev->msix_mmio_base) {
> > > +		mutex_lock(&kvm->slots_lock);
> > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > +				&dev->msix_mmio_dev);
> > > +		mutex_unlock(&kvm->slots_lock);
> > > +		if (r)
> > > +			goto err;
> > > +	}
> > > +
> > > 
> > >  	return 0;
> > >  
> > >  err:
> > >  	for (i -= 1; i >= 0; i--)
> > > 
> > > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops
> > > msix_mmio_ops = { .write    = msix_mmio_write,
> > > 
> > >  };
> > > 
> > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > +{
> > > +	int r = 0;
> > > +	struct kvm_assigned_dev_kernel *adev;
> > > +
> > > +	mutex_lock(&kvm->lock);
> > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > +				      msix_mmio->assigned_dev_id);
> > > +	if (!adev) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +	if (msix_mmio->base_addr == 0) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > +
> > > +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > +out:
> > > +	mutex_unlock(&kvm->lock);
> > > +
> > > +	return r;
> > > +}
> > > 
> > >  #endif
> > >  
> > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > 
> > > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > *kvm, unsigned ioctl, goto out;
> > > 
> > >  		break;
> > >  	
> > >  	}
> > > 
> > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > +
> > > +		r = -EFAULT;
> > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > +			goto out;
> > > +
> > > +		r = -EINVAL;
> > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > +			goto out;
> > > +
> > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > +		if (r)
> > > +			goto out;
> > > +		break;
> > > +	}
> > > 
> > >  #endif
> > >  
> > >  	}
> > >  
> > >  out:

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-22  4:42       ` Sheng Yang
@ 2010-10-22 10:17         ` Michael S. Tsirkin
  2010-10-22 13:30           ` Sheng Yang
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-22 10:17 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Fri, Oct 22, 2010 at 12:42:43PM +0800, Sheng Yang wrote:
> On Thursday 21 October 2010 16:39:07 Michael S. Tsirkin wrote:
> > On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote:
> > > On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> > > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > 
> > > The basic idea of kernel and QEmu's responsibilities are:
> > > 
> > > 1. Because QEmu owned the irq routing table, so the change of table
> > > should still go to the QEmu, like we did in msix_mmio_write().
> > > 
> > > 2. And the others things can be done in kernel, for performance. Here we
> > > covered the reading(converted entry from routing table and mask bit
> > > state of enabled MSI-X entries), and writing the mask bit for enabled
> > > MSI-X entries. Originally we only has mask bit handled in kernel, but
> > > later we found that Linux kernel would read MSI-X mmio just after every
> > > writing to mask bit, in order to flush the writing. So we add reading
> > > MSI data/addr as well.
> > > 
> > > 3. Disabled entries's mask bit accessing would go to QEmu, because it may
> > > result in disable/enable MSI-X. Explained later.
> > > 
> > > 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu to
> > > decide enable/disable MSI-X for device.
> > > .
> > 
> > Config space yes, but it's a simple global yes/no after all.
> > 
> > > 5. There is an distinction between enabled entry and disabled entry of
> > > MSI-X table.
> > 
> > That's my point. There's no such thing as 'enabled entries'
> > in the spec. There are only masked and unmasked entries.
> >
> > Current interface deals with gsi numbers so qemu had to work around
> > this. The hack used there is removing gsi for masked vector which has 0
> > address and data.  It works because this is what linux and windows
> > guests happen to do, but it is out of spec: vector/data value for a
> > masked entry have no meaning.
> 
> Well, I just realized something unnatural about the 0 contained data/address 
> entry. So I checked spec again, and found the mask bit should be set after reset. 
> So after fix this, I think unmasked 0 address/data entry shouldn't be there 
> anymore.

You are right that this 0 check is completely out of spec.  But see
below. The issue is that the spec does not require you to mask an entry if
the device does not use it. And we do not want to waste host vectors.
One can imagine logic where we detect that an interrupt
has not been used in a long while, mask it and give up
its host vector, then check the pending bit once in a while to
see whether the device has started using it.
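
Checking the pending bit is cheap - a sketch, assuming the PBA has been
ioremapped at pba_base and is read DWORD-wise:

	u32 dw = readl(pba_base + (entry / 32) * 4);
	int pending = dw & (1u << (entry % 32));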

> > Since you are building a new interface, can design it without
> > constraints...
> 
> A constraint is pci_enable_msix().

It's an internal kernel API. No one prevents us from changing it.

> We have to use it to allocate irq for each 
> entry, as well as program the entry in the real hardware. pci_enable_msix() is 
> only a yes/no choice. We can't add new enabled entries after pci_enable_msix(), 

With the current APIs.

> and we can only enable/disable/mask/unmask one IRQ through kernel API, not the 
> entry in the MSI-X table. And we still have to allocate new IRQ for new entry.  
> When guest unmask "disabled entry", we have to disable and enable MSI-X again in 
> order to use the new entry. That's why "enabled/disabled entry" concept existed.
> 
> So even guest only unmasked one entry, it's a totally different work for KVM 
> underlaying. This logic won't change no matter where the MMIO handling is. And in 
> fact I don't like this kind of tricky things in kernel...

A more fundamental problem is that host vectors are a limited resource;
we don't want to waste them on entries that will end up unused.
One can imagine some kind of logic where we check the pending
bit on a masked entry and after a while give up the host vector.

This is what I said: making it spec compliant is harder as it will
need core kernel changes. Still, it seems silly to design a
kernel/userspace API around an internal API limitation ...
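
For example, nothing stops us (other than the work involved) from adding
something like the following to the core - purely hypothetical prototypes,
nothing like this exists today:

	/* allocate and program a single entry after MSI-X is already enabled */
	int pci_enable_msix_entry(struct pci_dev *dev, struct msix_entry *entry);
	void pci_disable_msix_entry(struct pci_dev *dev, struct msix_entry *entry);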

> > > The entries we had used for pci_enable_msix()(not necessary in sequence
> > > number) are already enabled, the others are disabled. When device's MSI-X
> > > is enabled and guest want to enable an disabled entry, we would go back
> > > to QEmu because this vector didn't exist in the routing table. Also due
> > > to pci_enable_msix() in kernel didn't allow us to enable vectors one by
> > > one, but all at once. So we have to disable MSI-X first, then enable it
> > > with new entries, which contained the new vector guest want to use. This
> > > situation is only happen when device is being initialized. After that,
> > > kernel can know and handle the mask bit of the enabled entry.
> > > 
> > > I've also considered handle all MMIO operation in kernel, and changing
> > > irq routing in kernel directly. But as long as irq routing is owned by
> > > QEmu, I think it's better to leave to it...
> > 
> > Yes, this is my suggestion, except we don't need no routing :)
> > To inject MSI you just need address/data pairs.
> > Look at kvm_set_msi: given address/data you can just inject
> > the interrupt. No need for table lookups.
> 
> You still need to look up data/address pair in the guest MSI-X table. The routing 
> table used here is just an replacement for the table, because we can construct the 
> entry according to the routing table. Two choices, using the routing table, or 
> creating an new MSI-X table. 
> 
> Still, the key is about who to own the routing/MSI-X table. If kernel own it, it 
> would be straightforward to intercept all the MMIO in the kernel; but if it's 
> QEmu, we still need go back to QEmu for it.

Looks cleaner to do it in kernel...

> > > Notice the mask/unmask bits must be handled together, either in kernel or
> > > in userspace. Because if kernel has handled enabled vector's mask bit
> > > directly, it would be unsync with QEmu's records. It doesn't matter when
> > > QEmu don't access the related record. And the only place QEmu want to
> > > consult it's enabled entries' mask bit state is writing to MSI
> > > addr/data. The writing should be discarded if the entry is unmasked.
> > > This checking has already been done by kernel in this patchset, so we
> > > are fine here.
> > > 
> > > If we want to access the enabled entries' mask bit in the future, we can
> > > directly access device's MMIO.
> > 
> > We really must implement this for correctness, btw. If you do not pass
> > reads to the device, messages intended for the masked entry
> > might still be in flight.
> 
> Oh, yes, kernel would also mask the device as well. I would take this into 
> consideration.
> 
> --
> regards
> Yang, Sheng
> 
> > 
> > > That's the reason why I have followed Michael's advice to use
> > > mask/unmask directly.
> > > Hope this would make the patches more clear. I meant to add comments for
> > > this changeset, but miss it later.
> > > 
> > > --
> > > regards
> > > Yang, Sheng
> > > 
> > > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > > ---
> > > > 
> > > >  Documentation/kvm/api.txt |   22 ++++++++++++++++
> > > >  arch/x86/kvm/x86.c        |    6 ++++
> > > >  include/linux/kvm.h       |    8 +++++-
> > > >  virt/kvm/assigned-dev.c   |   60
> > > > 
> > > > +++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 
> 95
> > > > insertions(+), 1 deletions(-)
> > > > 
> > > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > > > index d82d637..f324a50 100644
> > > > --- a/Documentation/kvm/api.txt
> > > > +++ b/Documentation/kvm/api.txt
> > > > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> > > > 
> > > >  If any additional field gets added to this structure later on, a bit
> > > >  for
> > > > 
> > > > that additional piece of information will be set in the flags bitmap.
> > > > 
> > > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > > > +
> > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > +Architectures: x86
> > > > +Type: vm ioctl
> > > > +Parameters: struct kvm_assigned_msix_mmio (in)
> > > > +Returns: 0 on success, !0 on error
> > > > +
> > > > +struct kvm_assigned_msix_mmio {
> > > > +	/* Assigned device's ID */
> > > > +	__u32 assigned_dev_id;
> > > > +	/* MSI-X table MMIO address */
> > > > +	__u64 base_addr;
> > > > +	/* Must be 0 */
> > > > +	__u32 flags;
> > > > +	/* Must be 0, reserved for future use */
> > > > +	__u64 reserved;
> > > > +};
> > > > +
> > > > +This ioctl would enable in-kernel MSI-X emulation, which would handle
> > > > MSI-X +mask bit in the kernel.
> > > > +
> > > > 
> > > >  5. The kvm_run structure
> > > >  
> > > >  Application code obtains a pointer to the kvm_run structure by
> > > > 
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index fc62546..ba07a2f 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > 
> > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > >  	case KVM_CAP_XSAVE:
> > > > 
> > > >  	case KVM_CAP_ENABLE_CAP:
> > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > 
> > > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > > >  		r = 1;
> > > >  		break;
> > > >  	
> > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct
> > > > kvm_vcpu *vcpu, return -EINVAL;
> > > > 
> > > >  	switch (cap->cap) {
> > > > 
> > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > +		vcpu->kvm->arch.msix_flags_enabled = true;
> > > > +		r = 0;
> > > > +		break;
> > > > 
> > > >  	default:
> > > >  		r = -EINVAL;
> > > >  		break;
> > > > 
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index 0a7bd34..1494ed0 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
> > > > 
> > > >  #endif
> > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > 
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +#define KVM_CAP_DEVICE_MSIX_EXT 59
> > > > +#define KVM_CAP_DEVICE_MSIX_MASK 60
> > > > +#endif
> > > > 
> > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > 
> > > > @@ -671,6 +675,8 @@ struct kvm_clock_data {
> > > > 
> > > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > > > 
> > > > kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO, 
> > > > 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK            
> > > > _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define
> > > > KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO,  0x7d, \
> > > > +					struct kvm_assigned_msix_mmio)
> > > > 
> > > >  /* Available with KVM_CAP_PIT_STATE2 */
> > > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > > > 
> > > > kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> > > > struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct
> > > > kvm_assigned_msix_mmio {
> > > > 
> > > >  	__u32 assigned_dev_id;
> > > >  	__u64 base_addr;
> > > >  	__u32 flags;
> > > > 
> > > > -	__u32 reserved[2];
> > > > +	__u64 reserved;
> > > > 
> > > >  };
> > > >  
> > > >  #endif /* __LINUX_KVM_H */
> > > > 
> > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > index 5d2adc4..9573194 100644
> > > > --- a/virt/kvm/assigned-dev.c
> > > > +++ b/virt/kvm/assigned-dev.c
> > > > @@ -17,6 +17,8 @@
> > > > 
> > > >  #include <linux/pci.h>
> > > >  #include <linux/interrupt.h>
> > > >  #include <linux/slab.h>
> > > > 
> > > > +#include <linux/irqnr.h>
> > > > +
> > > > 
> > > >  #include "irq.h"
> > > >  
> > > >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct
> > > > 
> > > > list_head *head, @@ -169,6 +171,14 @@ static void
> > > > deassign_host_irq(struct kvm *kvm, */
> > > > 
> > > >  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> > > >  	
> > > >  		int i;
> > > > 
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +		if (assigned_dev->msix_mmio_base) {
> > > > +			mutex_lock(&kvm->slots_lock);
> > > > +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > > +					&assigned_dev->msix_mmio_dev);
> > > > +			mutex_unlock(&kvm->slots_lock);
> > > > +		}
> > > > +#endif
> > > > 
> > > >  		for (i = 0; i < assigned_dev->entries_nr; i++)
> > > >  		
> > > >  			disable_irq_nosync(assigned_dev->
> > > >  			
> > > >  					   host_msix_entries[i].vector);
> > > > 
> > > > @@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct
> > > > kvm *kvm, goto err;
> > > > 
> > > >  	}
> > > > 
> > > > +	if (dev->msix_mmio_base) {
> > > > +		mutex_lock(&kvm->slots_lock);
> > > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > +				&dev->msix_mmio_dev);
> > > > +		mutex_unlock(&kvm->slots_lock);
> > > > +		if (r)
> > > > +			goto err;
> > > > +	}
> > > > +
> > > > 
> > > >  	return 0;
> > > >  
> > > >  err:
> > > >  	for (i -= 1; i >= 0; i--)
> > > > 
> > > > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops
> > > > msix_mmio_ops = { .write    = msix_mmio_write,
> > > > 
> > > >  };
> > > > 
> > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > > +{
> > > > +	int r = 0;
> > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > +
> > > > +	mutex_lock(&kvm->lock);
> > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > +				      msix_mmio->assigned_dev_id);
> > > > +	if (!adev) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +	if (msix_mmio->base_addr == 0) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > > +
> > > > +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > +out:
> > > > +	mutex_unlock(&kvm->lock);
> > > > +
> > > > +	return r;
> > > > +}
> > > > 
> > > >  #endif
> > > >  
> > > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > 
> > > > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > *kvm, unsigned ioctl, goto out;
> > > > 
> > > >  		break;
> > > >  	
> > > >  	}
> > > > 
> > > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > > +
> > > > +		r = -EFAULT;
> > > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > +			goto out;
> > > > +
> > > > +		r = -EINVAL;
> > > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > +			goto out;
> > > > +
> > > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > +		if (r)
> > > > +			goto out;
> > > > +		break;
> > > > +	}
> > > > 
> > > >  #endif
> > > >  
> > > >  	}
> > > >  
> > > >  out:

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-22 10:17         ` Michael S. Tsirkin
@ 2010-10-22 13:30           ` Sheng Yang
  2010-10-22 14:32             ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-22 13:30 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Friday 22 October 2010 18:17:05 Michael S. Tsirkin wrote:
> On Fri, Oct 22, 2010 at 12:42:43PM +0800, Sheng Yang wrote:
> > On Thursday 21 October 2010 16:39:07 Michael S. Tsirkin wrote:
> > > On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote:
> > > > On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> > > > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > > 
> > > > The basic idea of kernel and QEmu's responsibilities are:
> > > > 
> > > > 1. Because QEmu owned the irq routing table, so the change of table
> > > > should still go to the QEmu, like we did in msix_mmio_write().
> > > > 
> > > > 2. And the others things can be done in kernel, for performance. Here
> > > > we covered the reading(converted entry from routing table and mask
> > > > bit state of enabled MSI-X entries), and writing the mask bit for
> > > > enabled MSI-X entries. Originally we only has mask bit handled in
> > > > kernel, but later we found that Linux kernel would read MSI-X mmio
> > > > just after every writing to mask bit, in order to flush the writing.
> > > > So we add reading MSI data/addr as well.
> > > > 
> > > > 3. Disabled entries's mask bit accessing would go to QEmu, because it
> > > > may result in disable/enable MSI-X. Explained later.
> > > > 
> > > > 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu
> > > > to decide enable/disable MSI-X for device.
> > > > .
> > > 
> > > Config space yes, but it's a simple global yes/no after all.
> > > 
> > > > 5. There is an distinction between enabled entry and disabled entry
> > > > of MSI-X table.
> > > 
> > > That's my point. There's no such thing as 'enabled entries'
> > > in the spec. There are only masked and unmasked entries.
> > > 
> > > Current interface deals with gsi numbers so qemu had to work around
> > > this. The hack used there is removing gsi for masked vector which has 0
> > > address and data.  It works because this is what linux and windows
> > > guests happen to do, but it is out of spec: vector/data value for a
> > > masked entry have no meaning.
> > 
> > Well, I just realized something unnatural about the 0 contained
> > data/address entry. So I checked spec again, and found the mask bit
> > should be set after reset. So after fix this, I think unmasked 0
> > address/data entry shouldn't be there anymore.
> 
> You are right that this 0 check is completely out of spec.  But see
> below. The issue is the spec does not require  you to mask an entry if
> the device does not use it. And we do not want to waste host vectors.
> One can imagine a logic where we would detect that an interrupt
> has not been used in a long while, mask it and give up
> a host vector. Then check the pending bit once in a while to
> see whether device started using it.

I don't think introducing this complex and speculative logic makes sense. I
haven't seen any such scenario yet. What's the issue with the current
implementation?
 
> > > Since you are building a new interface, can design it without
> > > constraints...
> > 
> > A constraint is pci_enable_msix().
> 
> It's an internal kernel API. No one prevents us from changing it.

You can say no one. But I'm afraid that if we want to overhaul this kind of core
PCI function, it may take months to get it checked in upstream - and that assumes
we can persuade them this overhaul is absolutely needed (I hope I'm wrong on
this). It may also mean we have to find another user for this kind of change. The
key issue is that I don't know what we can gain from it for certain. The current
disable/enable mechanism still works well. I don't know why we need to spend a lot
of effort on this just because the spec doesn't say there are "enabled/disabled
entries".

Yes, it's not that elegant, but we need to think carefully about whether the
effort is worth it.
> 
> > We have to use it to allocate irq for each
> > entry, as well as program the entry in the real hardware.
> > pci_enable_msix() is only a yes/no choice. We can't add new enabled
> > entries after pci_enable_msix(),
> 
> With the current APIs.
> 
> > and we can only enable/disable/mask/unmask one IRQ through kernel API,
> > not the entry in the MSI-X table. And we still have to allocate new IRQ
> > for new entry. When guest unmask "disabled entry", we have to disable
> > and enable MSI-X again in order to use the new entry. That's why
> > "enabled/disabled entry" concept existed.
> > 
> > So even guest only unmasked one entry, it's a totally different work for
> > KVM underlaying. This logic won't change no matter where the MMIO
> > handling is. And in fact I don't like this kind of tricky things in
> > kernel...
> 
> A more fundamental problem is that host vectors are a limited resource,
> we don't want to waste them on entries that will end up unused.
> One can imagine some kind of logic where we check the pending
> bit on a masked entry and after a while give up the host vector.

For the entries with data/address != 0, I haven't seen any of them left unused in
the end. So I don't understand your point here.
> 
> This is what I said: making it spec compliant is harder as it will
> need core kernel changes. Still, it seems silly to design a
> kerne/userspace API around an internal API limitation ...

I still don't think we violate the spec here - in which case do we fail to comply
with it? If some device wants to send a 0 message to address 0 on x86, it is the
one failing to comply with the x86 spec.

And I know the current implementation is not elegant due to the internal API
limitation, but the word "change" alone is not enough. A function to enable a
separate entry is of course good to have, but I still think we don't have enough
reason to do it.

--
regards
Yang, Sheng
> 
> > > > The entries we had used for pci_enable_msix()(not necessary in
> > > > sequence number) are already enabled, the others are disabled. When
> > > > device's MSI-X is enabled and guest want to enable an disabled
> > > > entry, we would go back to QEmu because this vector didn't exist in
> > > > the routing table. Also due to pci_enable_msix() in kernel didn't
> > > > allow us to enable vectors one by one, but all at once. So we have
> > > > to disable MSI-X first, then enable it with new entries, which
> > > > contained the new vector guest want to use. This situation is only
> > > > happen when device is being initialized. After that, kernel can know
> > > > and handle the mask bit of the enabled entry.
> > > > 
> > > > I've also considered handle all MMIO operation in kernel, and
> > > > changing irq routing in kernel directly. But as long as irq routing
> > > > is owned by QEmu, I think it's better to leave to it...
> > > 
> > > Yes, this is my suggestion, except we don't need no routing :)
> > > To inject MSI you just need address/data pairs.
> > > Look at kvm_set_msi: given address/data you can just inject
> > > the interrupt. No need for table lookups.
> > 
> > You still need to look up data/address pair in the guest MSI-X table. The
> > routing table used here is just an replacement for the table, because we
> > can construct the entry according to the routing table. Two choices,
> > using the routing table, or creating an new MSI-X table.
> > 
> > Still, the key is about who to own the routing/MSI-X table. If kernel own
> > it, it would be straightforward to intercept all the MMIO in the kernel;
> > but if it's QEmu, we still need go back to QEmu for it.
> 
> Looks cleaner to do it in kernel...
> 
> > > > Notice the mask/unmask bits must be handled together, either in
> > > > kernel or in userspace. Because if kernel has handled enabled
> > > > vector's mask bit directly, it would be unsync with QEmu's records.
> > > > It doesn't matter when QEmu don't access the related record. And the
> > > > only place QEmu want to consult it's enabled entries' mask bit state
> > > > is writing to MSI addr/data. The writing should be discarded if the
> > > > entry is unmasked. This checking has already been done by kernel in
> > > > this patchset, so we are fine here.
> > > > 
> > > > If we want to access the enabled entries' mask bit in the future, we
> > > > can directly access device's MMIO.
> > > 
> > > We really must implement this for correctness, btw. If you do not pass
> > > reads to the device, messages intended for the masked entry
> > > might still be in flight.
> > 
> > Oh, yes, kernel would also mask the device as well. I would take this
> > into consideration.
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> > > > That's the reason why I have followed Michael's advice to use
> > > > mask/unmask directly.
> > > > Hope this would make the patches more clear. I meant to add comments
> > > > for this changeset, but miss it later.
> > > > 
> > > > --
> > > > regards
> > > > Yang, Sheng
> > > > 
> > > > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > > > ---
> > > > > 
> > > > >  Documentation/kvm/api.txt |   22 ++++++++++++++++
> > > > >  arch/x86/kvm/x86.c        |    6 ++++
> > > > >  include/linux/kvm.h       |    8 +++++-
> > > > >  virt/kvm/assigned-dev.c   |   60
> > > > > 
> > > > > +++++++++++++++++++++++++++++++++++++++++++++ 4 files 
changed,
> > 
> > 95
> > 
> > > > > insertions(+), 1 deletions(-)
> > > > > 
> > > > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > > > > index d82d637..f324a50 100644
> > > > > --- a/Documentation/kvm/api.txt
> > > > > +++ b/Documentation/kvm/api.txt
> > > > > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> > > > > 
> > > > >  If any additional field gets added to this structure later on, a
> > > > >  bit for
> > > > > 
> > > > > that additional piece of information will be set in the flags
> > > > > bitmap.
> > > > > 
> > > > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > > > > +
> > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > +Architectures: x86
> > > > > +Type: vm ioctl
> > > > > +Parameters: struct kvm_assigned_msix_mmio (in)
> > > > > +Returns: 0 on success, !0 on error
> > > > > +
> > > > > +struct kvm_assigned_msix_mmio {
> > > > > +	/* Assigned device's ID */
> > > > > +	__u32 assigned_dev_id;
> > > > > +	/* MSI-X table MMIO address */
> > > > > +	__u64 base_addr;
> > > > > +	/* Must be 0 */
> > > > > +	__u32 flags;
> > > > > +	/* Must be 0, reserved for future use */
> > > > > +	__u64 reserved;
> > > > > +};
> > > > > +
> > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > handle MSI-X +mask bit in the kernel.
> > > > > +
> > > > > 
> > > > >  5. The kvm_run structure
> > > > >  
> > > > >  Application code obtains a pointer to the kvm_run structure by
> > > > > 
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index fc62546..ba07a2f 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > 
> > > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > >  	case KVM_CAP_XSAVE:
> > > > > 
> > > > >  	case KVM_CAP_ENABLE_CAP:
> > > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > > 
> > > > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > > > >  		r = 1;
> > > > >  		break;
> > > > >  	
> > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct
> > > > > kvm_vcpu *vcpu, return -EINVAL;
> > > > > 
> > > > >  	switch (cap->cap) {
> > > > > 
> > > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > > +		vcpu->kvm->arch.msix_flags_enabled = true;
> > > > > +		r = 0;
> > > > > +		break;
> > > > > 
> > > > >  	default:
> > > > >  		r = -EINVAL;
> > > > >  		break;
> > > > > 
> > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > index 0a7bd34..1494ed0 100644
> > > > > --- a/include/linux/kvm.h
> > > > > +++ b/include/linux/kvm.h
> > > > > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
> > > > > 
> > > > >  #endif
> > > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > 
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +#define KVM_CAP_DEVICE_MSIX_EXT 59
> > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 60
> > > > > +#endif
> > > > > 
> > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > 
> > > > > @@ -671,6 +675,8 @@ struct kvm_clock_data {
> > > > > 
> > > > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > > > > 
> > > > > kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO,
> > > > > 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK
> > > > > _IOR(KVMIO, 0x7c, struct kvm_clock_data) +#define
> > > > > KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO,  0x7d, \
> > > > > +					struct kvm_assigned_msix_mmio)
> > > > > 
> > > > >  /* Available with KVM_CAP_PIT_STATE2 */
> > > > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > > > > 
> > > > > kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO, 
> > > > > 0xa0, struct kvm_pit_state2) @@ -802,7 +808,7 @@ struct
> > > > > kvm_assigned_msix_mmio {
> > > > > 
> > > > >  	__u32 assigned_dev_id;
> > > > >  	__u64 base_addr;
> > > > >  	__u32 flags;
> > > > > 
> > > > > -	__u32 reserved[2];
> > > > > +	__u64 reserved;
> > > > > 
> > > > >  };
> > > > >  
> > > > >  #endif /* __LINUX_KVM_H */
> > > > > 
> > > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > > index 5d2adc4..9573194 100644
> > > > > --- a/virt/kvm/assigned-dev.c
> > > > > +++ b/virt/kvm/assigned-dev.c
> > > > > @@ -17,6 +17,8 @@
> > > > > 
> > > > >  #include <linux/pci.h>
> > > > >  #include <linux/interrupt.h>
> > > > >  #include <linux/slab.h>
> > > > > 
> > > > > +#include <linux/irqnr.h>
> > > > > +
> > > > > 
> > > > >  #include "irq.h"
> > > > >  
> > > > >  static struct kvm_assigned_dev_kernel
> > > > >  *kvm_find_assigned_dev(struct
> > > > > 
> > > > > list_head *head, @@ -169,6 +171,14 @@ static void
> > > > > deassign_host_irq(struct kvm *kvm, */
> > > > > 
> > > > >  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> > > > >  	
> > > > >  		int i;
> > > > > 
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +		if (assigned_dev->msix_mmio_base) {
> > > > > +			mutex_lock(&kvm->slots_lock);
> > > > > +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > > > +					&assigned_dev->msix_mmio_dev);
> > > > > +			mutex_unlock(&kvm->slots_lock);
> > > > > +		}
> > > > > +#endif
> > > > > 
> > > > >  		for (i = 0; i < assigned_dev->entries_nr; i++)
> > > > >  		
> > > > >  			disable_irq_nosync(assigned_dev->
> > > > >  			
> > > > >  					   host_msix_entries[i].vector);
> > > > > 
> > > > > @@ -318,6 +328,15 @@ static int
> > > > > assigned_device_enable_host_msix(struct kvm *kvm, goto err;
> > > > > 
> > > > >  	}
> > > > > 
> > > > > +	if (dev->msix_mmio_base) {
> > > > > +		mutex_lock(&kvm->slots_lock);
> > > > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > > +				&dev->msix_mmio_dev);
> > > > > +		mutex_unlock(&kvm->slots_lock);
> > > > > +		if (r)
> > > > > +			goto err;
> > > > > +	}
> > > > > +
> > > > > 
> > > > >  	return 0;
> > > > >  
> > > > >  err:
> > > > >  	for (i -= 1; i >= 0; i--)
> > > > > 
> > > > > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops
> > > > > msix_mmio_ops = { .write    = msix_mmio_write,
> > > > > 
> > > > >  };
> > > > > 
> > > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > > > +{
> > > > > +	int r = 0;
> > > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > > +
> > > > > +	mutex_lock(&kvm->lock);
> > > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > > +				      msix_mmio->assigned_dev_id);
> > > > > +	if (!adev) {
> > > > > +		r = -EINVAL;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if (msix_mmio->base_addr == 0) {
> > > > > +		r = -EINVAL;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > > > +
> > > > > +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > > +out:
> > > > > +	mutex_unlock(&kvm->lock);
> > > > > +
> > > > > +	return r;
> > > > > +}
> > > > > 
> > > > >  #endif
> > > > >  
> > > > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > > 
> > > > > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > > *kvm, unsigned ioctl, goto out;
> > > > > 
> > > > >  		break;
> > > > >  	
> > > > >  	}
> > > > > 
> > > > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > > > +
> > > > > +		r = -EFAULT;
> > > > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > > +			goto out;
> > > > > +
> > > > > +		r = -EINVAL;
> > > > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > > +			goto out;
> > > > > +
> > > > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > > +		if (r)
> > > > > +			goto out;
> > > > > +		break;
> > > > > +	}
> > > > > 
> > > > >  #endif
> > > > >  
> > > > >  	}
> > > > >  
> > > > >  out:

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices
  2010-10-22 13:30           ` Sheng Yang
@ 2010-10-22 14:32             ` Michael S. Tsirkin
  0 siblings, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-22 14:32 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Fri, Oct 22, 2010 at 09:30:09PM +0800, Sheng Yang wrote:
> On Friday 22 October 2010 18:17:05 Michael S. Tsirkin wrote:
> > On Fri, Oct 22, 2010 at 12:42:43PM +0800, Sheng Yang wrote:
> > > On Thursday 21 October 2010 16:39:07 Michael S. Tsirkin wrote:
> > > > On Thu, Oct 21, 2010 at 04:30:02PM +0800, Sheng Yang wrote:
> > > > > On Wednesday 20 October 2010 16:26:32 Sheng Yang wrote:
> > > > > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > > > 
> > > > > The basic idea of kernel and QEmu's responsibilities are:
> > > > > 
> > > > > 1. Because QEmu owned the irq routing table, so the change of table
> > > > > should still go to the QEmu, like we did in msix_mmio_write().
> > > > > 
> > > > > 2. And the others things can be done in kernel, for performance. Here
> > > > > we covered the reading(converted entry from routing table and mask
> > > > > bit state of enabled MSI-X entries), and writing the mask bit for
> > > > > enabled MSI-X entries. Originally we only has mask bit handled in
> > > > > kernel, but later we found that Linux kernel would read MSI-X mmio
> > > > > just after every writing to mask bit, in order to flush the writing.
> > > > > So we add reading MSI data/addr as well.
> > > > > 
> > > > > 3. Disabled entries's mask bit accessing would go to QEmu, because it
> > > > > may result in disable/enable MSI-X. Explained later.
> > > > > 
> > > > > 4. Only QEmu has knowledge of PCI configuration space, so it's QEmu
> > > > > to decide enable/disable MSI-X for device.
> > > > > .
> > > > 
> > > > Config space yes, but it's a simple global yes/no after all.
> > > > 
> > > > > 5. There is an distinction between enabled entry and disabled entry
> > > > > of MSI-X table.
> > > > 
> > > > That's my point. There's no such thing as 'enabled entries'
> > > > in the spec. There are only masked and unmasked entries.
> > > > 
> > > > Current interface deals with gsi numbers so qemu had to work around
> > > > this. The hack used there is removing gsi for masked vector which has 0
> > > > address and data.  It works because this is what linux and windows
> > > > guests happen to do, but it is out of spec: vector/data value for a
> > > > masked entry have no meaning.
> > > 
> > > Well, I just realized something unnatural about the 0 contained
> > > data/address entry. So I checked spec again, and found the mask bit
> > > should be set after reset. So after fix this, I think unmasked 0
> > > address/data entry shouldn't be there anymore.
> > 
> > You are right that this 0 check is completely out of spec.  But see
> > below. The issue is the spec does not require  you to mask an entry if
> > the device does not use it. And we do not want to waste host vectors.
> > One can imagine a logic where we would detect that an interrupt
> > has not been used in a long while, mask it and give up
> > a host vector. Then check the pending bit once in a while to
> > see whether device started using it.
> 
> I don't think introducing this complex and speculative logic makes sense. I 
> haven't seen any alike scenario yet. What's the issue of current implemenation?

The main issue, I think, is that the current implementation assumes that when
MSI-X is enabled, all vectors that are masked or have a 0 value will not be
used anymore.  The spec says no such thing.

> > > > Since you are building a new interface, can design it without
> > > > constraints...
> > > 
> > > A constraint is pci_enable_msix().
> > 
> > It's an internal kernel API. No one prevents us from changing it.
> 
> You can say no one. But I'm afraid if we want to overhaul this kind of core PCI 
> functions, it may be take months to get it checked in upstream - also assume we 
> can persuade them this overhaul is absolutely needed (I hope I'm wrong on this). 
> This may also means we have to find out another user for this kind of change.

No, usually a single user is enough to add internal APIs :)
This is because we do not promise backwards compatibility there.

> The 
> key issue is I don't know what we can gain for certain from it. Current 
> disable/enable mechanism still works well.

Heh, given that things seem to work without mask support
at all, I am not sure what that means.

> I don't know why we need spend a lot of 
> effort on this just because spec don't say there is "enabled/disabled entries". 

I would like to see an implementation using 100% architectural features.
It's enough that we need to worry about device quirks. Relying on guests
to only use them in a specific way means that for each bug we'll have to
wonder whether the guest is doing something we did not expect.

If something is just too hard to implement, a temporary workaround
would be to still build an interface to allow this,
and then add a trace point in the kernel making this easy to detect.
Then we can fix this fully in the next version without
affecting userspace.
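
Such a trace point could be as simple as the following sketch (the name and
fields are made up for illustration):

	TRACE_EVENT(kvm_msix_out_of_spec,
		TP_PROTO(u32 dev_id, u32 entry),
		TP_ARGS(dev_id, entry),
		TP_STRUCT__entry(
			__field(u32, dev_id)
			__field(u32, entry)
		),
		TP_fast_assign(
			__entry->dev_id = dev_id;
			__entry->entry  = entry;
		),
		TP_printk("dev %u entry %u handled out of spec",
			  __entry->dev_id, __entry->entry)
	);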

> Yes it's not that elegant, but we need carefully think if the effort worth it.


What I think we should be careful about is making sure we avoid
making the kernel/userspace API depend on an internal kernel one. Supporting such
an API long term when the internal one changes will be much more painful.

> > 
> > > We have to use it to allocate irq for each
> > > entry, as well as program the entry in the real hardware.
> > > pci_enable_msix() is only a yes/no choice. We can't add new enabled
> > > entries after pci_enable_msix(),
> > 
> > With the current APIs.
> > 
> > > and we can only enable/disable/mask/unmask one IRQ through kernel API,
> > > not the entry in the MSI-X table. And we still have to allocate new IRQ
> > > for new entry. When guest unmask "disabled entry", we have to disable
> > > and enable MSI-X again in order to use the new entry. That's why
> > > "enabled/disabled entry" concept existed.
> > > 
> > > So even guest only unmasked one entry, it's a totally different work for
> > > KVM underlaying. This logic won't change no matter where the MMIO
> > > handling is. And in fact I don't like this kind of tricky things in
> > > kernel...
> > 
> > A more fundamental problem is that host vectors are a limited resource,
> > we don't want to waste them on entries that will end up unused.
> > One can imagine some kind of logic where we check the pending
> > bit on a masked entry and after a while give up the host vector.
> 
> For the data/address != 0 entries, I haven't seen any of them leave unused in the 
> end. So I don't understand your point here.

I'll try to give another example.

Imagine a guest driver loaded and using N vectors.  It gets unloaded,
then another driver is loaded using N-1 vectors.  Vector N is unused, but
its entry is != 0, so we will think it is used and will allocate an entry
for it.


> > 
> > This is what I said: making it spec compliant is harder as it will
> > need core kernel changes. Still, it seems silly to design a
> > kerne/userspace API around an internal API limitation ...
> 
> I still don't think we violate the spec here, in which case we fail to comply the 
> spec? If some devices want to send 0 message to 0 address on x86, it's fail to 
> comply the x86 spec.

The guest can write a non-0 value there and use the vector afterwards.
If we allocate vectors upfront we won't be able to support this.

In other words, it works, but it's just a heuristic. It's nice that it
usually works on x86, but we are just lucky, and I suspect it will fail
in strange scenarios such as a driver change, because it is not
architectural in the spec.



> And I know the current implementation is not elegant due to internal API 
> limitation, but only the word "change" is not enough. Function to enable separate 
> entry is of course good to have, but I still think we don't have enough reason for 
> doing it. 

The implementation is one thing, the kernel/user API is another.
I think we should try to make the latter futureproof,
but the implementation can be partial and have some TODO
items; it's okay, and even a good idea, to build things step by step.

> --
> regards
> Yang, Sheng
> > 
> > > > > The entries we had used for pci_enable_msix()(not necessary in
> > > > > sequence number) are already enabled, the others are disabled. When
> > > > > device's MSI-X is enabled and guest want to enable an disabled
> > > > > entry, we would go back to QEmu because this vector didn't exist in
> > > > > the routing table. Also due to pci_enable_msix() in kernel didn't
> > > > > allow us to enable vectors one by one, but all at once. So we have
> > > > > to disable MSI-X first, then enable it with new entries, which
> > > > > contained the new vector guest want to use. This situation is only
> > > > > happen when device is being initialized. After that, kernel can know
> > > > > and handle the mask bit of the enabled entry.
> > > > > 
> > > > > I've also considered handle all MMIO operation in kernel, and
> > > > > changing irq routing in kernel directly. But as long as irq routing
> > > > > is owned by QEmu, I think it's better to leave to it...
> > > > 
> > > > Yes, this is my suggestion, except we don't need no routing :)
> > > > To inject MSI you just need address/data pairs.
> > > > Look at kvm_set_msi: given address/data you can just inject
> > > > the interrupt. No need for table lookups.
> > > 
> > > You still need to look up data/address pair in the guest MSI-X table. The
> > > routing table used here is just an replacement for the table, because we
> > > can construct the entry according to the routing table. Two choices,
> > > using the routing table, or creating an new MSI-X table.
> > > 
> > > Still, the key is about who to own the routing/MSI-X table. If kernel own
> > > it, it would be straightforward to intercept all the MMIO in the kernel;
> > > but if it's QEmu, we still need go back to QEmu for it.
> > 
> > Looks cleaner to do it in kernel...
> > 
> > > > > Notice the mask/unmask bits must be handled together, either in
> > > > > kernel or in userspace. Because if kernel has handled enabled
> > > > > vector's mask bit directly, it would be unsync with QEmu's records.
> > > > > It doesn't matter when QEmu don't access the related record. And the
> > > > > only place QEmu want to consult it's enabled entries' mask bit state
> > > > > is writing to MSI addr/data. The writing should be discarded if the
> > > > > entry is unmasked. This checking has already been done by kernel in
> > > > > this patchset, so we are fine here.
> > > > > 
> > > > > If we want to access the enabled entries' mask bit in the future, we
> > > > > can directly access device's MMIO.
> > > > 
> > > > We really must implement this for correctness, btw. If you do not pass
> > > > reads to the device, messages intended for the masked entry
> > > > might still be in flight.
> > > 
> > > Oh, yes, kernel would also mask the device as well. I would take this
> > > into consideration.
> > > 
> > > --
> > > regards
> > > Yang, Sheng
> > > 
> > > > > That's the reason why I have followed Michael's advice to use
> > > > > mask/unmask directly.
> > > > > Hope this would make the patches more clear. I meant to add comments
> > > > > for this changeset, but miss it later.
> > > > > 
> > > > > --
> > > > > regards
> > > > > Yang, Sheng
> > > > > 
> > > > > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > > > > ---
> > > > > > 
> > > > > >  Documentation/kvm/api.txt |   22 ++++++++++++++++
> > > > > >  arch/x86/kvm/x86.c        |    6 ++++
> > > > > >  include/linux/kvm.h       |    8 +++++-
> > > > > >  virt/kvm/assigned-dev.c   |   60
> > > > > > 
> > > > > > +++++++++++++++++++++++++++++++++++++++++++++ 4 files 
> changed,
> > > 
> > > 95
> > > 
> > > > > > insertions(+), 1 deletions(-)
> > > > > > 
> > > > > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > > > > > index d82d637..f324a50 100644
> > > > > > --- a/Documentation/kvm/api.txt
> > > > > > +++ b/Documentation/kvm/api.txt
> > > > > > @@ -1087,6 +1087,28 @@ of 4 instructions that make up a hypercall.
> > > > > > 
> > > > > >  If any additional field gets added to this structure later on, a
> > > > > >  bit for
> > > > > > 
> > > > > > that additional piece of information will be set in the flags
> > > > > > bitmap.
> > > > > > 
> > > > > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > > > > > +
> > > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > > +Architectures: x86
> > > > > > +Type: vm ioctl
> > > > > > +Parameters: struct kvm_assigned_msix_mmio (in)
> > > > > > +Returns: 0 on success, !0 on error
> > > > > > +
> > > > > > +struct kvm_assigned_msix_mmio {
> > > > > > +	/* Assigned device's ID */
> > > > > > +	__u32 assigned_dev_id;
> > > > > > +	/* MSI-X table MMIO address */
> > > > > > +	__u64 base_addr;
> > > > > > +	/* Must be 0 */
> > > > > > +	__u32 flags;
> > > > > > +	/* Must be 0, reserved for future use */
> > > > > > +	__u64 reserved;
> > > > > > +};
> > > > > > +
> > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
> > > > > > +mask bit in the kernel.
> > > > > > +
> > > > > > 
> > > > > >  5. The kvm_run structure
> > > > > >  
> > > > > >  Application code obtains a pointer to the kvm_run structure by
> > > > > > 
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index fc62546..ba07a2f 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -1927,6 +1927,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > > 
> > > > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > > >  	case KVM_CAP_XSAVE:
> > > > > > 
> > > > > >  	case KVM_CAP_ENABLE_CAP:
> > > > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > > > 
> > > > > > +	case KVM_CAP_DEVICE_MSIX_MASK:
> > > > > >  		r = 1;
> > > > > >  		break;
> > > > > >  	
> > > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > > @@ -2717,6 +2719,10 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> > > > > >  		return -EINVAL;
> > > > > > 
> > > > > >  	switch (cap->cap) {
> > > > > > 
> > > > > > +	case KVM_CAP_DEVICE_MSIX_EXT:
> > > > > > +		vcpu->kvm->arch.msix_flags_enabled = true;
> > > > > > +		r = 0;
> > > > > > +		break;
> > > > > > 
> > > > > >  	default:
> > > > > >  		r = -EINVAL;
> > > > > >  		break;
> > > > > > 
> > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > > index 0a7bd34..1494ed0 100644
> > > > > > --- a/include/linux/kvm.h
> > > > > > +++ b/include/linux/kvm.h
> > > > > > @@ -540,6 +540,10 @@ struct kvm_ppc_pvinfo {
> > > > > > 
> > > > > >  #endif
> > > > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > > 
> > > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > > +#define KVM_CAP_DEVICE_MSIX_EXT 59
> > > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 60
> > > > > > +#endif
> > > > > > 
> > > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > > 
> > > > > > @@ -671,6 +675,8 @@ struct kvm_clock_data {
> > > > > > 
> > > > > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
> > > > > >  #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
> > > > > >  #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> > > > > > +#define KVM_ASSIGN_REG_MSIX_MMIO  _IOW(KVMIO,  0x7d, \
> > > > > > +					struct kvm_assigned_msix_mmio)
> > > > > > 
> > > > > >  /* Available with KVM_CAP_PIT_STATE2 */
> > > > > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
> > > > > >  #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
> > > > > > @@ -802,7 +808,7 @@ struct kvm_assigned_msix_mmio {
> > > > > > 
> > > > > >  	__u32 assigned_dev_id;
> > > > > >  	__u64 base_addr;
> > > > > >  	__u32 flags;
> > > > > > 
> > > > > > -	__u32 reserved[2];
> > > > > > +	__u64 reserved;
> > > > > > 
> > > > > >  };
> > > > > >  
> > > > > >  #endif /* __LINUX_KVM_H */
> > > > > > 
> > > > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > > > index 5d2adc4..9573194 100644
> > > > > > --- a/virt/kvm/assigned-dev.c
> > > > > > +++ b/virt/kvm/assigned-dev.c
> > > > > > @@ -17,6 +17,8 @@
> > > > > > 
> > > > > >  #include <linux/pci.h>
> > > > > >  #include <linux/interrupt.h>
> > > > > >  #include <linux/slab.h>
> > > > > > 
> > > > > > +#include <linux/irqnr.h>
> > > > > > +
> > > > > > 
> > > > > >  #include "irq.h"
> > > > > >  
> > > > > >  static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
> > > > > > @@ -169,6 +171,14 @@ static void deassign_host_irq(struct kvm *kvm,
> > > > > >  	 */
> > > > > > 
> > > > > >  	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> > > > > >  	
> > > > > >  		int i;
> > > > > > 
> > > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > > +		if (assigned_dev->msix_mmio_base) {
> > > > > > +			mutex_lock(&kvm->slots_lock);
> > > > > > +			kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > > > > +					&assigned_dev->msix_mmio_dev);
> > > > > > +			mutex_unlock(&kvm->slots_lock);
> > > > > > +		}
> > > > > > +#endif
> > > > > > 
> > > > > >  		for (i = 0; i < assigned_dev->entries_nr; i++)
> > > > > >  		
> > > > > >  			disable_irq_nosync(assigned_dev->
> > > > > >  			
> > > > > >  					   host_msix_entries[i].vector);
> > > > > > 
> > > > > > @@ -318,6 +328,15 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
> > > > > >  			goto err;
> > > > > > 
> > > > > >  	}
> > > > > > 
> > > > > > +	if (dev->msix_mmio_base) {
> > > > > > +		mutex_lock(&kvm->slots_lock);
> > > > > > +		r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > > > +				&dev->msix_mmio_dev);
> > > > > > +		mutex_unlock(&kvm->slots_lock);
> > > > > > +		if (r)
> > > > > > +			goto err;
> > > > > > +	}
> > > > > > +
> > > > > > 
> > > > > >  	return 0;
> > > > > >  
> > > > > >  err:
> > > > > >  	for (i -= 1; i >= 0; i--)
> > > > > > 
> > > > > > @@ -870,6 +889,31 @@ static const struct kvm_io_device_ops msix_mmio_ops = {
> > > > > >  	.write    = msix_mmio_write,
> > > > > > 
> > > > > >  };
> > > > > > 
> > > > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > > > +				struct kvm_assigned_msix_mmio *msix_mmio)
> > > > > > +{
> > > > > > +	int r = 0;
> > > > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > > > +
> > > > > > +	mutex_lock(&kvm->lock);
> > > > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > > > +				      msix_mmio->assigned_dev_id);
> > > > > > +	if (!adev) {
> > > > > > +		r = -EINVAL;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +	if (msix_mmio->base_addr == 0) {
> > > > > > +		r = -EINVAL;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +	adev->msix_mmio_base = msix_mmio->base_addr;
> > > > > > +
> > > > > > +	kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > > > +out:
> > > > > > +	mutex_unlock(&kvm->lock);
> > > > > > +
> > > > > > +	return r;
> > > > > > +}
> > > > > > 
> > > > > >  #endif
> > > > > >  
> > > > > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > > > 
> > > > > > @@ -982,6 +1026,22 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > > >  			goto out;
> > > > > >  		break;
> > > > > >  	
> > > > > >  	}
> > > > > > 
> > > > > > +	case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > > > +		struct kvm_assigned_msix_mmio msix_mmio;
> > > > > > +
> > > > > > +		r = -EFAULT;
> > > > > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > > > +			goto out;
> > > > > > +
> > > > > > +		r = -EINVAL;
> > > > > > +		if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > > > +			goto out;
> > > > > > +
> > > > > > +		r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > > > +		if (r)
> > > > > > +			goto out;
> > > > > > +		break;
> > > > > > +	}
> > > > > > 
> > > > > >  #endif
> > > > > >  
> > > > > >  	}
> > > > > >  
> > > > > >  out:

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace
  2010-10-20  8:26 ` [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace Sheng Yang
  2010-10-20  9:30   ` Avi Kivity
@ 2010-10-22 14:53   ` Marcelo Tosatti
  2010-10-24 12:19     ` Sheng Yang
  1 sibling, 1 reply; 66+ messages in thread
From: Marcelo Tosatti @ 2010-10-22 14:53 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, kvm, Michael S. Tsirkin

On Wed, Oct 20, 2010 at 04:26:30PM +0800, Sheng Yang wrote:
> The feature won't be enabled until a later patch sets msix_flags_enabled. It
> will be enabled along with mask support in the kernel.
> 
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 ++
>  include/linux/kvm.h             |    6 +++++-
>  include/linux/kvm_host.h        |    1 +
>  virt/kvm/assigned-dev.c         |   39 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 47 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e209078..2bb69ba 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -456,6 +456,8 @@ struct kvm_arch {
>  	/* fields used by HYPER-V emulation */
>  	u64 hv_guest_os_id;
>  	u64 hv_hypercall;
> +
> +	bool msix_flags_enabled;
>  };
>  
>  struct kvm_vm_stat {
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 919ae53..a699ec9 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -787,11 +787,15 @@ struct kvm_assigned_msix_nr {
>  };
>  
>  #define KVM_MAX_MSIX_PER_DEV		256
> +
> +#define KVM_MSIX_FLAG_MASK	1
> +
>  struct kvm_assigned_msix_entry {
>  	__u32 assigned_dev_id;
>  	__u32 gsi;
>  	__u16 entry; /* The index of entry in the MSI-X table */
> -	__u16 padding[3];
> +	__u16 flags;
> +	__u16 padding[2];
>  };
>  
>  #endif /* __LINUX_KVM_H */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 30f83cd..81a6284 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -438,6 +438,7 @@ struct kvm_irq_ack_notifier {
>  };
>  
>  #define KVM_ASSIGNED_MSIX_PENDING		0x1
> +#define KVM_ASSIGNED_MSIX_MASK			0x2
>  struct kvm_guest_msix_entry {
>  	u32 vector;
>  	u16 entry;
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 7c98928..bf96ea7 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -666,11 +666,35 @@ msix_nr_out:
>  	return r;
>  }
>  
> +static void update_msix_mask(struct kvm_assigned_dev_kernel *assigned_dev,
> +			     int index)
> +{
> +	int irq;
> +	struct irq_desc *desc;
> +
> +	if (!assigned_dev->dev->msix_enabled ||
> +	    !(assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> +		return;
> +
> +	irq = assigned_dev->host_msix_entries[index].vector;
> +	BUG_ON(irq == 0);
> +	desc = irq_to_desc(irq);
> +	BUG_ON(!desc->msi_desc);
> +
> +	if (assigned_dev->guest_msix_entries[index].flags &
> +			KVM_ASSIGNED_MSIX_MASK) {
> +		desc->chip->mask(irq);
> +		flush_work(&assigned_dev->interrupt_work);
> +	} else
> +		desc->chip->unmask(irq);
> +}

Bypassing irq handling code like this is wrong (see all the state
keeping in kernel/irq/handle.c).

You need a guarantee that MSIX per-vector mask is used for
disable_irq/enable_irq, right? I can't see how this provides it.
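
For comparison, this is roughly what the core does on disable_irq() (paraphrased
from kernel/irq/manage.c of this era, not the exact code), none of which happens
when the chip method is called directly:

	/* disable_irq() -> __disable_irq(), roughly: */
	if (!desc->depth++) {			/* nesting count kept in irq_desc */
		desc->status |= IRQ_DISABLED;	/* flow handlers check this flag */
		desc->chip->disable(irq);
	}

	/* calling desc->chip->mask(irq) directly bypasses desc->depth,
	 * IRQ_DISABLED/IRQ_MASKED and the pending replay in handle_edge_irq() */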


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace
  2010-10-22 14:53   ` Marcelo Tosatti
@ 2010-10-24 12:19     ` Sheng Yang
  2010-10-24 12:23       ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Sheng Yang @ 2010-10-24 12:19 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Avi Kivity, kvm, Michael S. Tsirkin

On Friday 22 October 2010 22:53:13 Marcelo Tosatti wrote:
> On Wed, Oct 20, 2010 at 04:26:30PM +0800, Sheng Yang wrote:
> > The feature won't be enabled until a later patch sets msix_flags_enabled.
> > It will be enabled along with mask support in the kernel.
> > 
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > ---
> > 
> >  arch/x86/include/asm/kvm_host.h |    2 ++
> >  include/linux/kvm.h             |    6 +++++-
> >  include/linux/kvm_host.h        |    1 +
> >  virt/kvm/assigned-dev.c         |   39 +++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 47 insertions(+), 1 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index e209078..2bb69ba 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -456,6 +456,8 @@ struct kvm_arch {
> > 
> >  	/* fields used by HYPER-V emulation */
> >  	u64 hv_guest_os_id;
> >  	u64 hv_hypercall;
> > 
> > +
> > +	bool msix_flags_enabled;
> > 
> >  };
> >  
> >  struct kvm_vm_stat {
> > 
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 919ae53..a699ec9 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -787,11 +787,15 @@ struct kvm_assigned_msix_nr {
> > 
> >  };
> >  
> >  #define KVM_MAX_MSIX_PER_DEV		256
> > 
> > +
> > +#define KVM_MSIX_FLAG_MASK	1
> > +
> > 
> >  struct kvm_assigned_msix_entry {
> >  
> >  	__u32 assigned_dev_id;
> >  	__u32 gsi;
> >  	__u16 entry; /* The index of entry in the MSI-X table */
> > 
> > -	__u16 padding[3];
> > +	__u16 flags;
> > +	__u16 padding[2];
> > 
> >  };
> >  
> >  #endif /* __LINUX_KVM_H */
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 30f83cd..81a6284 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -438,6 +438,7 @@ struct kvm_irq_ack_notifier {
> > 
> >  };
> >  
> >  #define KVM_ASSIGNED_MSIX_PENDING		0x1
> > 
> > +#define KVM_ASSIGNED_MSIX_MASK			0x2
> > 
> >  struct kvm_guest_msix_entry {
> >  
> >  	u32 vector;
> >  	u16 entry;
> > 
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index 7c98928..bf96ea7 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > 
> > @@ -666,11 +666,35 @@ msix_nr_out:
> >  	return r;
> >  
> >  }
> > 
> > +static void update_msix_mask(struct kvm_assigned_dev_kernel *assigned_dev,
> > +			     int index)
> > +{
> > +	int irq;
> > +	struct irq_desc *desc;
> > +
> > +	if (!assigned_dev->dev->msix_enabled ||
> > +	    !(assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > +		return;
> > +
> > +	irq = assigned_dev->host_msix_entries[index].vector;
> > +	BUG_ON(irq == 0);
> > +	desc = irq_to_desc(irq);
> > +	BUG_ON(!desc->msi_desc);
> > +
> > +	if (assigned_dev->guest_msix_entries[index].flags &
> > +			KVM_ASSIGNED_MSIX_MASK) {
> > +		desc->chip->mask(irq);
> > +		flush_work(&assigned_dev->interrupt_work);
> > +	} else
> > +		desc->chip->unmask(irq);
> > +}
> 
> Bypassing irq handling code like this is wrong (see all the state
> keeping in kernel/irq/handle.c).

Would check it.
> 
> You need a guarantee that MSIX per-vector mask is used for
> disable_irq/enable_irq, right? I can't see how this provides it.

This one is meant to operate the mask/unmask bit of the MSI-X table directly, to
emulate the mask/unmask behavior that the guest wants. In the previous version I used
enable_irq()/disable_irq(), but they won't touch the MSI-X table unless it's necessary,
and since Michael wants to read the table from userspace, he prefers using mask/unmask
directly.
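
In other words, the difference is roughly the following (illustrative only, not
the exact v1 code):

	irq = assigned_dev->host_msix_entries[index].vector;

	/* v1: go through the generic irq layer; lazy disable may never touch
	 * the MSI-X vector control word at all */
	disable_irq(irq);
	enable_irq(irq);

	/* this version: drive the per-vector mask bit in the MSI-X table directly */
	desc = irq_to_desc(irq);
	desc->chip->mask(irq);
	desc->chip->unmask(irq);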

--
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace
  2010-10-24 12:19     ` Sheng Yang
@ 2010-10-24 12:23       ` Michael S. Tsirkin
  2010-10-28  8:21         ` Sheng Yang
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-10-24 12:23 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Marcelo Tosatti, Avi Kivity, kvm

On Sun, Oct 24, 2010 at 08:19:09PM +0800, Sheng Yang wrote:
> > 
> > You need a guarantee that MSIX per-vector mask is used for
> > disable_irq/enable_irq, right? I can't see how this provides it.
> 
> This one is meant to operate the mask/unmask bit of the MSI-X table directly, to
> emulate the mask/unmask behavior that the guest wants. In the previous version I used
> enable_irq()/disable_irq(), but they won't touch the MSI-X table unless it's necessary,
> and since Michael wants to read the table from userspace, he prefers using mask/unmask
> directly.

As I said, the main problem is really that the proposed implementation
only works for interrupts used by assigned devices.
I would like it to work for irqfd as well.

-- 
MST

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace
  2010-10-24 12:23       ` Michael S. Tsirkin
@ 2010-10-28  8:21         ` Sheng Yang
  0 siblings, 0 replies; 66+ messages in thread
From: Sheng Yang @ 2010-10-28  8:21 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Marcelo Tosatti, Avi Kivity, kvm

On Sunday 24 October 2010 20:23:20 Michael S. Tsirkin wrote:
> On Sun, Oct 24, 2010 at 08:19:09PM +0800, Sheng Yang wrote:
> > > You need a guarantee that MSIX per-vector mask is used for
> > > disable_irq/enable_irq, right? I can't see how this provides it.
> > 
> > This one is meant to operate the mask/unmask bit of the MSI-X table
> > directly, to emulate the mask/unmask behavior that the guest wants. In
> > the previous version I used enable_irq()/disable_irq(), but they won't
> > touch the MSI-X table unless it's necessary, and since Michael wants to
> > read the table from userspace, he prefers using mask/unmask directly.
> 
> As I said, the main problem is really that the proposed implementation
> only works for interrupts used by assigned devices.
> I would like it to work for irqfd as well.

I think we can't let QEmu access the mask or pending bits directly. It must ask the
kernel for that information if the kernel owns the mask.

That's because the mask/unmask operation the guest thinks it performs has, in fact,
nothing to do with what the host does. We may emulate it by doing the same thing on the
device, but there are really two layers here. We also know the host kernel disables and
enables interrupts according to its own mechanisms, e.g. it may disable an interrupt
temporarily if there are too many interrupts. What the host does should be transparent
to the guest, so directly accessing the data from the device should be prohibited.

The pending bit case is the same. The kernel knows which IRQ is pending (we could check
the IRQ_PENDING bit of the desc), though we don't have such an interface now. We can
add one in the future if it's necessary.

I'm proposing a new interface like "kvm_get_msix_entry" to return the mask bit of a
specific entry. Pending bit support can be added in the future if it's needed. But in
principle we can't directly access the MSI-X table/PBA.
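
To make the idea concrete, it could look something like this (the names, struct
layout and ioctl number are placeholders only, nothing is decided yet):

/* placeholder sketch, not a real ABI */
#define KVM_MSIX_ENTRY_MASKED	(1 << 0)

struct kvm_msix_entry_state {
	__u32 assigned_dev_id;
	__u32 entry;	/* index into the MSI-X table */
	__u32 flags;	/* out: e.g. KVM_MSIX_ENTRY_MASKED */
	__u32 padding;
};

/* vm ioctl, number to be assigned:
 * #define KVM_GET_MSIX_ENTRY _IOWR(KVMIO, ..., struct kvm_msix_entry_state)
 */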

--
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 66+ messages in thread

Thread overview: 66+ messages
2010-10-20  8:26 [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Sheng Yang
2010-10-20  8:26 ` [PATCH 1/8] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
2010-10-20 11:07   ` Matthew Wilcox
2010-10-20  8:26 ` [PATCH 2/8] irq: Export irq_to_desc() to modules Sheng Yang
2010-10-20  8:26 ` [PATCH 3/8] KVM: x86: Enable ENABLE_CAP capability for x86 Sheng Yang
2010-10-20  8:26 ` [PATCH 4/8] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
2010-10-20  8:26 ` [PATCH 5/8] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
2010-10-20  8:53   ` Avi Kivity
2010-10-20  8:58     ` Sheng Yang
2010-10-20  9:13     ` Sheng Yang
2010-10-20  9:17       ` Sheng Yang
2010-10-20  9:32         ` Avi Kivity
2010-10-20  8:26 ` [PATCH 6/8] KVM: assigned dev: Preparation for mask support in userspace Sheng Yang
2010-10-20  9:30   ` Avi Kivity
2010-10-22 14:53   ` Marcelo Tosatti
2010-10-24 12:19     ` Sheng Yang
2010-10-24 12:23       ` Michael S. Tsirkin
2010-10-28  8:21         ` Sheng Yang
2010-10-20  8:26 ` [PATCH 7/8] KVM: assigned dev: Introduce io_device for MSI-X MMIO accessing Sheng Yang
2010-10-20  9:46   ` Avi Kivity
2010-10-20 10:33     ` Michael S. Tsirkin
2010-10-21  6:46     ` Sheng Yang
2010-10-21  9:27       ` Avi Kivity
2010-10-21  9:24         ` Michael S. Tsirkin
2010-10-21  9:47           ` Avi Kivity
2010-10-21 10:51             ` Michael S. Tsirkin
2010-10-20 22:35   ` Michael S. Tsirkin
2010-10-21  7:44     ` Sheng Yang
2010-10-20  8:26 ` [PATCH 8/8] KVM: Emulation MSI-X mask bits for assigned devices Sheng Yang
2010-10-20  9:49   ` Avi Kivity
2010-10-20 22:24   ` Michael S. Tsirkin
2010-10-21  8:30   ` Sheng Yang
2010-10-21  8:39     ` Michael S. Tsirkin
2010-10-22  4:42       ` Sheng Yang
2010-10-22 10:17         ` Michael S. Tsirkin
2010-10-22 13:30           ` Sheng Yang
2010-10-22 14:32             ` Michael S. Tsirkin
2010-10-20  9:51 ` [PATCH 0/8][v2] MSI-X mask emulation support for assigned device Avi Kivity
2010-10-20 10:44   ` Michael S. Tsirkin
2010-10-20 10:59     ` Avi Kivity
2010-10-20 13:43       ` Michael S. Tsirkin
2010-10-20 14:58         ` Alex Williamson
2010-10-20 14:58           ` Michael S. Tsirkin
2010-10-20 15:12             ` Alex Williamson
2010-10-20 15:17         ` Avi Kivity
2010-10-20 15:22           ` Alex Williamson
2010-10-20 15:26             ` Avi Kivity
2010-10-20 15:38               ` Alex Williamson
2010-10-20 14:47     ` Alex Williamson
2010-10-20 14:46       ` Michael S. Tsirkin
2010-10-20 15:07         ` Alex Williamson
2010-10-20 15:13           ` Michael S. Tsirkin
2010-10-20 20:13             ` Alex Williamson
2010-10-20 22:06               ` Michael S. Tsirkin
2010-10-20 15:23       ` Avi Kivity
2010-10-20 15:38         ` Alex Williamson
2010-10-20 15:54           ` Avi Kivity
2010-10-20 15:59             ` Michael S. Tsirkin
2010-10-20 16:13               ` Avi Kivity
2010-10-20 17:11                 ` Michael S. Tsirkin
2010-10-20 18:31               ` Alex Williamson
2010-10-21  7:41   ` Sheng Yang
2010-10-20 19:02 ` Marcelo Tosatti
2010-10-21  7:10   ` Sheng Yang
2010-10-21  8:21     ` Michael S. Tsirkin
2010-10-20 22:20 ` Michael S. Tsirkin
