* [PATCH v3 0/2] kvm: level irqfd and new eoifd
@ 2012-07-03 19:21 Alex Williamson
  2012-07-03 19:21 ` [PATCH v3 1/2] kvm: Extend irqfd to support level interrupts Alex Williamson
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Alex Williamson @ 2012-07-03 19:21 UTC (permalink / raw)
  To: avi, mst, gleb; +Cc: kvm, linux-kernel, jan.kiszka

Here's the latest iteration of adding an interface to assert and
de-assert level interrupts from external drivers like vfio.  These
apply on top of the previous argument cleanup, documentation, and
sanitization patches for irqfd.  It would be great to get this queued
in next for Linux 3.6.

I believe I've addressed all the previous comments, including fixing
the locking problems in eoifd.  I've run this with lockdep enabled while
adding and removing level irqfd/eoifd pairs without any problems.  Please
let me know if there are any further comments.  Thanks,

Alex

---

Alex Williamson (2):
      kvm: KVM_EOIFD, an eventfd for EOIs
      kvm: Extend irqfd to support level interrupts


 Documentation/virtual/kvm/api.txt |   27 +++
 arch/x86/kvm/x86.c                |    2 
 include/linux/kvm.h               |   17 ++
 include/linux/kvm_host.h          |   13 ++
 virt/kvm/eventfd.c                |  307 +++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c               |   11 +
 6 files changed, 373 insertions(+), 4 deletions(-)


* [PATCH v3 1/2] kvm: Extend irqfd to support level interrupts
  2012-07-03 19:21 [PATCH v3 0/2] kvm: level irqfd and new eoifd Alex Williamson
@ 2012-07-03 19:21 ` Alex Williamson
  2012-07-03 19:21 ` [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs Alex Williamson
  2012-07-11  9:53 ` [PATCH v3 0/2] kvm: level irqfd and new eoifd Avi Kivity
  2 siblings, 0 replies; 22+ messages in thread
From: Alex Williamson @ 2012-07-03 19:21 UTC (permalink / raw)
  To: avi, mst, gleb; +Cc: kvm, linux-kernel, jan.kiszka

In order to inject a level interrupt from an external source using an
irqfd, we need to allocate a new irq_source_id.  This allows us to
assert and (later) de-assert an interrupt line independently from
users of KVM_IRQ_LINE and avoid lost interrupts.

We also add what may appear like a bit of excessive infrastructure
around an object for storing this irq_source_id.  However, notice
that we only provide a way to assert the interrupt here.  A follow-on
interface will make use of the same irq_source_id to allow de-assert.
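
As a purely illustrative aside (not part of the patch), a userspace
consumer might set up a level-triggered irqfd roughly as follows.  The
helper name setup_level_irqfd and the vm_fd parameter are hypothetical,
and the sketch assumes headers from a kernel with this series applied
and that KVM_CAP_IRQFD_LEVEL has already been checked:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical sketch: create an eventfd and bind it as a level irqfd. */
static int setup_level_irqfd(int vm_fd, uint32_t gsi)
{
	struct kvm_irqfd irqfd;
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd = efd;
	irqfd.gsi = gsi;
	irqfd.flags = KVM_IRQFD_FLAG_LEVEL;	/* new flag from this patch */

	if (ioctl(vm_fd, KVM_IRQFD, &irqfd) < 0) {
		close(efd);
		return -1;
	}

	return efd;	/* writing 1 to this eventfd asserts the line */
}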

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/virtual/kvm/api.txt |    6 ++
 arch/x86/kvm/x86.c                |    1 
 include/linux/kvm.h               |    3 +
 virt/kvm/eventfd.c                |  103 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 109 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 100acde..c7267d5 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1981,6 +1981,12 @@ the guest using the specified gsi pin.  The irqfd is removed using
 the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
 and kvm_irqfd.gsi.
 
+The KVM_IRQFD_FLAG_LEVEL flag indicates the gsi input is for a level
+triggered interrupt.  In this case a new irqchip input is allocated
+which is logically OR'd with other inputs allowing multiple sources
+to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
+is only necessary on setup, teardown is identical to that above.
+KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
 
 5. The kvm_run structure
 ------------------------
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a01a424..80bed07 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2148,6 +2148,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_GET_TSC_KHZ:
 	case KVM_CAP_PCI_2_3:
 	case KVM_CAP_KVMCLOCK_CTRL:
+	case KVM_CAP_IRQFD_LEVEL:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..b2e6e4f 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -618,6 +618,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_IRQFD_LEVEL 81
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -683,6 +684,8 @@ struct kvm_xen_hvm_config {
 #endif
 
 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_IRQFD_LEVEL */
+#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
 
 struct kvm_irqfd {
 	__u32 fd;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 7d7e2aa..92aa5ba 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -36,6 +36,64 @@
 #include "iodev.h"
 
 /*
+ * An irq_source_id can be created from KVM_IRQFD for level interrupt
+ * injections and shared with other interfaces for EOI or de-assert.
+ * Create an object with reference counting to make it easy to use.
+ */
+struct _irq_source {
+	int id; /* the IRQ source ID */
+	struct kvm *kvm;
+	struct kref kref;
+};
+
+static void _irq_source_release(struct kref *kref)
+{
+	struct _irq_source *source;
+
+	source = container_of(kref, struct _irq_source, kref);
+
+	kvm_free_irq_source_id(source->kvm, source->id);
+	kfree(source);
+}
+
+static void _irq_source_put(struct _irq_source *source)
+{
+	if (source)
+		kref_put(&source->kref, _irq_source_release);
+}
+
+static struct _irq_source *__attribute__ ((used)) /* white lie for now */
+_irq_source_get(struct _irq_source *source)
+{
+	if (source)
+		kref_get(&source->kref);
+
+	return source;
+}
+
+static struct _irq_source *_irq_source_alloc(struct kvm *kvm)
+{
+	struct _irq_source *source;
+	int id;
+
+	source = kzalloc(sizeof(*source), GFP_KERNEL);
+	if (!source)
+		return ERR_PTR(-ENOMEM);
+
+	id = kvm_request_irq_source_id(kvm);
+	if (id < 0) {
+		kfree(source);
+		return ERR_PTR(id);
+	}
+
+	kref_init(&source->kref);
+	source->kvm = kvm;
+	source->id = id;
+
+	return source;
+}
+
+/*
  * --------------------------------------------------------------------
  * irqfd: Allows an fd to be used to inject an interrupt to the guest
  *
@@ -52,6 +110,8 @@ struct _irqfd {
 	/* Used for level IRQ fast-path */
 	int gsi;
 	struct work_struct inject;
+	/* IRQ source ID for level triggered irqfds */
+	struct _irq_source *source;
 	/* Used for setup/shutdown */
 	struct eventfd_ctx *eventfd;
 	struct list_head list;
@@ -62,7 +122,7 @@ struct _irqfd {
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject_edge(struct work_struct *work)
 {
 	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
@@ -71,6 +131,22 @@ irqfd_inject(struct work_struct *work)
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
 }
 
+static void
+irqfd_inject_level(struct work_struct *work)
+{
+	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+	/*
+	 * We can safely ignore the return value here.  If masked, the irr
+	 * bit is still set and will eventually be serviced.  This interface
+	 * does not guarantee immediate injection.  If coalesced, an eoi
+	 * will be coming where we can de-assert and re-inject if necessary.
+	 * NB, if you need to know if an interrupt was coalesced, this
+	 * interface is not for you.
+	 */
+	kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
+}
+
 /*
  * Race-free decouple logic (ordering is critical)
  */
@@ -96,6 +172,9 @@ irqfd_shutdown(struct work_struct *work)
 	 * It is now safe to release the object's resources
 	 */
 	eventfd_ctx_put(irqfd->eventfd);
+
+	_irq_source_put(irqfd->source);
+
 	kfree(irqfd);
 }
 
@@ -202,6 +281,7 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 {
 	struct kvm_irq_routing_table *irq_rt;
 	struct _irqfd *irqfd, *tmp;
+	struct _irq_source *source = NULL;
 	struct file *file = NULL;
 	struct eventfd_ctx *eventfd = NULL;
 	int ret;
@@ -214,7 +294,19 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	irqfd->kvm = kvm;
 	irqfd->gsi = args->gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->inject, irqfd_inject);
+
+	if (args->flags & KVM_IRQFD_FLAG_LEVEL) {
+		source = _irq_source_alloc(kvm);
+		if (IS_ERR(source)) {
+			ret = PTR_ERR(source);
+			goto fail;
+		}
+
+		irqfd->source = source;
+		INIT_WORK(&irqfd->inject, irqfd_inject_level);
+	} else
+		INIT_WORK(&irqfd->inject, irqfd_inject_edge);
+
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
 	file = eventfd_fget(args->fd);
@@ -276,10 +368,13 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 	return 0;
 
 fail:
+	if (source && !IS_ERR(source))
+		_irq_source_put(source);
+
 	if (eventfd && !IS_ERR(eventfd))
 		eventfd_ctx_put(eventfd);
 
-	if (!IS_ERR(file))
+	if (file && !IS_ERR(file))
 		fput(file);
 
 	kfree(irqfd);
@@ -340,7 +435,7 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd *args)
 int
 kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
 {
-	if (args->flags & ~KVM_IRQFD_FLAG_DEASSIGN)
+	if (args->flags & ~(KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_LEVEL))
 		return -EINVAL;
 
 	if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)



* [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-03 19:21 [PATCH v3 0/2] kvm: level irqfd and new eoifd Alex Williamson
  2012-07-03 19:21 ` [PATCH v3 1/2] kvm: Extend irqfd to support level interrupts Alex Williamson
@ 2012-07-03 19:21 ` Alex Williamson
  2012-07-04 14:00   ` Michael S. Tsirkin
  2012-07-11  9:53 ` [PATCH v3 0/2] kvm: level irqfd and new eoifd Avi Kivity
  2 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2012-07-03 19:21 UTC (permalink / raw)
  To: avi, mst, gleb; +Cc: kvm, linux-kernel, jan.kiszka

This new ioctl enables an eventfd to be triggered when an EOI is
written for a specified irqchip pin.  By default this is a simple
notification, but we can also tie the eoifd to a level irqfd, which
enables the irqchip pin to be automatically de-asserted on EOI.
This mode is particularly useful for device-assignment applications
where the unmask and notify triggers a hardware unmask.  The default
mode is most applicable to simple notify with no side-effects for
userspace usage, such as Qemu.

Here we make use of the reference counting of the _irq_source
object allowing us to share it with an irqfd and cleanup regardless
of the release order.
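
As another illustrative aside (not part of the patch), a hypothetical
userspace helper tying an eoifd to an already-established level irqfd
might look like this.  The names setup_eoifd, vm_fd and level_irqfd_efd
are made up, and the sketch assumes headers from a kernel with this
series applied:

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical sketch: receive EOIs and auto de-assert a level irqfd. */
static int setup_eoifd(int vm_fd, uint32_t gsi, int level_irqfd_efd)
{
	struct kvm_eoifd eoifd;
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;

	memset(&eoifd, 0, sizeof(eoifd));
	eoifd.fd = efd;				/* signaled on EOI */
	eoifd.gsi = gsi;			/* must match the irqfd's gsi */
	eoifd.flags = KVM_EOIFD_FLAG_LEVEL_IRQFD;
	eoifd.irqfd = level_irqfd_efd;		/* eventfd used for KVM_IRQFD */

	if (ioctl(vm_fd, KVM_EOIFD, &eoifd) < 0) {
		close(efd);
		return -1;
	}

	return efd;	/* poll/read this eventfd to observe EOIs */
}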

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/virtual/kvm/api.txt |   21 ++++
 arch/x86/kvm/x86.c                |    1 
 include/linux/kvm.h               |   14 ++
 include/linux/kvm_host.h          |   13 ++
 virt/kvm/eventfd.c                |  208 +++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c               |   11 ++
 6 files changed, 266 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index c7267d5..a38af14 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1988,6 +1988,27 @@ to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
 is only necessary on setup, teardown is identical to that above.
 KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
 
+4.77 KVM_EOIFD
+
+Capability: KVM_CAP_EOIFD
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_eoifd (in)
+Returns: 0 on success, -1 on error
+
+KVM_EOIFD allows userspace to receive interrupt EOI notification
+through an eventfd.  kvm_eoifd.fd specifies the eventfd used for
+notification and kvm_eoifd.gsi specifies the irqchip pin, similar to
+KVM_IRQFD.  The eoifd is removed using the KVM_EOIFD_FLAG_DEASSIGN
+flag, specifying both kvm_eoifd.fd and kvm_eoifd.gsi.
+
+The KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that the provided
+kvm_eoifd structure includes a valid kvm_eoifd.irqfd file descriptor
+for a level irqfd configured using the KVM_IRQFD_FLAG_LEVEL flag.
+In this mode the level interrupt is de-asserted prior to EOI eventfd
+notification.  The KVM_EOIFD_FLAG_LEVEL_IRQFD is only necessary on
+setup, teardown is identical to that above.
+
 5. The kvm_run structure
 ------------------------
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 80bed07..62d6eca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2149,6 +2149,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_PCI_2_3:
 	case KVM_CAP_KVMCLOCK_CTRL:
 	case KVM_CAP_IRQFD_LEVEL:
+	case KVM_CAP_EOIFD:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index b2e6e4f..7567e7d 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -619,6 +619,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
 #define KVM_CAP_IRQFD_LEVEL 81
+#define KVM_CAP_EOIFD 82
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -694,6 +695,17 @@ struct kvm_irqfd {
 	__u8  pad[20];
 };
 
+#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
+#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
+
+struct kvm_eoifd {
+	__u32 fd;
+	__u32 gsi;
+	__u32 flags;
+	__u32 irqfd;
+	__u8 pad[16];
+};
+
 struct kvm_clock_data {
 	__u64 clock;
 	__u32 flags;
@@ -834,6 +846,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_PPC_GET_SMMU_INFO	  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
 /* Available with KVM_CAP_PPC_ALLOC_HTAB */
 #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
+/* Available with KVM_CAP_EOIFD */
+#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
 
 /*
  * ioctls for vcpu fds
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae3b426..83472eb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -285,6 +285,10 @@ struct kvm {
 		struct list_head  items;
 	} irqfds;
 	struct list_head ioeventfds;
+	struct {
+		spinlock_t        lock;
+		struct list_head  items;
+	} eoifds;
 #endif
 	struct kvm_vm_stat stat;
 	struct kvm_arch arch;
@@ -828,6 +832,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
 void kvm_irqfd_release(struct kvm *kvm);
 void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
 int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
+int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
+void kvm_eoifd_release(struct kvm *kvm);
 
 #else
 
@@ -853,6 +859,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 	return -ENOSYS;
 }
 
+static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	return -ENOSYS;
+}
+
+static inline void kvm_eoifd_release(struct kvm *kvm) {}
+
 #endif /* CONFIG_HAVE_KVM_EVENTFD */
 
 #ifdef CONFIG_KVM_APIC_ARCHITECTURE
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 92aa5ba..2bc9768 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -62,8 +62,7 @@ static void _irq_source_put(struct _irq_source *source)
 		kref_put(&source->kref, _irq_source_release);
 }
 
-static struct _irq_source *__attribute__ ((used)) /* white lie for now */
-_irq_source_get(struct _irq_source *source)
+static struct _irq_source *_irq_source_get(struct _irq_source *source)
 {
 	if (source)
 		kref_get(&source->kref);
@@ -119,6 +118,39 @@ struct _irqfd {
 	struct work_struct shutdown;
 };
 
+static struct _irqfd *_irqfd_fdget(struct kvm *kvm, int fd)
+{
+	struct eventfd_ctx *eventfd;
+	struct _irqfd *tmp, *irqfd = NULL;
+
+	eventfd = eventfd_ctx_fdget(fd);
+	if (IS_ERR(eventfd))
+		return (struct _irqfd *)eventfd;
+
+	spin_lock_irq(&kvm->irqfds.lock);
+
+	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
+		if (tmp->eventfd == eventfd) {
+			irqfd = tmp;
+			break;
+		}
+	}
+
+	spin_unlock_irq(&kvm->irqfds.lock);
+
+	if (!irqfd) {
+		eventfd_ctx_put(eventfd);
+		return ERR_PTR(-ENODEV);
+	}
+
+	return irqfd;
+}
+
+static void _irqfd_put(struct _irqfd *irqfd)
+{
+	eventfd_ctx_put(irqfd->eventfd);
+}
+
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
@@ -387,6 +419,8 @@ kvm_eventfd_init(struct kvm *kvm)
 	spin_lock_init(&kvm->irqfds.lock);
 	INIT_LIST_HEAD(&kvm->irqfds.items);
 	INIT_LIST_HEAD(&kvm->ioeventfds);
+	spin_lock_init(&kvm->eoifds.lock);
+	INIT_LIST_HEAD(&kvm->eoifds.items);
 }
 
 /*
@@ -753,3 +787,173 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 
 	return kvm_assign_ioeventfd(kvm, args);
 }
+
+/*
+ * --------------------------------------------------------------------
+ *  eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
+ *
+ *  userspace can register GSIs with an eventfd for receiving
+ *  notification when an EOI occurs.
+ * --------------------------------------------------------------------
+ */
+
+struct _eoifd {
+	struct eventfd_ctx *eventfd;
+	struct _irq_source *source; /* for de-asserting level irqfd */
+	struct kvm *kvm;
+	struct kvm_irq_ack_notifier notifier;
+	struct list_head list;
+};
+
+static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
+{
+	struct _eoifd *eoifd;
+
+	eoifd = container_of(notifier, struct _eoifd, notifier);
+
+	/*
+	 * If the eoifd is tied to a level irqfd we de-assert it here.
+	 * The user is responsible for re-asserting it if their device
+	 * still needs attention.  For notification-only, skip this.
+	 */
+	if (eoifd->source)
+		kvm_set_irq(eoifd->kvm, eoifd->source->id,
+			    eoifd->notifier.gsi, 0);
+
+	eventfd_signal(eoifd->eventfd, 1);
+}
+
+static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	struct eventfd_ctx *eventfd;
+	struct _eoifd *eoifd, *tmp;
+	struct _irq_source *source = NULL;
+
+	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
+		struct _irqfd *irqfd = _irqfd_fdget(kvm, args->irqfd);
+		if (IS_ERR(irqfd))
+			return PTR_ERR(irqfd);
+
+		if (irqfd->gsi != args->gsi) {
+			_irqfd_put(irqfd);
+			return -EINVAL;
+		}
+
+		source = _irq_source_get(irqfd->source);
+		_irqfd_put(irqfd);
+		if (!source)
+			return -EINVAL;
+	}
+
+	eventfd = eventfd_ctx_fdget(args->fd);
+	if (IS_ERR(eventfd)) {
+		_irq_source_put(source);
+		return PTR_ERR(eventfd);
+	}
+
+	eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
+	if (!eoifd) {
+		_irq_source_put(source);
+		eventfd_ctx_put(eventfd);
+		return -ENOMEM;
+	}
+
+	INIT_LIST_HEAD(&eoifd->list);
+	eoifd->kvm = kvm;
+	eoifd->eventfd = eventfd;
+	eoifd->source = source;
+	eoifd->notifier.gsi = args->gsi;
+	eoifd->notifier.irq_acked = eoifd_event;
+
+	spin_lock_irq(&kvm->eoifds.lock);
+
+	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
+		if (eoifd->eventfd != tmp->eventfd)
+			continue;
+
+		spin_unlock_irq(&kvm->eoifds.lock);
+		_irq_source_put(source);
+		eventfd_ctx_put(eventfd);
+		kfree(eoifd);
+		return -EBUSY;
+	}
+
+	list_add_tail(&eoifd->list, &kvm->eoifds.items);
+
+	spin_unlock_irq(&kvm->eoifds.lock);
+
+	kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);
+
+	return 0;
+}
+
+static void eoifd_deactivate(struct kvm *kvm, struct _eoifd *eoifd)
+{
+	kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
+	_irq_source_put(eoifd->source);
+	eventfd_ctx_put(eoifd->eventfd);
+	kfree(eoifd);
+}
+
+void kvm_eoifd_release(struct kvm *kvm)
+{
+	spin_lock_irq(&kvm->eoifds.lock);
+
+	while (!list_empty(&kvm->eoifds.items)) {
+		struct _eoifd *eoifd;
+
+		eoifd = list_first_entry(&kvm->eoifds.items,
+					 struct _eoifd, list);
+		list_del(&eoifd->list);
+
+		/* Drop spinlocks since eoifd_deactivate can sleep */
+		spin_unlock_irq(&kvm->eoifds.lock);
+		eoifd_deactivate(kvm, eoifd);
+		spin_lock_irq(&kvm->eoifds.lock);
+	}
+
+	spin_unlock_irq(&kvm->eoifds.lock);
+}
+
+static int kvm_deassign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	struct eventfd_ctx *eventfd;
+	struct _eoifd *tmp, *eoifd = NULL;
+
+	eventfd = eventfd_ctx_fdget(args->fd);
+	if (IS_ERR(eventfd))
+		return PTR_ERR(eventfd);
+
+	spin_lock_irq(&kvm->eoifds.lock);
+
+	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
+		if (tmp->eventfd == eventfd && tmp->notifier.gsi == args->gsi) {
+			eoifd = tmp;
+			list_del(&eoifd->list);
+			break;
+		}
+	}
+
+	spin_unlock_irq(&kvm->eoifds.lock);
+
+	eventfd_ctx_put(eventfd);
+
+	if (!eoifd)
+		return -ENOENT;
+
+	eoifd_deactivate(kvm, eoifd);
+
+	return 0;
+}
+
+int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
+{
+	if (args->flags & ~(KVM_EOIFD_FLAG_DEASSIGN |
+			    KVM_EOIFD_FLAG_LEVEL_IRQFD))
+		return -EINVAL;
+
+	if (args->flags & KVM_EOIFD_FLAG_DEASSIGN)
+		return kvm_deassign_eoifd(kvm, args);
+
+	return kvm_assign_eoifd(kvm, args);
+}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b4ad14cc..5b41df1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -620,6 +620,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 
 	kvm_irqfd_release(kvm);
 
+	kvm_eoifd_release(kvm);
+
 	kvm_put_kvm(kvm);
 	return 0;
 }
@@ -2093,6 +2095,15 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
+	case KVM_EOIFD: {
+		struct kvm_eoifd data;
+
+		r = -EFAULT;
+		if (copy_from_user(&data, argp, sizeof data))
+			goto out;
+		r = kvm_eoifd(kvm, &data);
+		break;
+	}
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 		if (r == -ENOTTY)



* Re: [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-03 19:21 ` [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs Alex Williamson
@ 2012-07-04 14:00   ` Michael S. Tsirkin
  2012-07-05  4:24     ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2012-07-04 14:00 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Tue, Jul 03, 2012 at 01:21:29PM -0600, Alex Williamson wrote:
> This new ioctl enables an eventfd to be triggered when an EOI is
> written for a specified irqchip pin.  By default this is a simple
> notification, but we can also tie the eoifd to a level irqfd, which
> enables the irqchip pin to be automatically de-asserted on EOI.
> This mode is particularly useful for device-assignment applications
> where the unmask and notify triggers a hardware unmask.  The default
> mode is most applicable to simple notify with no side-effects for
> userspace usage, such as Qemu.
> 
> Here we make use of the reference counting of the _irq_source
> object allowing us to share it with an irqfd and cleanup regardless
> of the release order.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
>  Documentation/virtual/kvm/api.txt |   21 ++++
>  arch/x86/kvm/x86.c                |    1 
>  include/linux/kvm.h               |   14 ++
>  include/linux/kvm_host.h          |   13 ++
>  virt/kvm/eventfd.c                |  208 +++++++++++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c               |   11 ++
>  6 files changed, 266 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index c7267d5..a38af14 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1988,6 +1988,27 @@ to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
>  is only necessary on setup, teardown is identical to that above.
>  KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
>  
> +4.77 KVM_EOIFD
> +
> +Capability: KVM_CAP_EOIFD

Maybe add a specific capability KVM_CAP_EOIFD_LEVEL_IRQFD too?

> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_eoifd (in)
> +Returns: 0 on success, -1 on error
> +
> +KVM_EOIFD allows userspace to receive interrupt EOI notification
> +through an eventfd.  kvm_eoifd.fd specifies the eventfd used for
> +notification and kvm_eoifd.gsi specifies the irchip pin, similar to
> +KVM_IRQFD.  The eoifd is removed using the KVM_EOIFD_FLAG_DEASSIGN
> +flag, specifying both kvm_eoifd.fd and kvm_eoifd.gsi.

This text reads like it would give you EOIs for any GSI,
but you haven't actually implemented this for edge GSIs - and
making it work would bloat the datapath for fast (edge) interrupts.

What's the intended use of this part of the interface? qemu with
irqchip disabled?

> +
> +The KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that the provided
> +kvm_eoifd stucture includes a valid kvm_eoifd.irqfd file descriptor
> +for a level irqfd configured using the KVM_IRQFD_FLAG_LEVEL flag.
> +In this mode the level interrupt is de-asserted prior to EOI eventfd
> +notification.  The KVM_EOIFD_FLAG_LEVEL_IRQFD is only necessary on
> +setup, teardown is identical to that above.

It seems a single fd can not be used multiple times for
different irqfds? In that case:
1. Need to document this limitation
2. This differs from other notifiers e.g.  ioeventfd.

Pls note: if you decide to remove this limitation, we need
to add some other limit on the # of EOIFDs that a VM can have,
as it won't be limited by the # of fds anymore.

> +
>  5. The kvm_run structure
>  ------------------------
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 80bed07..62d6eca 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2149,6 +2149,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  	case KVM_CAP_PCI_2_3:
>  	case KVM_CAP_KVMCLOCK_CTRL:
>  	case KVM_CAP_IRQFD_LEVEL:
> +	case KVM_CAP_EOIFD:
>  		r = 1;
>  		break;
>  	case KVM_CAP_COALESCED_MMIO:
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index b2e6e4f..7567e7d 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -619,6 +619,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_S390_COW 79
>  #define KVM_CAP_PPC_ALLOC_HTAB 80
>  #define KVM_CAP_IRQFD_LEVEL 81
> +#define KVM_CAP_EOIFD 82
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -694,6 +695,17 @@ struct kvm_irqfd {
>  	__u8  pad[20];
>  };
>  
> +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
> +#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
> +
> +struct kvm_eoifd {
> +	__u32 fd;
> +	__u32 gsi;
> +	__u32 flags;
> +	__u32 irqfd;
> +	__u8 pad[16];
> +};
> +
>  struct kvm_clock_data {
>  	__u64 clock;
>  	__u32 flags;
> @@ -834,6 +846,8 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_PPC_GET_SMMU_INFO	  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
>  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
>  #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
> +/* Available with KVM_CAP_EOIFD */
> +#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
>  
>  /*
>   * ioctls for vcpu fds
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ae3b426..83472eb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -285,6 +285,10 @@ struct kvm {
>  		struct list_head  items;
>  	} irqfds;
>  	struct list_head ioeventfds;
> +	struct {
> +		spinlock_t        lock;

Can't this be a mutex? It seems that code will be much simpler
if you can just take the mutex, do everything and unlock.
For example you could prevent changing eoifds while irqfds
are changed.

> +		struct list_head  items;
> +	} eoifds;
>  #endif
>  	struct kvm_vm_stat stat;
>  	struct kvm_arch arch;
> @@ -828,6 +832,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
>  void kvm_irqfd_release(struct kvm *kvm);
>  void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
>  int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
> +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
> +void kvm_eoifd_release(struct kvm *kvm);
>  
>  #else
>  
> @@ -853,6 +859,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>  	return -ENOSYS;
>  }
>  
> +static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> +{
> +	return -ENOSYS;
> +}
> +
> +static inline void kvm_eoifd_release(struct kvm *kvm) {}
> +
>  #endif /* CONFIG_HAVE_KVM_EVENTFD */
>  
>  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 92aa5ba..2bc9768 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -62,8 +62,7 @@ static void _irq_source_put(struct _irq_source *source)
>  		kref_put(&source->kref, _irq_source_release);
>  }
>  
> -static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> -_irq_source_get(struct _irq_source *source)
> +static struct _irq_source *_irq_source_get(struct _irq_source *source)
>  {
>  	if (source)
>  		kref_get(&source->kref);
> @@ -119,6 +118,39 @@ struct _irqfd {
>  	struct work_struct shutdown;
>  };
>  
> +static struct _irqfd *_irqfd_fdget(struct kvm *kvm, int fd)
> +{
> +	struct eventfd_ctx *eventfd;
> +	struct _irqfd *tmp, *irqfd = NULL;
> +
> +	eventfd = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(eventfd))
> +		return (struct _irqfd *)eventfd;
> +
> +	spin_lock_irq(&kvm->irqfds.lock);
> +
> +	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> +		if (tmp->eventfd == eventfd) {
> +			irqfd = tmp;
> +			break;
> +		}
> +	}
> +
> +	spin_unlock_irq(&kvm->irqfds.lock);
> +
> +	if (!irqfd) {
> +		eventfd_ctx_put(eventfd);
> +		return ERR_PTR(-ENODEV);
> +	}

Can irqfd get deassigned at this point?
Could one way to fix be to take eoifd.lock in kvm_irqfd?

> +
> +	return irqfd;
> +}
> +
> +static void _irqfd_put(struct _irqfd *irqfd)
> +{
> +	eventfd_ctx_put(irqfd->eventfd);
> +}
> +
>  static struct workqueue_struct *irqfd_cleanup_wq;
>  
>  static void
> @@ -387,6 +419,8 @@ kvm_eventfd_init(struct kvm *kvm)
>  	spin_lock_init(&kvm->irqfds.lock);
>  	INIT_LIST_HEAD(&kvm->irqfds.items);
>  	INIT_LIST_HEAD(&kvm->ioeventfds);
> +	spin_lock_init(&kvm->eoifds.lock);
> +	INIT_LIST_HEAD(&kvm->eoifds.items);
>  }
>  
>  /*
> @@ -753,3 +787,173 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
>  
>  	return kvm_assign_ioeventfd(kvm, args);
>  }
> +
> +/*
> + * --------------------------------------------------------------------
> + *  eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
> + *
> + *  userspace can register GSIs with an eventfd for receiving
> + *  notification when an EOI occurs.
> + * --------------------------------------------------------------------
> + */
> +
> +struct _eoifd {
> +	struct eventfd_ctx *eventfd;
> +	struct _irq_source *source; /* for de-asserting level irqfd */
> +	struct kvm *kvm;
> +	struct kvm_irq_ack_notifier notifier;
> +	struct list_head list;
> +};
> +
> +static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
> +{
> +	struct _eoifd *eoifd;
> +
> +	eoifd = container_of(notifier, struct _eoifd, notifier);
> +
> +	/*
> +	 * If the eoifd is tied to a level irqfd we de-assert it here.
> +	 * The user is responsible for re-asserting it if their device
> +	 * still needs attention.  For notification-only, skip this.
> +	 */
> +	if (eoifd->source)
> +		kvm_set_irq(eoifd->kvm, eoifd->source->id,
> +			    eoifd->notifier.gsi, 0);
> +


This will fire even if the source in question did not assert an
interrupt in the first place.  This won't scale well if
the interrupt is shared. Don't we need to filter by source state?

> +	eventfd_signal(eoifd->eventfd, 1);
> +}
> +
> +static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> +{
> +	struct eventfd_ctx *eventfd;
> +	struct _eoifd *eoifd, *tmp;
> +	struct _irq_source *source = NULL;
> +
> +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> +		struct _irqfd *irqfd = _irqfd_fdget(kvm, args->irqfd);
> +		if (IS_ERR(irqfd))
> +			return PTR_ERR(irqfd);
> +
> +		if (irqfd->gsi != args->gsi) {
> +			_irqfd_put(irqfd);
> +			return -EINVAL;
> +		}
> +

This requirement does not seem to be documented.
If we require this, do we need to pass in gsi at all?


> +		source = _irq_source_get(irqfd->source);
> +		_irqfd_put(irqfd);
> +		if (!source)
> +			return -EINVAL;
> +	}
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd)) {
> +		_irq_source_put(source);
> +		return PTR_ERR(eventfd);
> +	}
> +
> +	eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
> +	if (!eoifd) {
> +		_irq_source_put(source);
> +		eventfd_ctx_put(eventfd);
> +		return -ENOMEM;
> +	}
> +
> +	INIT_LIST_HEAD(&eoifd->list);
> +	eoifd->kvm = kvm;
> +	eoifd->eventfd = eventfd;
> +	eoifd->source = source;
> +	eoifd->notifier.gsi = args->gsi;
> +	eoifd->notifier.irq_acked = eoifd_event;
> +
> +	spin_lock_irq(&kvm->eoifds.lock);
> +
> +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> +		if (eoifd->eventfd != tmp->eventfd)
> +			continue;
> +
> +		spin_unlock_irq(&kvm->eoifds.lock);
> +		_irq_source_put(source);
> +		eventfd_ctx_put(eventfd);
> +		kfree(eoifd);
> +		return -EBUSY;


Please rewrite this function using labels for error handling.


> +	}
> +
> +	list_add_tail(&eoifd->list, &kvm->eoifds.items);
> +
> +	spin_unlock_irq(&kvm->eoifds.lock);
> +
> +	kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);
> +
> +	return 0;
> +}
> +
> +static void eoifd_deactivate(struct kvm *kvm, struct _eoifd *eoifd)

Pls find a less confusing name for this.
It just frees the eoifd. No tricky deactivation here (unlike irqfd).

> +{
> +	kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
> +	_irq_source_put(eoifd->source);
> +	eventfd_ctx_put(eoifd->eventfd);
> +	kfree(eoifd);
> +}
> +
> +void kvm_eoifd_release(struct kvm *kvm)
> +{
> +	spin_lock_irq(&kvm->eoifds.lock);
> +
> +	while (!list_empty(&kvm->eoifds.items)) {
> +		struct _eoifd *eoifd;
> +
> +		eoifd = list_first_entry(&kvm->eoifds.items,
> +					 struct _eoifd, list);
> +		list_del(&eoifd->list);
> +
> +		/* Drop spinlocks since eoifd_deactivate can sleep */
> +		spin_unlock_irq(&kvm->eoifds.lock);
> +		eoifd_deactivate(kvm, eoifd);
> +		spin_lock_irq(&kvm->eoifds.lock);
> +	}
> +
> +	spin_unlock_irq(&kvm->eoifds.lock);
> +}
> +
> +static int kvm_deassign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> +{
> +	struct eventfd_ctx *eventfd;
> +	struct _eoifd *tmp, *eoifd = NULL;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	spin_lock_irq(&kvm->eoifds.lock);
> +
> +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> +		if (tmp->eventfd == eventfd && tmp->notifier.gsi == args->gsi) {

This is repeated several times.
Looks like an eoifd_collision function is called for.

> +			eoifd = tmp;
> +			list_del(&eoifd->list);
> +			break;
> +		}
> +	}
> +
> +	spin_unlock_irq(&kvm->eoifds.lock);
> +
> +	eventfd_ctx_put(eventfd);
> +
> +	if (!eoifd)
> +		return -ENOENT;
> +
> +	eoifd_deactivate(kvm, eoifd);
> +
> +	return 0;
> +}
> +
> +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> +{
> +	if (args->flags & ~(KVM_EOIFD_FLAG_DEASSIGN |
> +			    KVM_EOIFD_FLAG_LEVEL_IRQFD))
> +		return -EINVAL;
> +
> +	if (args->flags & KVM_EOIFD_FLAG_DEASSIGN)
> +		return kvm_deassign_eoifd(kvm, args);
> +
> +	return kvm_assign_eoifd(kvm, args);
> +}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b4ad14cc..5b41df1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -620,6 +620,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
>  
>  	kvm_irqfd_release(kvm);
>  
> +	kvm_eoifd_release(kvm);
> +
>  	kvm_put_kvm(kvm);
>  	return 0;
>  }
> @@ -2093,6 +2095,15 @@ static long kvm_vm_ioctl(struct file *filp,
>  		break;
>  	}
>  #endif
> +	case KVM_EOIFD: {
> +		struct kvm_eoifd data;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&data, argp, sizeof data))
> +			goto out;
> +		r = kvm_eoifd(kvm, &data);
> +		break;
> +	}
>  	default:
>  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
>  		if (r == -ENOTTY)


* Re: [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-04 14:00   ` Michael S. Tsirkin
@ 2012-07-05  4:24     ` Alex Williamson
  2012-07-05 15:53       ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2012-07-05  4:24 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Wed, 2012-07-04 at 17:00 +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 03, 2012 at 01:21:29PM -0600, Alex Williamson wrote:
> > This new ioctl enables an eventfd to be triggered when an EOI is
> > written for a specified irqchip pin.  By default this is a simple
> > notification, but we can also tie the eoifd to a level irqfd, which
> > enables the irqchip pin to be automatically de-asserted on EOI.
> > This mode is particularly useful for device-assignment applications
> > where the unmask and notify triggers a hardware unmask.  The default
> > mode is most applicable to simple notify with no side-effects for
> > userspace usage, such as Qemu.
> > 
> > Here we make use of the reference counting of the _irq_source
> > object allowing us to share it with an irqfd and cleanup regardless
> > of the release order.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> >  Documentation/virtual/kvm/api.txt |   21 ++++
> >  arch/x86/kvm/x86.c                |    1 
> >  include/linux/kvm.h               |   14 ++
> >  include/linux/kvm_host.h          |   13 ++
> >  virt/kvm/eventfd.c                |  208 +++++++++++++++++++++++++++++++++++++
> >  virt/kvm/kvm_main.c               |   11 ++
> >  6 files changed, 266 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index c7267d5..a38af14 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -1988,6 +1988,27 @@ to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> >  is only necessary on setup, teardown is identical to that above.
> >  KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> >  
> > +4.77 KVM_EOIFD
> > +
> > +Capability: KVM_CAP_EOIFD
> 
> Maybe add a specific capability KVM_CAP_EOIFD_LEVEL_IRQFD too?

Good idea, allows them to be split later.

> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_eoifd (in)
> > +Returns: 0 on success, -1 on error
> > +
> > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > +through an eventfd.  kvm_eoifd.fd specifies the eventfd used for
> > +notification and kvm_eoifd.gsi specifies the irchip pin, similar to
> > +KVM_IRQFD.  The eoifd is removed using the KVM_EOIFD_FLAG_DEASSIGN
> > +flag, specifying both kvm_eoifd.fd and kvm_eoifd.gsi.
> 
> This text reads like it would give you EOIs for any GSI,
> but you haven't actually implemented this for edge GSIs - and
> making it work would bloat the datapath for fast (edge) interrupts.

I do allow you to register any GSI, but you won't be getting EOIs unless
it's operating in level triggered mode.  Perhaps it's best to specify it
as unsupported and let some future patch create a new capability if
support is added.  I'll add a comment.

> What's the intended use of this part of the interface? qemu with
> irqchip disabled?

VFIO should not be dependent on KVM, therefore when kvm is not enabled
we need to add an interface in qemu for devices to be notified of eoi.
This doesn't currently exist.  VFIO can take additional advantage of
irqchip when it is enabled, thus the interface below.  However, I don't
feel I can propose an eoi notifier in qemu that stops working as soon as
irqchip is enabled, even if I'm the only user.  This theoretical qemu
eoi notifier could then use the above when irqchip is enabled.

> > +
> > +The KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that the provided
> > +kvm_eoifd stucture includes a valid kvm_eoifd.irqfd file descriptor
> > +for a level irqfd configured using the KVM_IRQFD_FLAG_LEVEL flag.
> > +In this mode the level interrupt is de-asserted prior to EOI eventfd
> > +notification.  The KVM_EOIFD_FLAG_LEVEL_IRQFD is only necessary on
> > +setup, teardown is identical to that above.
> 
> It seems a single fd can not be used multiple times for
> different irqfds? In that case:
> 1. Need to document this limitation

Ok

> 2. This differs from other notifiers e.g.  ioeventfd.

But the same as an irqfd.  Neither use case I'm thinking of needs to
allow eventfds to be re-used and knowing that an eventfd is unique on
our list makes matching it much easier.  For instance, if we have an
eoifd bound to a level irqfd and it's being de-assigned, we'd have to
require the eoifd is de-assigned before the irqfd so that we can
validate matching eventfd, gsi, and irqfd.  That gets very racy when
userspace aborts.

> Pls note: if you decide to remove this limitation, we need
> to add some other limit on the # of EOIFDs that VM can have
> as won't be limited by # of fds anymore.

I don't plan on changing this unless someone has an argument why it
would be useful.

> > +
> >  5. The kvm_run structure
> >  ------------------------
> >  
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 80bed07..62d6eca 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2149,6 +2149,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> >  	case KVM_CAP_PCI_2_3:
> >  	case KVM_CAP_KVMCLOCK_CTRL:
> >  	case KVM_CAP_IRQFD_LEVEL:
> > +	case KVM_CAP_EOIFD:
> >  		r = 1;
> >  		break;
> >  	case KVM_CAP_COALESCED_MMIO:
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index b2e6e4f..7567e7d 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -619,6 +619,7 @@ struct kvm_ppc_smmu_info {
> >  #define KVM_CAP_S390_COW 79
> >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> >  #define KVM_CAP_IRQFD_LEVEL 81
> > +#define KVM_CAP_EOIFD 82
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > @@ -694,6 +695,17 @@ struct kvm_irqfd {
> >  	__u8  pad[20];
> >  };
> >  
> > +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
> > +#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
> > +
> > +struct kvm_eoifd {
> > +	__u32 fd;
> > +	__u32 gsi;
> > +	__u32 flags;
> > +	__u32 irqfd;
> > +	__u8 pad[16];
> > +};
> > +
> >  struct kvm_clock_data {
> >  	__u64 clock;
> >  	__u32 flags;
> > @@ -834,6 +846,8 @@ struct kvm_s390_ucas_mapping {
> >  #define KVM_PPC_GET_SMMU_INFO	  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
> >  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
> >  #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
> > +/* Available with KVM_CAP_EOIFD */
> > +#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
> >  
> >  /*
> >   * ioctls for vcpu fds
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ae3b426..83472eb 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -285,6 +285,10 @@ struct kvm {
> >  		struct list_head  items;
> >  	} irqfds;
> >  	struct list_head ioeventfds;
> > +	struct {
> > +		spinlock_t        lock;
> 
> Can't this be a mutex? It seems that code will be much simpler
> if you can just take the mutex, do everything and unlock.

Yeah, looks like it could.

> For example you could prevent changing eoifds while irqfds
> are changed.

I'm not sure we need to be that strict, but we could be sure to get a
reference to the _source_id using the irqfd lock that way... actually we
could do that in the current locking scheme too.
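
A rough sketch of that idea (hypothetical, not posted code): do the
lookup and take the _irq_source reference in one critical section under
irqfds.lock, so a racing deassign can't drop the last reference between
the search and the get:

static struct _irq_source *_irqfd_source_get(struct kvm *kvm,
					     struct eventfd_ctx *eventfd)
{
	struct _irqfd *tmp;
	struct _irq_source *source = NULL;

	spin_lock_irq(&kvm->irqfds.lock);

	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
		if (tmp->eventfd == eventfd) {
			/* grab the reference while the irqfd can't go away */
			source = _irq_source_get(tmp->source);
			break;
		}
	}

	spin_unlock_irq(&kvm->irqfds.lock);

	return source;
}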

> > +		struct list_head  items;
> > +	} eoifds;
> >  #endif
> >  	struct kvm_vm_stat stat;
> >  	struct kvm_arch arch;
> > @@ -828,6 +832,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
> >  void kvm_irqfd_release(struct kvm *kvm);
> >  void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
> >  int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
> > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
> > +void kvm_eoifd_release(struct kvm *kvm);
> >  
> >  #else
> >  
> > @@ -853,6 +859,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> >  	return -ENOSYS;
> >  }
> >  
> > +static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > +{
> > +	return -ENOSYS;
> > +}
> > +
> > +static inline void kvm_eoifd_release(struct kvm *kvm) {}
> > +
> >  #endif /* CONFIG_HAVE_KVM_EVENTFD */
> >  
> >  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > index 92aa5ba..2bc9768 100644
> > --- a/virt/kvm/eventfd.c
> > +++ b/virt/kvm/eventfd.c
> > @@ -62,8 +62,7 @@ static void _irq_source_put(struct _irq_source *source)
> >  		kref_put(&source->kref, _irq_source_release);
> >  }
> >  
> > -static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > -_irq_source_get(struct _irq_source *source)
> > +static struct _irq_source *_irq_source_get(struct _irq_source *source)
> >  {
> >  	if (source)
> >  		kref_get(&source->kref);
> > @@ -119,6 +118,39 @@ struct _irqfd {
> >  	struct work_struct shutdown;
> >  };
> >  
> > +static struct _irqfd *_irqfd_fdget(struct kvm *kvm, int fd)
> > +{
> > +	struct eventfd_ctx *eventfd;
> > +	struct _irqfd *tmp, *irqfd = NULL;
> > +
> > +	eventfd = eventfd_ctx_fdget(fd);
> > +	if (IS_ERR(eventfd))
> > +		return (struct _irqfd *)eventfd;
> > +
> > +	spin_lock_irq(&kvm->irqfds.lock);
> > +
> > +	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > +		if (tmp->eventfd == eventfd) {
> > +			irqfd = tmp;
> > +			break;
> > +		}
> > +	}
> > +
> > +	spin_unlock_irq(&kvm->irqfds.lock);
> > +
> > +	if (!irqfd) {
> > +		eventfd_ctx_put(eventfd);
> > +		return ERR_PTR(-ENODEV);
> > +	}
> 
> Can irqfd get dassigned at this point?
> Could one way to fix be to take eoifd.lock in kvm_irqfd?

Yeah, I'm concerned this isn't a sufficient reference to the _irqfd.
I'll play with locking; I think we actually want the reverse, holding
irqfd.lock for the search through getting a reference to the
_irq_source.

> > +
> > +	return irqfd;
> > +}
> > +
> > +static void _irqfd_put(struct _irqfd *irqfd)
> > +{
> > +	eventfd_ctx_put(irqfd->eventfd);
> > +}
> > +
> >  static struct workqueue_struct *irqfd_cleanup_wq;
> >  
> >  static void
> > @@ -387,6 +419,8 @@ kvm_eventfd_init(struct kvm *kvm)
> >  	spin_lock_init(&kvm->irqfds.lock);
> >  	INIT_LIST_HEAD(&kvm->irqfds.items);
> >  	INIT_LIST_HEAD(&kvm->ioeventfds);
> > +	spin_lock_init(&kvm->eoifds.lock);
> > +	INIT_LIST_HEAD(&kvm->eoifds.items);
> >  }
> >  
> >  /*
> > @@ -753,3 +787,173 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> >  
> >  	return kvm_assign_ioeventfd(kvm, args);
> >  }
> > +
> > +/*
> > + * --------------------------------------------------------------------
> > + *  eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
> > + *
> > + *  userspace can register GSIs with an eventfd for receiving
> > + *  notification when an EOI occurs.
> > + * --------------------------------------------------------------------
> > + */
> > +
> > +struct _eoifd {
> > +	struct eventfd_ctx *eventfd;
> > +	struct _irq_source *source; /* for de-asserting level irqfd */
> > +	struct kvm *kvm;
> > +	struct kvm_irq_ack_notifier notifier;
> > +	struct list_head list;
> > +};
> > +
> > +static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
> > +{
> > +	struct _eoifd *eoifd;
> > +
> > +	eoifd = container_of(notifier, struct _eoifd, notifier);
> > +
> > +	/*
> > +	 * If the eoifd is tied to a level irqfd we de-assert it here.
> > +	 * The user is responsible for re-asserting it if their device
> > +	 * still needs attention.  For notification-only, skip this.
> > +	 */
> > +	if (eoifd->source)
> > +		kvm_set_irq(eoifd->kvm, eoifd->source->id,
> > +			    eoifd->notifier.gsi, 0);
> > +
> 
> 
> This will fire even if the source in question did not assert an
> interrupt in the first place.  This won't scale well if
> the interrupt is shared. Don't we need to filter by source state?

Extra spurious EOIs are better than dropped EOIs.  However, it does seem
we could add an atomic_t to _irq_source tracking the assertion state of
our source id.  I'll experiment with that.
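
Something along these lines, as a rough sketch only (the asserted field
and the rework of eoifd_event are hypothetical, not posted code):

struct _irq_source {
	int id;
	struct kvm *kvm;
	struct kref kref;
	atomic_t asserted;	/* set to 1 by irqfd_inject_level() */
};

static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
{
	struct _eoifd *eoifd = container_of(notifier, struct _eoifd, notifier);

	if (eoifd->source) {
		/* ignore EOIs for interrupts our source didn't assert */
		if (!atomic_cmpxchg(&eoifd->source->asserted, 1, 0))
			return;
		kvm_set_irq(eoifd->kvm, eoifd->source->id,
			    eoifd->notifier.gsi, 0);
	}

	eventfd_signal(eoifd->eventfd, 1);
}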

> > +	eventfd_signal(eoifd->eventfd, 1);
> > +}
> > +
> > +static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > +{
> > +	struct eventfd_ctx *eventfd;
> > +	struct _eoifd *eoifd, *tmp;
> > +	struct _irq_source *source = NULL;
> > +
> > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > +		struct _irqfd *irqfd = _irqfd_fdget(kvm, args->irqfd);
> > +		if (IS_ERR(irqfd))
> > +			return PTR_ERR(irqfd);
> > +
> > +		if (irqfd->gsi != args->gsi) {
> > +			_irqfd_put(irqfd);
> > +			return -EINVAL;
> > +		}
> > +
> 
> This requirement does not seem to be documented.
> If we require this, do we need to pass in gsi at all?

It seemed obvious to me that when using KVM_EOIFD_FLAG_LEVEL_IRQFD the
gsi needs to match that used for the kvm_irqfd.  Suppose I should never
assume anything is obvious in an API spec.  I'll add a comment.  I think
it would make the user interface more confusing to specify that gsi is
not required when using a level irqfd, besides it gives us an easy
sanity check.

> > +		source = _irq_source_get(irqfd->source);
> > +		_irqfd_put(irqfd);
> > +		if (!source)
> > +			return -EINVAL;
> > +	}
> > +
> > +	eventfd = eventfd_ctx_fdget(args->fd);
> > +	if (IS_ERR(eventfd)) {
> > +		_irq_source_put(source);
> > +		return PTR_ERR(eventfd);
> > +	}
> > +
> > +	eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
> > +	if (!eoifd) {
> > +		_irq_source_put(source);
> > +		eventfd_ctx_put(eventfd);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	INIT_LIST_HEAD(&eoifd->list);
> > +	eoifd->kvm = kvm;
> > +	eoifd->eventfd = eventfd;
> > +	eoifd->source = source;
> > +	eoifd->notifier.gsi = args->gsi;
> > +	eoifd->notifier.irq_acked = eoifd_event;
> > +
> > +	spin_lock_irq(&kvm->eoifds.lock);
> > +
> > +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> > +		if (eoifd->eventfd != tmp->eventfd)
> > +			continue;
> > +
> > +		spin_unlock_irq(&kvm->eoifds.lock);
> > +		_irq_source_put(source);
> > +		eventfd_ctx_put(eventfd);
> > +		kfree(eoifd);
> > +		return -EBUSY;
> 
> 
> Please rewrite this function using labels for error handling.

Ok

> > +	}
> > +
> > +	list_add_tail(&eoifd->list, &kvm->eoifds.items);
> > +
> > +	spin_unlock_irq(&kvm->eoifds.lock);
> > +
> > +	kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);
> > +
> > +	return 0;
> > +}
> > +
> > +static void eoifd_deactivate(struct kvm *kvm, struct _eoifd *eoifd)
> 
> Pls find a less confusing name for this.
> It just frees the eoifd. No tricky deactivation here (unlike irqfd).

Uh, confusing?  It does actually deactivate it by unregistering the irq
ack notifier.  I was attempting to be consistent with the function
naming by following irqfd_deactivate, I didn't realize there was a
complexity requirement associated with the name.

> > +{
> > +	kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
> > +	_irq_source_put(eoifd->source);
> > +	eventfd_ctx_put(eoifd->eventfd);
> > +	kfree(eoifd);
> > +}
> > +
> > +void kvm_eoifd_release(struct kvm *kvm)
> > +{
> > +	spin_lock_irq(&kvm->eoifds.lock);
> > +
> > +	while (!list_empty(&kvm->eoifds.items)) {
> > +		struct _eoifd *eoifd;
> > +
> > +		eoifd = list_first_entry(&kvm->eoifds.items,
> > +					 struct _eoifd, list);
> > +		list_del(&eoifd->list);
> > +
> > +		/* Drop spinlocks since eoifd_deactivate can sleep */
> > +		spin_unlock_irq(&kvm->eoifds.lock);
> > +		eoifd_deactivate(kvm, eoifd);
> > +		spin_lock_irq(&kvm->eoifds.lock);
> > +	}
> > +
> > +	spin_unlock_irq(&kvm->eoifds.lock);
> > +}
> > +
> > +static int kvm_deassign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > +{
> > +	struct eventfd_ctx *eventfd;
> > +	struct _eoifd *tmp, *eoifd = NULL;
> > +
> > +	eventfd = eventfd_ctx_fdget(args->fd);
> > +	if (IS_ERR(eventfd))
> > +		return PTR_ERR(eventfd);
> > +
> > +	spin_lock_irq(&kvm->eoifds.lock);
> > +
> > +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> > +		if (tmp->eventfd == eventfd && tmp->notifier.gsi == args->gsi) {
> 
> This is repeated several times.
> looks like eoifd_collision function is called for.

The tests aren't quite identical.  Here we test eventfd and gsi, the
latter primarily to validate kvm_eoifd parameters.  Further above we
test that eventfd is unique on addition without comparing gsi.  Both of
these match the existing precedent set by irqfd.  With _irqfd_fdget we
do have two searches of the irqfds.items list comparing only eventfd...
I'll see about making a trivial helper there.  Thanks,

Alex

> > +			eoifd = tmp;
> > +			list_del(&eoifd->list);
> > +			break;
> > +		}
> > +	}
> > +
> > +	spin_unlock_irq(&kvm->eoifds.lock);
> > +
> > +	eventfd_ctx_put(eventfd);
> > +
> > +	if (!eoifd)
> > +		return -ENOENT;
> > +
> > +	eoifd_deactivate(kvm, eoifd);
> > +
> > +	return 0;
> > +}
> > +
> > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > +{
> > +	if (args->flags & ~(KVM_EOIFD_FLAG_DEASSIGN |
> > +			    KVM_EOIFD_FLAG_LEVEL_IRQFD))
> > +		return -EINVAL;
> > +
> > +	if (args->flags & KVM_EOIFD_FLAG_DEASSIGN)
> > +		return kvm_deassign_eoifd(kvm, args);
> > +
> > +	return kvm_assign_eoifd(kvm, args);
> > +}
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index b4ad14cc..5b41df1 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -620,6 +620,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
> >  
> >  	kvm_irqfd_release(kvm);
> >  
> > +	kvm_eoifd_release(kvm);
> > +
> >  	kvm_put_kvm(kvm);
> >  	return 0;
> >  }
> > @@ -2093,6 +2095,15 @@ static long kvm_vm_ioctl(struct file *filp,
> >  		break;
> >  	}
> >  #endif
> > +	case KVM_EOIFD: {
> > +		struct kvm_eoifd data;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&data, argp, sizeof data))
> > +			goto out;
> > +		r = kvm_eoifd(kvm, &data);
> > +		break;
> > +	}
> >  	default:
> >  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> >  		if (r == -ENOTTY)





* Re: [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-05  4:24     ` Alex Williamson
@ 2012-07-05 15:53       ` Michael S. Tsirkin
  2012-07-09 20:35         ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2012-07-05 15:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Wed, Jul 04, 2012 at 10:24:54PM -0600, Alex Williamson wrote:
> On Wed, 2012-07-04 at 17:00 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 03, 2012 at 01:21:29PM -0600, Alex Williamson wrote:
> > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > written for a specified irqchip pin.  By default this is a simple
> > > notification, but we can also tie the eoifd to a level irqfd, which
> > > enables the irqchip pin to be automatically de-asserted on EOI.
> > > This mode is particularly useful for device-assignment applications
> > > where the unmask and notify triggers a hardware unmask.  The default
> > > mode is most applicable to simple notify with no side-effects for
> > > userspace usage, such as Qemu.
> > > 
> > > Here we make use of the reference counting of the _irq_source
> > > object allowing us to share it with an irqfd and cleanup regardless
> > > of the release order.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > > 
> > >  Documentation/virtual/kvm/api.txt |   21 ++++
> > >  arch/x86/kvm/x86.c                |    1 
> > >  include/linux/kvm.h               |   14 ++
> > >  include/linux/kvm_host.h          |   13 ++
> > >  virt/kvm/eventfd.c                |  208 +++++++++++++++++++++++++++++++++++++
> > >  virt/kvm/kvm_main.c               |   11 ++
> > >  6 files changed, 266 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > index c7267d5..a38af14 100644
> > > --- a/Documentation/virtual/kvm/api.txt
> > > +++ b/Documentation/virtual/kvm/api.txt
> > > @@ -1988,6 +1988,27 @@ to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> > >  is only necessary on setup, teardown is identical to that above.
> > >  KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > >  
> > > +4.77 KVM_EOIFD
> > > +
> > > +Capability: KVM_CAP_EOIFD
> > 
> > Maybe add a specific capability KVM_CAP_EOIFD_LEVEL_IRQFD too?
> 
> Good idea, allows them to be split later.
> 
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_eoifd (in)
> > > +Returns: 0 on success, -1 on error
> > > +
> > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > +through an eventfd.  kvm_eoifd.fd specifies the eventfd used for
> > > +notification and kvm_eoifd.gsi specifies the irchip pin, similar to
> > > +KVM_IRQFD.  The eoifd is removed using the KVM_EOIFD_FLAG_DEASSIGN
> > > +flag, specifying both kvm_eoifd.fd and kvm_eoifd.gsi.
> > 
> > This text reads like it would give you EOIs for any GSI,
> > but you haven't actually implemented this for edge GSIs - and
> > making it work would bloat the datapath for fast (edge) interrupts.
> 
> I do allow you to register any GSI, but you won't be getting EOIs unless
> it's operating in level triggered mode.  Perhaps it's best to specify it
> as unsupported and let some future patch create a new capability if
> support is added.  I'll add a comment.
> 
> > What's the intended use of this part of the interface? qemu with
> > irqchip disabled?
> 
> VFIO should not be dependent on KVM, therefore when kvm is not enabled
> we need to add an interface in qemu for devices to be notified of eoi.
>
> This doesn't currently exist.  VFIO can take additional advantage of
> irqchip when it is enabled, thus the interface below.  However, I don't
> feel I can propose an eoi notifier in qemu that stops working as soon as
> irqchip is enabled, even if I'm the only user.  This theoretical qemu
> eoi notifier could then use the above when irqchip is enabled.

Well internal qemu APIs are qemu's problem and can be addressed there.
For example, can we make it mimic our interface: make qemu EOI notifier
accept an object that includes qemu_irq without irqchip and irqfd with?

In other words, adding an interface with no users looks weird.

> > > +
> > > +The KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that the provided
> > > +kvm_eoifd structure includes a valid kvm_eoifd.irqfd file descriptor
> > > +for a level irqfd configured using the KVM_IRQFD_FLAG_LEVEL flag.
> > > +In this mode the level interrupt is de-asserted prior to EOI eventfd
> > > +notification.  The KVM_EOIFD_FLAG_LEVEL_IRQFD is only necessary on
> > > +setup, teardown is identical to that above.
> > 
> > It seems a single fd can not be used multiple times for
> > different irqfds? In that case:
> > 1. Need to document this limitation
> 
> Ok
> 
> > 2. This differs from other notifiers e.g.  ioeventfd.
> 
> But the same as an irqfd.

However irqfd is about interrupting guest. eoifd is more like ioeventfd
really: kvm writes ioeventfd/eoifd but reads irqfd.

>  Neither use case I'm thinking of needs to
> allow eventfds to be re-used and knowing that an eventfd is unique on
> our list makes matching it much easier.  For instance, if we have an
> eoifd bound to a level irqfd and it's being de-assigned, we'd have to
> require the eoifd is de-assigned before the irqfd so that we can
> validate matching eventfd, gsi, and irqfd.  That gets very racy when
> userspace aborts.

Why can't eoifd simply keep a reference to irqfd? Or register to
the hangup event.

> > Pls note: if you decide to remove this limitation, we need
> > to add some other limit on the # of EOIFDs that VM can have
> > as won't be limited by # of fds anymore.
> 
> I don't plan on changing this unless someone has an argument why it
> would be useful.

For example you can use same eventfd as ioeventfd
and eoifd.  Why not two eoifds?

I need to think about specifics more but it just breaks how eventfd is
normally used. It should be an aggregator that can count any number of writers and
support 1 reader.

Let me draw this:

W1 ---->\
W2 ----> +--> eventfd ----> R
W3 ---->/


This explains the difference: in irqfd kvm reads eventfd so only 1 is
supported. In ioeventfd (and eoifd) kvm is the writer, so any number
should be supported.
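The aggregation above is just standard eventfd(2) behavior; a tiny standalone
userspace sketch of the diagram (plain counter mode, no KVM involved):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
        uint64_t val;
        int i, efd = eventfd(0, 0);     /* counter mode, initial value 0 */

        /* Three independent "writers" (W1..W3) signal the same eventfd. */
        for (i = 0; i < 3; i++) {
                uint64_t one = 1;
                write(efd, &one, sizeof(one));
        }

        /* The single reader sees the accumulated count and resets it to 0. */
        read(efd, &val, sizeof(val));
        printf("reader observed %llu events\n", (unsigned long long)val);

        close(efd);
        return 0;
}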


> > > +
> > >  5. The kvm_run structure
> > >  ------------------------
> > >  
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 80bed07..62d6eca 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2149,6 +2149,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > >  	case KVM_CAP_PCI_2_3:
> > >  	case KVM_CAP_KVMCLOCK_CTRL:
> > >  	case KVM_CAP_IRQFD_LEVEL:
> > > +	case KVM_CAP_EOIFD:
> > >  		r = 1;
> > >  		break;
> > >  	case KVM_CAP_COALESCED_MMIO:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index b2e6e4f..7567e7d 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -619,6 +619,7 @@ struct kvm_ppc_smmu_info {
> > >  #define KVM_CAP_S390_COW 79
> > >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > >  #define KVM_CAP_IRQFD_LEVEL 81
> > > +#define KVM_CAP_EOIFD 82
> > >  
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > >  
> > > @@ -694,6 +695,17 @@ struct kvm_irqfd {
> > >  	__u8  pad[20];
> > >  };
> > >  
> > > +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
> > > +#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
> > > +
> > > +struct kvm_eoifd {
> > > +	__u32 fd;
> > > +	__u32 gsi;
> > > +	__u32 flags;
> > > +	__u32 irqfd;
> > > +	__u8 pad[16];
> > > +};
> > > +
> > >  struct kvm_clock_data {
> > >  	__u64 clock;
> > >  	__u32 flags;
> > > @@ -834,6 +846,8 @@ struct kvm_s390_ucas_mapping {
> > >  #define KVM_PPC_GET_SMMU_INFO	  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
> > >  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
> > >  #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
> > > +/* Available with KVM_CAP_EOIFD */
> > > +#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
> > >  
> > >  /*
> > >   * ioctls for vcpu fds
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index ae3b426..83472eb 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -285,6 +285,10 @@ struct kvm {
> > >  		struct list_head  items;
> > >  	} irqfds;
> > >  	struct list_head ioeventfds;
> > > +	struct {
> > > +		spinlock_t        lock;
> > 
> > Can't this be a mutex? It seems that code will be much simpler
> > if you can just take the mutex, do everything and unlock.
> 
> Yeah, looks like it could.
> 
> > For example you could prevent changing eoifds while irqfds
> > are changed.
> 
> I'm not sure we need to be that strict, but we could be sure to get a
> reference to the _source_id using the irqfd lock that way... actually we
> could do that in the current locking scheme too.

Just a suggestion.

> > > +		struct list_head  items;
> > > +	} eoifds;
> > >  #endif
> > >  	struct kvm_vm_stat stat;
> > >  	struct kvm_arch arch;
> > > @@ -828,6 +832,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
> > >  void kvm_irqfd_release(struct kvm *kvm);
> > >  void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
> > >  int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
> > > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
> > > +void kvm_eoifd_release(struct kvm *kvm);
> > >  
> > >  #else
> > >  
> > > @@ -853,6 +859,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> > >  	return -ENOSYS;
> > >  }
> > >  
> > > +static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > +{
> > > +	return -ENOSYS;
> > > +}
> > > +
> > > +static inline void kvm_eoifd_release(struct kvm *kvm) {}
> > > +
> > >  #endif /* CONFIG_HAVE_KVM_EVENTFD */
> > >  
> > >  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > index 92aa5ba..2bc9768 100644
> > > --- a/virt/kvm/eventfd.c
> > > +++ b/virt/kvm/eventfd.c
> > > @@ -62,8 +62,7 @@ static void _irq_source_put(struct _irq_source *source)
> > >  		kref_put(&source->kref, _irq_source_release);
> > >  }
> > >  
> > > -static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > > -_irq_source_get(struct _irq_source *source)
> > > +static struct _irq_source *_irq_source_get(struct _irq_source *source)
> > >  {
> > >  	if (source)
> > >  		kref_get(&source->kref);
> > > @@ -119,6 +118,39 @@ struct _irqfd {
> > >  	struct work_struct shutdown;
> > >  };
> > >  
> > > +static struct _irqfd *_irqfd_fdget(struct kvm *kvm, int fd)
> > > +{
> > > +	struct eventfd_ctx *eventfd;
> > > +	struct _irqfd *tmp, *irqfd = NULL;
> > > +
> > > +	eventfd = eventfd_ctx_fdget(fd);
> > > +	if (IS_ERR(eventfd))
> > > +		return (struct _irqfd *)eventfd;
> > > +
> > > +	spin_lock_irq(&kvm->irqfds.lock);
> > > +
> > > +	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > +		if (tmp->eventfd == eventfd) {
> > > +			irqfd = tmp;
> > > +			break;
> > > +		}
> > > +	}
> > > +
> > > +	spin_unlock_irq(&kvm->irqfds.lock);
> > > +
> > > +	if (!irqfd) {
> > > +		eventfd_ctx_put(eventfd);
> > > +		return ERR_PTR(-ENODEV);
> > > +	}
> > 
> > Can irqfd get deassigned at this point?
> > Could one way to fix be to take eoifd.lock in kvm_irqfd?
> 
> Yeah, I'm concerned this isn't a sufficient reference to the _irqfd.
> I'll play with locking; I think we actually want the reverse, holding
> irqfd.lock for the search through getting a reference to the
> _irq_source.
> 
> > > +
> > > +	return irqfd;
> > > +}
> > > +
> > > +static void _irqfd_put(struct _irqfd *irqfd)
> > > +{
> > > +	eventfd_ctx_put(irqfd->eventfd);
> > > +}
> > > +
> > >  static struct workqueue_struct *irqfd_cleanup_wq;
> > >  
> > >  static void
> > > @@ -387,6 +419,8 @@ kvm_eventfd_init(struct kvm *kvm)
> > >  	spin_lock_init(&kvm->irqfds.lock);
> > >  	INIT_LIST_HEAD(&kvm->irqfds.items);
> > >  	INIT_LIST_HEAD(&kvm->ioeventfds);
> > > +	spin_lock_init(&kvm->eoifds.lock);
> > > +	INIT_LIST_HEAD(&kvm->eoifds.items);
> > >  }
> > >  
> > >  /*
> > > @@ -753,3 +787,173 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> > >  
> > >  	return kvm_assign_ioeventfd(kvm, args);
> > >  }
> > > +
> > > +/*
> > > + * --------------------------------------------------------------------
> > > + *  eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
> > > + *
> > > + *  userspace can register GSIs with an eventfd for receiving
> > > + *  notification when an EOI occurs.
> > > + * --------------------------------------------------------------------
> > > + */
> > > +
> > > +struct _eoifd {
> > > +	struct eventfd_ctx *eventfd;
> > > +	struct _irq_source *source; /* for de-asserting level irqfd */
> > > +	struct kvm *kvm;
> > > +	struct kvm_irq_ack_notifier notifier;
> > > +	struct list_head list;
> > > +};
> > > +
> > > +static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
> > > +{
> > > +	struct _eoifd *eoifd;
> > > +
> > > +	eoifd = container_of(notifier, struct _eoifd, notifier);
> > > +
> > > +	/*
> > > +	 * If the eoifd is tied to a level irqfd we de-assert it here.
> > > +	 * The user is responsible for re-asserting it if their device
> > > +	 * still needs attention.  For notification-only, skip this.
> > > +	 */
> > > +	if (eoifd->source)
> > > +		kvm_set_irq(eoifd->kvm, eoifd->source->id,
> > > +			    eoifd->notifier.gsi, 0);
> > > +
> > 
> > 
> > This will fire even if the source in question did not assert an
> > interrupt in the first place.  This won't scale well if
> > the interrupt is shared. Don't we need to filter by source state?
> 
> Extra spurious EOIs are better than dropped EOIs.  However, it does seem
> we could add an atomic_t to _irq_source tracking the assertion state of
> our source id.  I'll experiment with that.

I don't see why we need more state. kvm_irq_line_state already
uses atomics for bits in irq_state.

> > > +	eventfd_signal(eoifd->eventfd, 1);
> > > +}
> > > +
> > > +static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > +{
> > > +	struct eventfd_ctx *eventfd;
> > > +	struct _eoifd *eoifd, *tmp;
> > > +	struct _irq_source *source = NULL;
> > > +
> > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > +		struct _irqfd *irqfd = _irqfd_fdget(kvm, args->irqfd);
> > > +		if (IS_ERR(irqfd))
> > > +			return PTR_ERR(irqfd);
> > > +
> > > +		if (irqfd->gsi != args->gsi) {
> > > +			_irqfd_put(irqfd);
> > > +			return -EINVAL;
> > > +		}
> > > +
> > 
> > This requirement does not seem to be documented.
> > If we require this, do we need to pass in gsi at all?
> 
> It seemed obvious to me that when using KVM_EOIFD_FLAG_LEVEL_IRQFD the
> gsi needs to match that used for the kvm_irqfd.  Suppose I should never
> assume anything is obvious in an API spec.  I'll add a comment.  I think
> it would make the user interface more confusing to specify that gsi is
> not required when using a level irqfd,

Maybe just make it a union: one or the other?


> besides it gives us an easy sanity check.

Well, pass fewer parameters and there will be fewer bugs and less stuff
to check :)

> > > +		source = _irq_source_get(irqfd->source);
> > > +		_irqfd_put(irqfd);
> > > +		if (!source)
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	eventfd = eventfd_ctx_fdget(args->fd);
> > > +	if (IS_ERR(eventfd)) {
> > > +		_irq_source_put(source);
> > > +		return PTR_ERR(eventfd);
> > > +	}
> > > +
> > > +	eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
> > > +	if (!eoifd) {
> > > +		_irq_source_put(source);
> > > +		eventfd_ctx_put(eventfd);
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	INIT_LIST_HEAD(&eoifd->list);
> > > +	eoifd->kvm = kvm;
> > > +	eoifd->eventfd = eventfd;
> > > +	eoifd->source = source;
> > > +	eoifd->notifier.gsi = args->gsi;
> > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > +
> > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > +
> > > +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> > > +		if (eoifd->eventfd != tmp->eventfd)
> > > +			continue;
> > > +
> > > +		spin_unlock_irq(&kvm->eoifds.lock);
> > > +		_irq_source_put(source);
> > > +		eventfd_ctx_put(eventfd);
> > > +		kfree(eoifd);
> > > +		return -EBUSY;
> > 
> > 
> > Please rewrite this function using labels for error handling.
> 
> Ok
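
One possible shape for that label-based cleanup, sketched against the function
exactly as it was posted (illustrative only, not the actual respin):

static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
{
        struct eventfd_ctx *eventfd;
        struct _eoifd *eoifd, *tmp;
        struct _irq_source *source = NULL;
        int ret;

        if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
                struct _irqfd *irqfd = _irqfd_fdget(kvm, args->irqfd);
                if (IS_ERR(irqfd))
                        return PTR_ERR(irqfd);

                ret = -EINVAL;
                if (irqfd->gsi != args->gsi) {
                        _irqfd_put(irqfd);
                        goto out;
                }

                source = _irq_source_get(irqfd->source);
                _irqfd_put(irqfd);
                if (!source)
                        goto out;
        }

        eventfd = eventfd_ctx_fdget(args->fd);
        if (IS_ERR(eventfd)) {
                ret = PTR_ERR(eventfd);
                goto out_source;
        }

        ret = -ENOMEM;
        eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
        if (!eoifd)
                goto out_eventfd;

        INIT_LIST_HEAD(&eoifd->list);
        eoifd->kvm = kvm;
        eoifd->eventfd = eventfd;
        eoifd->source = source;
        eoifd->notifier.gsi = args->gsi;
        eoifd->notifier.irq_acked = eoifd_event;

        spin_lock_irq(&kvm->eoifds.lock);

        ret = -EBUSY;
        list_for_each_entry(tmp, &kvm->eoifds.items, list) {
                if (eoifd->eventfd == tmp->eventfd) {
                        spin_unlock_irq(&kvm->eoifds.lock);
                        goto out_free;
                }
        }

        list_add_tail(&eoifd->list, &kvm->eoifds.items);

        spin_unlock_irq(&kvm->eoifds.lock);

        kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);

        return 0;

out_free:
        kfree(eoifd);
out_eventfd:
        eventfd_ctx_put(eventfd);
out_source:
        _irq_source_put(source);        /* NULL-safe, mirrors the posted helper */
out:
        return ret;
}

The semantics are unchanged from the posted version; only the unwind paths are
collapsed onto labels.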
> 
> > > +	}
> > > +
> > > +	list_add_tail(&eoifd->list, &kvm->eoifds.items);
> > > +
> > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > +
> > > +	kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void eoifd_deactivate(struct kvm *kvm, struct _eoifd *eoifd)
> > 
> > Pls find a less confusing name for this.
> > It just frees the eoifd. No tricky deactivation here (unlike irqfd).
> 
> Uh, confusing?  It does actually deactivate it by unregistering the irq
> ack notifier.

It also destroys it completely. To me deactivate implies "make inactive
but do not destroy".

>  I was attempting to be consistent with the function
> naming by following irqfd_deactivate, I didn't realize there was a
> complexity requirement associated with the name.

irqfd_deactivate is named like this because it does not free the object:
it merely deactivates it and queues up the free job.
And that in turn is because of tricky locking issues specific to irqfd
that have to do with the fact kvm reads it. eoifd is written
from kvm so no such problem.


> > > +{
> > > +	kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
> > > +	_irq_source_put(eoifd->source);
> > > +	eventfd_ctx_put(eoifd->eventfd);
> > > +	kfree(eoifd);
> > > +}
> > > +
> > > +void kvm_eoifd_release(struct kvm *kvm)
> > > +{
> > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > +
> > > +	while (!list_empty(&kvm->eoifds.items)) {
> > > +		struct _eoifd *eoifd;
> > > +
> > > +		eoifd = list_first_entry(&kvm->eoifds.items,
> > > +					 struct _eoifd, list);
> > > +		list_del(&eoifd->list);
> > > +
> > > +		/* Drop spinlocks since eoifd_deactivate can sleep */
> > > +		spin_unlock_irq(&kvm->eoifds.lock);
> > > +		eoifd_deactivate(kvm, eoifd);
> > > +		spin_lock_irq(&kvm->eoifds.lock);
> > > +	}
> > > +
> > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > +}
> > > +
> > > +static int kvm_deassign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > +{
> > > +	struct eventfd_ctx *eventfd;
> > > +	struct _eoifd *tmp, *eoifd = NULL;
> > > +
> > > +	eventfd = eventfd_ctx_fdget(args->fd);
> > > +	if (IS_ERR(eventfd))
> > > +		return PTR_ERR(eventfd);
> > > +
> > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > +
> > > +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> > > +		if (tmp->eventfd == eventfd && tmp->notifier.gsi == args->gsi) {
> > 
> > This is repeated several times.
> > looks like eoifd_collision function is called for.
> 
> The tests aren't quite identical.  Here we test eventfd and gsi, the
> latter primarily to validate kvm_eoifd parameters.  Further above we
> test that eventfd is unique on addition without comparing gsi.  Both of
> these match the existing precedent set by irqfd.  With _irqfd_fdget we
> do have two searches of the irqfds.items list comparing only eventfd...
> I'll see about making a trivial helper there.  Thanks,
> 
> Alex

I hope I clarified that irqfd is not a good match. Pls make it
behave like ioeventfd instead.

> > > +			eoifd = tmp;
> > > +			list_del(&eoifd->list);
> > > +			break;
> > > +		}
> > > +	}
> > > +
> > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > +
> > > +	eventfd_ctx_put(eventfd);
> > > +
> > > +	if (!eoifd)
> > > +		return -ENOENT;
> > > +
> > > +	eoifd_deactivate(kvm, eoifd);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > +{
> > > +	if (args->flags & ~(KVM_EOIFD_FLAG_DEASSIGN |
> > > +			    KVM_EOIFD_FLAG_LEVEL_IRQFD))
> > > +		return -EINVAL;
> > > +
> > > +	if (args->flags & KVM_EOIFD_FLAG_DEASSIGN)
> > > +		return kvm_deassign_eoifd(kvm, args);
> > > +
> > > +	return kvm_assign_eoifd(kvm, args);
> > > +}
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index b4ad14cc..5b41df1 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -620,6 +620,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
> > >  
> > >  	kvm_irqfd_release(kvm);
> > >  
> > > +	kvm_eoifd_release(kvm);
> > > +
> > >  	kvm_put_kvm(kvm);
> > >  	return 0;
> > >  }
> > > @@ -2093,6 +2095,15 @@ static long kvm_vm_ioctl(struct file *filp,
> > >  		break;
> > >  	}
> > >  #endif
> > > +	case KVM_EOIFD: {
> > > +		struct kvm_eoifd data;
> > > +
> > > +		r = -EFAULT;
> > > +		if (copy_from_user(&data, argp, sizeof data))
> > > +			goto out;
> > > +		r = kvm_eoifd(kvm, &data);
> > > +		break;
> > > +	}
> > >  	default:
> > >  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > >  		if (r == -ENOTTY)
> 
> 


* Re: [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-05 15:53       ` Michael S. Tsirkin
@ 2012-07-09 20:35         ` Alex Williamson
  2012-07-13 13:36           ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2012-07-09 20:35 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Thu, 2012-07-05 at 18:53 +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 04, 2012 at 10:24:54PM -0600, Alex Williamson wrote:
> > On Wed, 2012-07-04 at 17:00 +0300, Michael S. Tsirkin wrote:
> > > On Tue, Jul 03, 2012 at 01:21:29PM -0600, Alex Williamson wrote:
> > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > written for a specified irqchip pin.  By default this is a simple
> > > > notification, but we can also tie the eoifd to a level irqfd, which
> > > > enables the irqchip pin to be automatically de-asserted on EOI.
> > > > This mode is particularly useful for device-assignment applications
> > > > where the unmask and notify triggers a hardware unmask.  The default
> > > > mode is most applicable to simple notify with no side-effects for
> > > > userspace usage, such as Qemu.
> > > > 
> > > > Here we make use of the reference counting of the _irq_source
> > > > object allowing us to share it with an irqfd and cleanup regardless
> > > > of the release order.
> > > > 
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > ---
> > > > 
> > > >  Documentation/virtual/kvm/api.txt |   21 ++++
> > > >  arch/x86/kvm/x86.c                |    1 
> > > >  include/linux/kvm.h               |   14 ++
> > > >  include/linux/kvm_host.h          |   13 ++
> > > >  virt/kvm/eventfd.c                |  208 +++++++++++++++++++++++++++++++++++++
> > > >  virt/kvm/kvm_main.c               |   11 ++
> > > >  6 files changed, 266 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > index c7267d5..a38af14 100644
> > > > --- a/Documentation/virtual/kvm/api.txt
> > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > @@ -1988,6 +1988,27 @@ to independently assert level interrupts.  The KVM_IRQFD_FLAG_LEVEL
> > > >  is only necessary on setup, teardown is identical to that above.
> > > >  KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > >  
> > > > +4.77 KVM_EOIFD
> > > > +
> > > > +Capability: KVM_CAP_EOIFD
> > > 
> > > Maybe add a specific capability KVM_CAP_EOIFD_LEVEL_IRQFD too?
> > 
> > Good idea, allows them to be split later.
> > 
> > > > +Architectures: x86
> > > > +Type: vm ioctl
> > > > +Parameters: struct kvm_eoifd (in)
> > > > +Returns: 0 on success, -1 on error
> > > > +
> > > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > > +through an eventfd.  kvm_eoifd.fd specifies the eventfd used for
> > > > +notification and kvm_eoifd.gsi specifies the irqchip pin, similar to
> > > > +KVM_IRQFD.  The eoifd is removed using the KVM_EOIFD_FLAG_DEASSIGN
> > > > +flag, specifying both kvm_eoifd.fd and kvm_eoifd.gsi.
> > > 
> > > This text reads like it would give you EOIs for any GSI,
> > > but you haven't actually implemented this for edge GSIs - and
> > > making it work would bloat the datapath for fast (edge) interrupts.
> > 
> > I do allow you to register any GSI, but you won't be getting EOIs unless
> > it's operating in level triggered mode.  Perhaps it's best to specify it
> > as unsupported and let some future patch create a new capability if
> > support is added.  I'll add a comment.
> > 
> > > What's the intended use of this part of the interface? qemu with
> > > irqchip disabled?
> > 
> > VFIO should not be dependent on KVM, therefore when kvm is not enabled
> > we need to add an interface in qemu for devices to be notified of eoi.
> >
> > This doesn't currently exist.  VFIO can take additional advantage of
> > irqchip when it is enabled, thus the interface below.  However, I don't
> > feel I can propose an eoi notifier in qemu that stops working as soon as
> > irqchip is enabled, even if I'm the only user.  This theoretical qemu
> > eoi notifier could then use the above when irqchip is enabled.
> 
> Well internal qemu APIs are qemu's problem and can be addressed there.
> For example, can we make it mimic our interface: make qemu EOI notifier
> accept an object that includes qemu_irq without irqchip and irqfd with?

So the qemu API changes based on whether irqchip is enabled or not?!

> In other words, adding an interface with no users looks weird.

If we could ever finish this patch, I could start working on the
user... ;)

> > > > +
> > > > +The KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that the provided
> > > > +kvm_eoifd structure includes a valid kvm_eoifd.irqfd file descriptor
> > > > +for a level irqfd configured using the KVM_IRQFD_FLAG_LEVEL flag.
> > > > +In this mode the level interrupt is de-asserted prior to EOI eventfd
> > > > +notification.  The KVM_EOIFD_FLAG_LEVEL_IRQFD is only necessary on
> > > > +setup, teardown is identical to that above.
> > > 
> > > It seems a single fd can not be used multiple times for
> > > different irqfds? In that case:
> > > 1. Need to document this limitation
> > 
> > Ok
> > 
> > > 2. This differs from other notifiers e.g.  ioeventfd.
> > 
> > But the same as an irqfd.
> 
> However irqfd is about interrupting guest. eoifd is more like ioeventfd
> really: kvm writes ioeventfd/eoifd but reads irqfd.
> 
> >  Neither use case I'm thinking of needs to
> > allow eventfds to be re-used and knowing that an eventfd is unique on
> > our list makes matching it much easier.  For instance, if we have an
> > eoifd bound to a level irqfd and it's being de-assigned, we'd have to
> > require the eoifd is de-assigned before the irqfd so that we can
> > validate matching eventfd, gsi, and irqfd.  That gets very racy when
> > userspace aborts.
> 
> Why can't eoifd simply keep a reference to irqfd? Or register to
> the hangup event.

An irqfd is just a file descriptor and we can't block close().  I was
trying to use struct _irq_source as the reference counted object because
I didn't see any need to keep the irqfd around.

> > > Pls note: if you decide to remove this limitation, we need
> > > to add some other limit on the # of EOIFDs that VM can have
> > > as won't be limited by # of fds anymore.
> > 
> > I don't plan on changing this unless someone has an argument why it
> > would be useful.
> 
> For example you can use same eventfd as ioeventfd
> and eoifd.  Why not two eoifds?
> 
> I need to think about specifics more but it just breaks how eventfd is
> normally used. It should be an aggregator that can count any number of writers and
> support 1 reader.
> 
> Let me draw this:
> 
> W1 ---->\
> W2 ----> +--> eventfd ----> R
> W3 ---->/
> 
> 
> This explains the difference: in irqfd kvm reads eventfd so only 1 is
> supported. In ioeventfd (and eoifd) kvm is the writer, so any number
> should be supported.

Yes, removal is the problem.  If I allow the same eventfd to fire for
multiple EOIs, then I need to match eventfd and irqfd on de-assign.
That means I need to cache some bit of context about the irqfd in case
it's closed before the eoifd.  Perhaps the _eoifd holds a reference to
_irqfd.eventfd for that.

> > > > +
> > > >  5. The kvm_run structure
> > > >  ------------------------
> > > >  
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 80bed07..62d6eca 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -2149,6 +2149,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > >  	case KVM_CAP_PCI_2_3:
> > > >  	case KVM_CAP_KVMCLOCK_CTRL:
> > > >  	case KVM_CAP_IRQFD_LEVEL:
> > > > +	case KVM_CAP_EOIFD:
> > > >  		r = 1;
> > > >  		break;
> > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index b2e6e4f..7567e7d 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -619,6 +619,7 @@ struct kvm_ppc_smmu_info {
> > > >  #define KVM_CAP_S390_COW 79
> > > >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > >  #define KVM_CAP_IRQFD_LEVEL 81
> > > > +#define KVM_CAP_EOIFD 82
> > > >  
> > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > >  
> > > > @@ -694,6 +695,17 @@ struct kvm_irqfd {
> > > >  	__u8  pad[20];
> > > >  };
> > > >  
> > > > +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
> > > > +#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
> > > > +
> > > > +struct kvm_eoifd {
> > > > +	__u32 fd;
> > > > +	__u32 gsi;
> > > > +	__u32 flags;
> > > > +	__u32 irqfd;
> > > > +	__u8 pad[16];
> > > > +};
> > > > +
> > > >  struct kvm_clock_data {
> > > >  	__u64 clock;
> > > >  	__u32 flags;
> > > > @@ -834,6 +846,8 @@ struct kvm_s390_ucas_mapping {
> > > >  #define KVM_PPC_GET_SMMU_INFO	  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
> > > >  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
> > > >  #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
> > > > +/* Available with KVM_CAP_EOIFD */
> > > > +#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
> > > >  
> > > >  /*
> > > >   * ioctls for vcpu fds
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index ae3b426..83472eb 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -285,6 +285,10 @@ struct kvm {
> > > >  		struct list_head  items;
> > > >  	} irqfds;
> > > >  	struct list_head ioeventfds;
> > > > +	struct {
> > > > +		spinlock_t        lock;
> > > 
> > > Can't this be a mutex? It seems that code will be much simpler
> > > if you can just take the mutex, do everything and unlock.
> > 
> > Yeah, looks like it could.
> > 
> > > For example you could prevent changing eoifds while irqfds
> > > are changed.
> > 
> > I'm not sure we need to be that strict, but we could be sure to get a
> > reference to the _source_id using the irqfd lock that way... actually we
> > could do that in the current locking scheme too.
> 
> Just a suggestion.
> 
> > > > +		struct list_head  items;
> > > > +	} eoifds;
> > > >  #endif
> > > >  	struct kvm_vm_stat stat;
> > > >  	struct kvm_arch arch;
> > > > @@ -828,6 +832,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
> > > >  void kvm_irqfd_release(struct kvm *kvm);
> > > >  void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
> > > >  int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
> > > > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
> > > > +void kvm_eoifd_release(struct kvm *kvm);
> > > >  
> > > >  #else
> > > >  
> > > > @@ -853,6 +859,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> > > >  	return -ENOSYS;
> > > >  }
> > > >  
> > > > +static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > > +{
> > > > +	return -ENOSYS;
> > > > +}
> > > > +
> > > > +static inline void kvm_eoifd_release(struct kvm *kvm) {}
> > > > +
> > > >  #endif /* CONFIG_HAVE_KVM_EVENTFD */
> > > >  
> > > >  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> > > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > > index 92aa5ba..2bc9768 100644
> > > > --- a/virt/kvm/eventfd.c
> > > > +++ b/virt/kvm/eventfd.c
> > > > @@ -62,8 +62,7 @@ static void _irq_source_put(struct _irq_source *source)
> > > >  		kref_put(&source->kref, _irq_source_release);
> > > >  }
> > > >  
> > > > -static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > > > -_irq_source_get(struct _irq_source *source)
> > > > +static struct _irq_source *_irq_source_get(struct _irq_source *source)
> > > >  {
> > > >  	if (source)
> > > >  		kref_get(&source->kref);
> > > > @@ -119,6 +118,39 @@ struct _irqfd {
> > > >  	struct work_struct shutdown;
> > > >  };
> > > >  
> > > > +static struct _irqfd *_irqfd_fdget(struct kvm *kvm, int fd)
> > > > +{
> > > > +	struct eventfd_ctx *eventfd;
> > > > +	struct _irqfd *tmp, *irqfd = NULL;
> > > > +
> > > > +	eventfd = eventfd_ctx_fdget(fd);
> > > > +	if (IS_ERR(eventfd))
> > > > +		return (struct _irqfd *)eventfd;
> > > > +
> > > > +	spin_lock_irq(&kvm->irqfds.lock);
> > > > +
> > > > +	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > > +		if (tmp->eventfd == eventfd) {
> > > > +			irqfd = tmp;
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	spin_unlock_irq(&kvm->irqfds.lock);
> > > > +
> > > > +	if (!irqfd) {
> > > > +		eventfd_ctx_put(eventfd);
> > > > +		return ERR_PTR(-ENODEV);
> > > > +	}
> > > 
> > > Can irqfd get deassigned at this point?
> > > Could one way to fix be to take eoifd.lock in kvm_irqfd?
> > 
> > Yeah, I'm concerned this isn't a sufficient reference to the _irqfd.
> > I'll play with locking; I think we actually want the reverse, holding
> > irqfd.lock for the search through getting a reference to the
> > _irq_source.
> > 
> > > > +
> > > > +	return irqfd;
> > > > +}
> > > > +
> > > > +static void _irqfd_put(struct _irqfd *irqfd)
> > > > +{
> > > > +	eventfd_ctx_put(irqfd->eventfd);
> > > > +}
> > > > +
> > > >  static struct workqueue_struct *irqfd_cleanup_wq;
> > > >  
> > > >  static void
> > > > @@ -387,6 +419,8 @@ kvm_eventfd_init(struct kvm *kvm)
> > > >  	spin_lock_init(&kvm->irqfds.lock);
> > > >  	INIT_LIST_HEAD(&kvm->irqfds.items);
> > > >  	INIT_LIST_HEAD(&kvm->ioeventfds);
> > > > +	spin_lock_init(&kvm->eoifds.lock);
> > > > +	INIT_LIST_HEAD(&kvm->eoifds.items);
> > > >  }
> > > >  
> > > >  /*
> > > > @@ -753,3 +787,173 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> > > >  
> > > >  	return kvm_assign_ioeventfd(kvm, args);
> > > >  }
> > > > +
> > > > +/*
> > > > + * --------------------------------------------------------------------
> > > > + *  eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
> > > > + *
> > > > + *  userspace can register GSIs with an eventfd for receiving
> > > > + *  notification when an EOI occurs.
> > > > + * --------------------------------------------------------------------
> > > > + */
> > > > +
> > > > +struct _eoifd {
> > > > +	struct eventfd_ctx *eventfd;
> > > > +	struct _irq_source *source; /* for de-asserting level irqfd */
> > > > +	struct kvm *kvm;
> > > > +	struct kvm_irq_ack_notifier notifier;
> > > > +	struct list_head list;
> > > > +};
> > > > +
> > > > +static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
> > > > +{
> > > > +	struct _eoifd *eoifd;
> > > > +
> > > > +	eoifd = container_of(notifier, struct _eoifd, notifier);
> > > > +
> > > > +	/*
> > > > +	 * If the eoifd is tied to a level irqfd we de-assert it here.
> > > > +	 * The user is responsible for re-asserting it if their device
> > > > +	 * still needs attention.  For notification-only, skip this.
> > > > +	 */
> > > > +	if (eoifd->source)
> > > > +		kvm_set_irq(eoifd->kvm, eoifd->source->id,
> > > > +			    eoifd->notifier.gsi, 0);
> > > > +
> > > 
> > > 
> > > This will fire even if the source in question did not assert an
> > > interrupt in the first place.  This won't scale well if
> > > the interrupt is shared. Don't we need to filter by source state?
> > 
> > Extra spurious EOIs are better than dropped EOIs.  However, it does seem
> > we could add an atomic_t to _irq_source tracking the assertion state of
> > our source id.  I'll experiment with that.
> 
> I don't see why we need more state. kvm_irq_line_state already
> uses atomics for bits in irq_state.

kvm_irq_line_state:
(a) static inline in irq_comm.c
(b) changes bits
(c) requires irq_state from ???
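
For reference, the helper in question is roughly of the following shape
(paraphrased, so treat the details as approximate); the open question above is
where the irq_state word would come from for an irqfd-owned source:

/* Paraphrase of the kind of helper being discussed, not the exact code. */
static inline int kvm_irq_line_state(unsigned long *irq_state,
                                     int irq_source_id, int level)
{
        /* Atomically set/clear this source's bit, then report the OR'd line level. */
        if (level)
                set_bit(irq_source_id, irq_state);
        else
                clear_bit(irq_source_id, irq_state);

        return !!(*irq_state);
}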


> > > > +	eventfd_signal(eoifd->eventfd, 1);
> > > > +}
> > > > +
> > > > +static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > > +{
> > > > +	struct eventfd_ctx *eventfd;
> > > > +	struct _eoifd *eoifd, *tmp;
> > > > +	struct _irq_source *source = NULL;
> > > > +
> > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > +		struct _irqfd *irqfd = _irqfd_fdget(kvm, args->irqfd);
> > > > +		if (IS_ERR(irqfd))
> > > > +			return PTR_ERR(irqfd);
> > > > +
> > > > +		if (irqfd->gsi != args->gsi) {
> > > > +			_irqfd_put(irqfd);
> > > > +			return -EINVAL;
> > > > +		}
> > > > +
> > > 
> > > This requirement does not seem to be documented.
> > > If we require this, do we need to pass in gsi at all?
> > 
> > It seemed obvious to me that when using KVM_EOIFD_FLAG_LEVEL_IRQFD the
> > gsi needs to match that used for the kvm_irqfd.  Suppose I should never
> > assume anything is obvious in an API spec.  I'll add a comment.  I think
> > it would make the user interface more confusing to specify that gsi is
> > not required when using a level irqfd,
> 
> Maybe just make it a union: one or the other?
> 
> 
> > besides it gives us an easy sanity check.
> 
> Well, pass fewer parameters and there will be fewer bugs and less stuff
> to check :)
> 
> > > > +		source = _irq_source_get(irqfd->source);
> > > > +		_irqfd_put(irqfd);
> > > > +		if (!source)
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	eventfd = eventfd_ctx_fdget(args->fd);
> > > > +	if (IS_ERR(eventfd)) {
> > > > +		_irq_source_put(source);
> > > > +		return PTR_ERR(eventfd);
> > > > +	}
> > > > +
> > > > +	eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
> > > > +	if (!eoifd) {
> > > > +		_irq_source_put(source);
> > > > +		eventfd_ctx_put(eventfd);
> > > > +		return -ENOMEM;
> > > > +	}
> > > > +
> > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > +	eoifd->kvm = kvm;
> > > > +	eoifd->eventfd = eventfd;
> > > > +	eoifd->source = source;
> > > > +	eoifd->notifier.gsi = args->gsi;
> > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > +
> > > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > > +
> > > > +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> > > > +		if (eoifd->eventfd != tmp->eventfd)
> > > > +			continue;
> > > > +
> > > > +		spin_unlock_irq(&kvm->eoifds.lock);
> > > > +		_irq_source_put(source);
> > > > +		eventfd_ctx_put(eventfd);
> > > > +		kfree(eoifd);
> > > > +		return -EBUSY;
> > > 
> > > 
> > > Please rewrite this function using labels for error handling.
> > 
> > Ok
> > 
> > > > +	}
> > > > +
> > > > +	list_add_tail(&eoifd->list, &kvm->eoifds.items);
> > > > +
> > > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > > +
> > > > +	kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +static void eoifd_deactivate(struct kvm *kvm, struct _eoifd *eoifd)
> > > 
> > > Pls find a less confusing name for this.
> > > It just frees the eoifd. No tricky deactivation here (unlike irqfd).
> > 
> > Uh, confusing?  It does actually deactivate it by unregistering the irq
> > ack notifier.
> 
> It also destroys it completely. To me deactivate implies "make inactive
> but do not destroy".

Umm, but as you note irqfd_deactivate does result in destroying it.

> >  I was attempting to be consistent with the function
> > naming by following irqfd_deactivate, I didn't realize there was a
> > complexity requirement associated with the name.
> 
> irqfd_deactivate is named like this because it does not free the object:
> it merely deactivates it and queues up the free job.
> And that in turn is because of tricky locking issues specific to irqfd
> that have to do with the fact kvm reads it. eoifd is written
> from kvm so no such problem.

I fail to see the logic that deactivating and *queuing* a free job is
significantly different from deactivating and actually freeing.  In
either case, the object is gone at the end as far as the user of the
foo_deactivate function is concerned.  Maybe the irqfd function is
poorly named too.

> > > > +{
> > > > +	kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
> > > > +	_irq_source_put(eoifd->source);
> > > > +	eventfd_ctx_put(eoifd->eventfd);
> > > > +	kfree(eoifd);
> > > > +}
> > > > +
> > > > +void kvm_eoifd_release(struct kvm *kvm)
> > > > +{
> > > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > > +
> > > > +	while (!list_empty(&kvm->eoifds.items)) {
> > > > +		struct _eoifd *eoifd;
> > > > +
> > > > +		eoifd = list_first_entry(&kvm->eoifds.items,
> > > > +					 struct _eoifd, list);
> > > > +		list_del(&eoifd->list);
> > > > +
> > > > +		/* Drop spinlocks since eoifd_deactivate can sleep */
> > > > +		spin_unlock_irq(&kvm->eoifds.lock);
> > > > +		eoifd_deactivate(kvm, eoifd);
> > > > +		spin_lock_irq(&kvm->eoifds.lock);
> > > > +	}
> > > > +
> > > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > > +}
> > > > +
> > > > +static int kvm_deassign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > > +{
> > > > +	struct eventfd_ctx *eventfd;
> > > > +	struct _eoifd *tmp, *eoifd = NULL;
> > > > +
> > > > +	eventfd = eventfd_ctx_fdget(args->fd);
> > > > +	if (IS_ERR(eventfd))
> > > > +		return PTR_ERR(eventfd);
> > > > +
> > > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > > +
> > > > +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> > > > +		if (tmp->eventfd == eventfd && tmp->notifier.gsi == args->gsi) {
> > > 
> > > This is repeated several times.
> > > looks like eoifd_collision function is called for.
> > 
> > The tests aren't quite identical.  Here we test eventfd and gsi, the
> > latter primarily to validate kvm_eoifd parameters.  Further above we
> > test that eventfd is unique on addition without comparing gsi.  Both of
> > these match the existing precedent set by irqfd.  With _irqfd_fdget we
> > do have two searches of the irqfds.items list comparing only eventfd...
> > I'll see about making a trivial helper there.  Thanks,
> > 
> > Alex
> 
> I hope I clarified that irqfd is not a good match. Pls make it
> behave like ioeventfd instead.
> 
> > > > +			eoifd = tmp;
> > > > +			list_del(&eoifd->list);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > > +
> > > > +	eventfd_ctx_put(eventfd);
> > > > +
> > > > +	if (!eoifd)
> > > > +		return -ENOENT;
> > > > +
> > > > +	eoifd_deactivate(kvm, eoifd);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > > +{
> > > > +	if (args->flags & ~(KVM_EOIFD_FLAG_DEASSIGN |
> > > > +			    KVM_EOIFD_FLAG_LEVEL_IRQFD))
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (args->flags & KVM_EOIFD_FLAG_DEASSIGN)
> > > > +		return kvm_deassign_eoifd(kvm, args);
> > > > +
> > > > +	return kvm_assign_eoifd(kvm, args);
> > > > +}
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index b4ad14cc..5b41df1 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -620,6 +620,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
> > > >  
> > > >  	kvm_irqfd_release(kvm);
> > > >  
> > > > +	kvm_eoifd_release(kvm);
> > > > +
> > > >  	kvm_put_kvm(kvm);
> > > >  	return 0;
> > > >  }
> > > > @@ -2093,6 +2095,15 @@ static long kvm_vm_ioctl(struct file *filp,
> > > >  		break;
> > > >  	}
> > > >  #endif
> > > > +	case KVM_EOIFD: {
> > > > +		struct kvm_eoifd data;
> > > > +
> > > > +		r = -EFAULT;
> > > > +		if (copy_from_user(&data, argp, sizeof data))
> > > > +			goto out;
> > > > +		r = kvm_eoifd(kvm, &data);
> > > > +		break;
> > > > +	}
> > > >  	default:
> > > >  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > > >  		if (r == -ENOTTY)
> > 
> > 





* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-03 19:21 [PATCH v3 0/2] kvm: level irqfd and new eoifd Alex Williamson
  2012-07-03 19:21 ` [PATCH v3 1/2] kvm: Extend irqfd to support level interrupts Alex Williamson
  2012-07-03 19:21 ` [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs Alex Williamson
@ 2012-07-11  9:53 ` Avi Kivity
  2012-07-11 10:18   ` Jan Kiszka
  2 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2012-07-11  9:53 UTC (permalink / raw)
  To: Alex Williamson; +Cc: mst, gleb, kvm, linux-kernel, jan.kiszka

On 07/03/2012 10:21 PM, Alex Williamson wrote:
> Here's the latest iteration of adding an interface to assert and
> de-assert level interrupts from external drivers like vfio.  These
> apply on top of the previous argument cleanup, documentation, and
> sanitization patches for irqfd.  It would be great to get this queued
> in next for linux 3.6.
> 
> I believe I've addressed all the previous comments, including fixing
> the locking problems in eoifd.  I've run this with lockdep adding
> and removing level irqfd/eoifd pairs without any problems.  Please
> let me know if there are any further comments.  Thanks,

Is there any performance justification for level irqfd?  Don't all
new/high bandwidth devices support msi, and this is just a legacy path?

eoifd does add new functionality (which parallels the existing ack
notifier usage in kvm device assignment); it's not just an optimization.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-11  9:53 ` [PATCH v3 0/2] kvm: level irqfd and new eoifd Avi Kivity
@ 2012-07-11 10:18   ` Jan Kiszka
  2012-07-11 10:49     ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Kiszka @ 2012-07-11 10:18 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, mst, gleb, kvm, linux-kernel

On 2012-07-11 11:53, Avi Kivity wrote:
> On 07/03/2012 10:21 PM, Alex Williamson wrote:
>> Here's the latest iteration of adding an interface to assert and
>> de-assert level interrupts from external drivers like vfio.  These
>> apply on top of the previous argument cleanup, documentation, and
>> sanitization patches for irqfd.  It would be great to get this queued
>> in next for linux 3.6.
>>
>> I believe I've addressed all the previous comments, including fixing
>> the locking problems in eoifd.  I've run this with lockdep adding
>> and removing level irqfd/eoifd pairs without any problems.  Please
>> let me know if there are any further comments.  Thanks,
> 
> Is there any performance justification for level irqfd?  Don't all
> new/high bandwidth devices support msi, and this is just a legacy path?

I think we've been there before, but the situation hasn't improved much:

Apparently, some GPUs still prefer INTx over MSI. Some wireless chipsets
too. And then there is not easily replaceable legacy hardware like old
telephony adapters or industrial I/O cards etc. that want this.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux




* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-11 10:18   ` Jan Kiszka
@ 2012-07-11 10:49     ` Avi Kivity
  2012-07-11 11:23       ` Jan Kiszka
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2012-07-11 10:49 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Alex Williamson, mst, gleb, kvm, linux-kernel

On 07/11/2012 01:18 PM, Jan Kiszka wrote:
> On 2012-07-11 11:53, Avi Kivity wrote:
>> On 07/03/2012 10:21 PM, Alex Williamson wrote:
>>> Here's the latest iteration of adding an interface to assert and
>>> de-assert level interrupts from external drivers like vfio.  These
>>> apply on top of the previous argument cleanup, documentation, and
>>> sanitization patches for irqfd.  It would be great to get this queued
>>> in next for linux 3.6.
>>>
>>> I believe I've addressed all the previous comments, including fixing
>>> the locking problems in eoifd.  I've run this with lockdep adding
>>> and removing level irqfd/eoifd pairs without any problems.  Please
>>> let me know if there are any further comments.  Thanks,
>> 
>> Is there any performance justification for level irqfd?  Don't all
>> new/high bandwidth devices support msi, and this is just a legacy path?
> 
> I think we've been there before, but the situation hasn't improved much:
> 
> Apparently, some GPUs still prefer INTx over MSI. Some wireless chipsets
> too. 

I'd appreciate a couple of examples for formality's sake.

> And then there is not easily replaceable legacy hardware like old
> telephony adapters or industrial I/O cards etc. that want this.

I imagine legacy hardware will live with the speed of routing through
qemu, when running on modern platforms.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-11 10:49     ` Avi Kivity
@ 2012-07-11 11:23       ` Jan Kiszka
  2012-07-11 11:51         ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Kiszka @ 2012-07-11 11:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Alex Williamson, mst, gleb, kvm, linux-kernel

On 2012-07-11 12:49, Avi Kivity wrote:
> On 07/11/2012 01:18 PM, Jan Kiszka wrote:
>> On 2012-07-11 11:53, Avi Kivity wrote:
>>> On 07/03/2012 10:21 PM, Alex Williamson wrote:
>>>> Here's the latest iteration of adding an interface to assert and
>>>> de-assert level interrupts from external drivers like vfio.  These
>>>> apply on top of the previous argument cleanup, documentation, and
>>>> sanitization patches for irqfd.  It would be great to get this queued
>>>> in next for linux 3.6.
>>>>
>>>> I believe I've addressed all the previous comments, including fixing
>>>> the locking problems in eoifd.  I've run this with lockdep adding
>>>> and removing level irqfd/eoifd pairs without any problems.  Please
>>>> let me know if there are any further comments.  Thanks,
>>>
>>> Is there any performance justification for level irqfd?  Don't all
>>> new/high bandwidth devices support msi, and this is just a legacy path?
>>
>> I think we've been there before, but the situation hasn't improved much:
>>
>> Apparently, some GPUs still prefer INTx over MSI. Some wireless chipsets
>> too. 
> 
> I'd appreciate a couple of examples for formality's sake.

From the top of my head: NVIDIA FX3700 (granted, legacy by now), Atheros
AR9287. For others, I need to check.

> 
>> And then there is not easily replaceable legacy hardware like old
>> telephony adapters or industrial I/O cards etc. that want this.
> 
> I imagine legacy hardware will live with the speed of routing through
> qemu, when running on modern platforms.

Just because it's "legacy" doesn't mean it's "low performance" and "low
interrupt rate".

We still have classic KVM device assignment to provide fast-path INTx.
But if we want to replace it midterm, I think it's necessary for VFIO to
be able to provide such a path as well.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-11 11:23       ` Jan Kiszka
@ 2012-07-11 11:51         ` Avi Kivity
  2012-07-11 19:57           ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2012-07-11 11:51 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Alex Williamson, mst, gleb, kvm, linux-kernel

On 07/11/2012 02:23 PM, Jan Kiszka wrote:
>> 
>> I'd appreciate a couple of examples for formality's sake.
> 
> From the top of my head: NVIDIA FX3700 (granted, legacy by now), Atheros
> AR9287. For others, I need to check.

Thanks.

>> 
>>> And then there is not easily replaceable legacy hardware like old
>>> telephony adapters or industrial I/O cards etc. that want this.
>> 
>> I imagine legacy hardware will live with the speed of routing through
>> qemu, when running on modern platforms.
> 
> Just because it's "legacy" doesn't mean it's "low performance" and "low
> interrupt rate".

I meant that it was used with lower throughput hardware, so the overhead
of routing through qemu will be masked by the improved host hardware.
But most of the improvement in hardware in recent years is the increase
in core/thread count.

> We still have classic KVM device assignment to provide fast-path INTx.
> But if we want to replace it midterm, I think it's necessary for VFIO to
> be able to provide such a path as well.

I would like VFIO to have no regressions vs. kvm device assignment,
except perhaps in uncommon corner cases.  So I agree.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-11 11:51         ` Avi Kivity
@ 2012-07-11 19:57           ` Alex Williamson
  2012-07-12  9:35             ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2012-07-11 19:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On Wed, 2012-07-11 at 14:51 +0300, Avi Kivity wrote:
> On 07/11/2012 02:23 PM, Jan Kiszka wrote:
> >> 
> >> I'd appreciate a couple of examples for formality's sake.
> > 
> > From the top of my head: NVIDIA FX3700 (granted, legacy by now), Atheros
> > AR9287. For others, I need to check.
> 
> Thanks.
> 
> >> 
> >>> And then there is not easily replaceable legacy hardware like old
> >>> telephony adapters or industrial I/O cards etc. that want this.
> >> 
> >> I imagine legacy hardware will live with the speed of routing through
> >> qemu, when running on modern platforms.
> > 
> > Just because it's "legacy" doesn't mean it's "low performance" and "low
> > interrupt rate".
> 
> I meant that it was used with lower throughput hardware, so the overhead
> of routing through qemu will be masked by the improved host hardware.
> But most of the improvement in hardware in recent years is the increase
> in core/thread count.
> 
> > We still have classic KVM device assignment to provide fast-path INTx.
> > But if we want to replace it midterm, I think it's necessary for VFIO to
> > be able to provide such a path as well.
> 
> I would like VFIO to have no regressions vs. kvm device assignment,
> except perhaps in uncommon corner cases.  So I agree.

I ran a few TCP_RR netperf tests forcing a 1Gb tg3 nic to use INTx.
Without irqchip support vfio gets a bit more than 60% of KVM device
assignment.  That's a little bit of an unfair comparison since it's more
than just the I/O path.  With the proposed interfaces here, enabling
irqchip, vfio is within 10% of KVM device assignment for INTx.  For MSI,
I can actually make vfio come out more than 30% better than KVM device
assignment if I send the eventfd from the hard irq handler.  Using a
threaded handler as the code currently does, vfio is still behind KVM.
It's hard to beat a direct call chain.
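
The "send the eventfd from the hard irq handler" variant amounts to something
like the sketch below (hypothetical names, not vfio's actual code):

#include <linux/interrupt.h>
#include <linux/eventfd.h>

/*
 * Hypothetical sketch: signal the irqfd-side eventfd directly from the
 * hard IRQ handler instead of bouncing through a threaded handler.
 */
static irqreturn_t example_msi_handler(int irq, void *arg)
{
        struct eventfd_ctx *trigger = arg;

        eventfd_signal(trigger, 1);     /* usable from hard irq context */
        return IRQ_HANDLED;
}

/* At setup time, with an eventfd_ctx taken from the user's fd: */
/*      err = request_irq(irq, example_msi_handler, 0, "example-msi", trigger); */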

For more devices, one that seems common among the non-enterprise users
are TV capture cards, like the old PVR-250/350 devices.  These don't
support MSI.  Thanks,

Alex



* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-11 19:57           ` Alex Williamson
@ 2012-07-12  9:35             ` Avi Kivity
  2012-07-12 16:19               ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2012-07-12  9:35 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On 07/11/2012 10:57 PM, Alex Williamson wrote:
>> 
>> > We still have classic KVM device assignment to provide fast-path INTx.
>> > But if we want to replace it midterm, I think it's necessary for VFIO to
>> > be able to provide such a path as well.
>> 
>> I would like VFIO to have no regressions vs. kvm device assignment,
>> except perhaps in uncommon corner cases.  So I agree.
> 
> I ran a few TCP_RR netperf tests forcing a 1Gb tg3 nic to use INTx.
> Without irqchip support vfio gets a bit more than 60% of KVM device
> assignment.  That's a little bit of an unfair comparison since it's more
> than just the I/O path.  With the proposed interfaces here, enabling
> irqchip, vfio is within 10% of KVM device assignment for INTx.  For MSI,
> I can actually make vfio come out more than 30% better than KVM device
> assignment if I send the eventfd from the hard irq handler.  Using a
> threaded handler as the code currently does, vfio is still behind KVM.
> It's hard to beat a direct call chain.

We can have a direct call chain with vfio too, using a custom eventfd
poll function, no?  Assuming we set up a fast path for unicast msi.

> For more devices, one that seems common among the non-enterprise users
> are TV capture cards, like the old PVR-250/350 devices.  These don't
> support MSI.  Thanks,

That doesn't mean they require an interrupt rate that warrants a fast
path.  But I guess that some combination of old guests or old hardware
will want it.

-- 
error compiling committee.c: too many arguments to function




* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-12  9:35             ` Avi Kivity
@ 2012-07-12 16:19               ` Alex Williamson
  2012-07-12 17:38                 ` Alex Williamson
  2012-07-15  8:33                 ` Avi Kivity
  0 siblings, 2 replies; 22+ messages in thread
From: Alex Williamson @ 2012-07-12 16:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On Thu, 2012-07-12 at 12:35 +0300, Avi Kivity wrote:
> On 07/11/2012 10:57 PM, Alex Williamson wrote:
> >> 
> >> > We still have classic KVM device assignment to provide fast-path INTx.
> >> > But if we want to replace it midterm, I think it's necessary for VFIO to
> >> > be able to provide such a path as well.
> >> 
> >> I would like VFIO to have no regressions vs. kvm device assignment,
> >> except perhaps in uncommon corner cases.  So I agree.
> > 
> > I ran a few TCP_RR netperf tests forcing a 1Gb tg3 nic to use INTx.
> > Without irqchip support vfio gets a bit more than 60% of KVM device
> > assignment.  That's a little bit of an unfair comparison since it's more
> > than just the I/O path.  With the proposed interfaces here, enabling
> > irqchip, vfio is within 10% of KVM device assignment for INTx.  For MSI,
> > I can actually make vfio come out more than 30% better than KVM device
> > assignment if I send the eventfd from the hard irq handler.  Using a
> > threaded handler as the code currently does, vfio is still behind KVM.
> > It's hard to beat a direct call chain.
> 
> We can have a direct call chain with vfio too, using a custom eventfd
> poll function, no?  Assuming we set up a fast path for unicast msi.

You'll have to help me out a little: eventfd_signal walks the wait_queue
and calls each function.  On the injection path that includes
irqfd_wakeup.  For an MSI that seems to already provide direct
injection.  For level we'll schedule_work, so that explains the overhead
in that path, but it's not too dissimilar to a threaded irq.  vfio
does something very similar, so there's a schedule_work both on inject
and on eoi.  I'll have to check whether anything prevents the unmask
from the wait_queue function in vfio, that could be a significant chunk
of the gap.  Where does the custom poll function come into play?  Thanks,

Alex
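
For reference, the wait-queue hookup being referred to follows the pattern
below, condensed in the style of the era's irqfd code (a sketch only;
shutdown, reference counting and error handling omitted):

#include <linux/eventfd.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct example_consumer {
        struct eventfd_ctx      *eventfd;
        wait_queue_t            wait;   /* our entry on the eventfd's waitqueue */
        poll_table              pt;
};

/* Called directly from eventfd_signal() as it walks the wait queue. */
static int example_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
        struct example_consumer *c =
                container_of(wait, struct example_consumer, wait);
        unsigned long events = (unsigned long)key;

        if (events & POLLIN) {
                /* fast path: we run in the signaling context, no schedule_work */
                pr_debug("eventfd signaled for consumer %p\n", c);
        }
        return 0;
}

/* poll_table callback: hooks our wait entry onto the eventfd's waitqueue. */
static void example_ptable_queue_proc(struct file *file,
                                      wait_queue_head_t *wqh, poll_table *pt)
{
        struct example_consumer *c = container_of(pt, struct example_consumer, pt);

        add_wait_queue(wqh, &c->wait);
}

static unsigned int example_attach(struct example_consumer *c, struct file *file)
{
        init_waitqueue_func_entry(&c->wait, example_wakeup);
        init_poll_funcptr(&c->pt, example_ptable_queue_proc);

        /* registers c->wait and reports any already-pending events */
        return file->f_op->poll(file, &c->pt);
}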



* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-12 16:19               ` Alex Williamson
@ 2012-07-12 17:38                 ` Alex Williamson
  2012-07-15 10:09                   ` Avi Kivity
  2012-07-15  8:33                 ` Avi Kivity
  1 sibling, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2012-07-12 17:38 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On Thu, 2012-07-12 at 10:19 -0600, Alex Williamson wrote:
> On Thu, 2012-07-12 at 12:35 +0300, Avi Kivity wrote:
> > On 07/11/2012 10:57 PM, Alex Williamson wrote:
> > >> 
> > >> > We still have classic KVM device assignment to provide fast-path INTx.
> > >> > But if we want to replace it midterm, I think it's necessary for VFIO to
> > >> > be able to provide such a path as well.
> > >> 
> > >> I would like VFIO to have no regressions vs. kvm device assignment,
> > >> except perhaps in uncommon corner cases.  So I agree.
> > > 
> > > I ran a few TCP_RR netperf tests forcing a 1Gb tg3 nic to use INTx.
> > > Without irqchip support vfio gets a bit more than 60% of KVM device
> > > assignment.  That's a little bit of an unfair comparison since it's more
> > > than just the I/O path.  With the proposed interfaces here, enabling
> > > irqchip, vfio is within 10% of KVM device assignment for INTx.  For MSI,
> > > I can actually make vfio come out more than 30% better than KVM device
> > > assignment if I send the eventfd from the hard irq handler.  Using a
> > > threaded handler as the code currently does, vfio is still behind KVM.
> > > It's hard to beat a direct call chain.
> > 
> > We can have a direct call chain with vfio too, using a custom eventfd
> > poll function, no?  Assuming we set up a fast path for unicast msi.
> 
> You'll have to help me out a little, eventfd_signal walks the wait_queue
> and calls each function.  On the injection path that includes
> irqfd_wakeup.  For an MSI that seems to already provide direct
> injection.  For level we'll schedule_work, so that explains the overhead
> in that path, but it's not too dissimilar to a threaded irq.  vfio
> does something very similar, so there's a schedule_work both on inject
> and on eoi.  I'll have to check whether anything prevents the unmask
> from the wait_queue function in vfio, that could be a significant chunk
> of the gap.

Yep, the schedule_work in the eoi is the culprit.  A direct unmask from
the wait queue function gives me better results than kvm for INTx.
We'll have to see how the leapfrogging goes once KVM switches to
injection from the hard handler.  I'm still curious what this custom
poll function would give us though.  Thanks,

Alex
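
To make the "culprit" above concrete, here is a rough contrast of the two
EOI/unmask paths, with hypothetical my_* names rather than the actual vfio
code: the deferred variant bounces through a work item (one extra hop per
interrupt), while the direct variant does the unmask from the wake-up
callback itself, which is only legal if that path never sleeps.

/* Hypothetical sketch, not actual vfio code. */
#include <linux/wait.h>
#include <linux/workqueue.h>

static void my_unmask_work(struct work_struct *work)
{
	/* ... re-enable the device interrupt (e.g. PCI INTx unmask) ... */
}

static DECLARE_WORK(my_unmask_wq, my_unmask_work);

/* Deferred: the EOI wake-up callback only queues work; the unmask
 * happens later in process context -- one extra hop per interrupt. */
static int my_eoi_wakeup_deferred(wait_queue_t *wait, unsigned mode,
				  int sync, void *key)
{
	schedule_work(&my_unmask_wq);
	return 0;
}

/* Direct: unmask straight from the wake-up callback.  Only valid if
 * the unmask path cannot sleep (spinlock-safe register access only). */
static int my_eoi_wakeup_direct(wait_queue_t *wait, unsigned mode,
				int sync, void *key)
{
	/* ... re-enable the device interrupt here ... */
	return 0;
}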


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
  2012-07-09 20:35         ` Alex Williamson
@ 2012-07-13 13:36           ` Michael S. Tsirkin
  0 siblings, 0 replies; 22+ messages in thread
From: Michael S. Tsirkin @ 2012-07-13 13:36 UTC (permalink / raw)
  To: Alex Williamson; +Cc: avi, gleb, kvm, linux-kernel, jan.kiszka

On Mon, Jul 09, 2012 at 02:35:13PM -0600, Alex Williamson wrote:
> > Well internal qemu APIs are qemu's problem and can be addressed there.
> > For example, can we make it mimic our interface: make qemu EOI notifier
> > accept an object that includes qemu_irq without irqchip and irqfd with?
> 
> So the qemu API changes based on whether irqchip is enabled or not?!

It's an opaque structure so no.

> > In other words adding interface with no users looks weird.
> 
> If we could ever finish this patch, I could start working on the
> user... ;)

But why? vfio will bypass qemu so it won't need it.

> > > > > +
> > > > > +The KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that the provided
> > > > > +kvm_eoifd structure includes a valid kvm_eoifd.irqfd file descriptor
> > > > > +for a level irqfd configured using the KVM_IRQFD_FLAG_LEVEL flag.
> > > > > +In this mode the level interrupt is de-asserted prior to EOI eventfd
> > > > > +notification.  The KVM_EOIFD_FLAG_LEVEL_IRQFD is only necessary on
> > > > > +setup, teardown is identical to that above.
> > > > 
> > > > It seems a single fd can not be used multiple times for
> > > > different irqfds? In that case:
> > > > 1. Need to document this limitation
> > > 
> > > Ok
> > > 
> > > > 2. This differs from other notifiers e.g.  ioeventfd.
> > > 
> > > But the same as an irqfd.
> > 
> > However irqfd is about interrupting guest. eoifd is more like ioeventfd
> > really: kvm writes ioeventfd/eoifd but reads irqfd.
> > 
> > >  Neither use case I'm thinking of needs to
> > > allow eventfds to be re-used and knowing that an eventfd is unique on
> > > our list makes matching it much easier.  For instance, if we have an
> > > eoifd bound to a level irqfd and it's being de-assigned, we'd have to
> > > require the eoifd is de-assigned before the irqfd so that we can
> > > validate matching eventfd, gsi, and irqfd.  That gets very racy when
> > > userspace aborts.
> > 
> > Why can't eoifd simply keep a reference to irqfd? Or register to
> > the hangup event.
> 
> An irqfd is just a file descriptor and we can't block close().

But you can reference the fd.

> I was
> trying to use struct _irq_source as the reference counted object because
> I didn't see any need to keep the irqfd around.
> 
> > > > Pls note: if you decide to remove this limitation, we need
> > > > to add some other limit on the # of EOIFDs that a VM can have,
> > > > as they won't be limited by the # of fds anymore.
> > > 
> > > I don't plan on changing this unless someone has an argument why it
> > > would be useful.
> > 
> > For example you can use same eventfd as ioeventfd
> > and eoifd.  Why not two eoifds?
> > 
> > I need to think about specifics more but it just breaks how eventfd is
> > normally used. It should be an aggregator that can count any writers and
> > support 1 reader.
> > 
> > Let me draw this:
> > 
> > W1 ---->\
> > W2 ----> +--> eventfd ----> R
> > W3 ---->/
> > 
> > 
> > This explains the difference: in irqfd kvm reads eventfd so only 1 is
> > supported. in ioeventfd (and eoifd) kvm is the writer so any number
> > should be supported.
> 
> Yes, removal is the problem.  If I allow the same eventfd to fire for
> multiple EOIs, then I need to match eventfd and irqfd on de-assign.
> That means I need to cache some bit of context about the irqfd in case
> it's closed before the eoifd.  Perhaps the _eoifd holds a reference to
> _irqfd.eventfd for that.

Yes this can work.
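
The aggregator model in the diagram above is just standard eventfd(2)
semantics: any number of writers add to the 64-bit counter, and a single
read() drains it.  A small standalone userspace illustration (not part of
the patch itself):

/* Standalone userspace demo of the W1/W2/W3 -> eventfd -> R diagram:
 * three writes accumulate into one counter, one read drains it. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	uint64_t val = 1;
	int fd = eventfd(0, 0);

	if (fd < 0) {
		perror("eventfd");
		return 1;
	}

	/* three independent "writers" signal the same eventfd */
	write(fd, &val, sizeof(val));
	write(fd, &val, sizeof(val));
	write(fd, &val, sizeof(val));

	/* one "reader" sees the aggregated count (3) and resets it to 0 */
	read(fd, &val, sizeof(val));
	printf("aggregated count: %llu\n", (unsigned long long)val);

	close(fd);
	return 0;
}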

> > > > > +
> > > > >  5. The kvm_run structure
> > > > >  ------------------------
> > > > >  
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 80bed07..62d6eca 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -2149,6 +2149,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > >  	case KVM_CAP_PCI_2_3:
> > > > >  	case KVM_CAP_KVMCLOCK_CTRL:
> > > > >  	case KVM_CAP_IRQFD_LEVEL:
> > > > > +	case KVM_CAP_EOIFD:
> > > > >  		r = 1;
> > > > >  		break;
> > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > index b2e6e4f..7567e7d 100644
> > > > > --- a/include/linux/kvm.h
> > > > > +++ b/include/linux/kvm.h
> > > > > @@ -619,6 +619,7 @@ struct kvm_ppc_smmu_info {
> > > > >  #define KVM_CAP_S390_COW 79
> > > > >  #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > > >  #define KVM_CAP_IRQFD_LEVEL 81
> > > > > +#define KVM_CAP_EOIFD 82
> > > > >  
> > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > >  
> > > > > @@ -694,6 +695,17 @@ struct kvm_irqfd {
> > > > >  	__u8  pad[20];
> > > > >  };
> > > > >  
> > > > > +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
> > > > > +#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
> > > > > +
> > > > > +struct kvm_eoifd {
> > > > > +	__u32 fd;
> > > > > +	__u32 gsi;
> > > > > +	__u32 flags;
> > > > > +	__u32 irqfd;
> > > > > +	__u8 pad[16];
> > > > > +};
> > > > > +
> > > > >  struct kvm_clock_data {
> > > > >  	__u64 clock;
> > > > >  	__u32 flags;
> > > > > @@ -834,6 +846,8 @@ struct kvm_s390_ucas_mapping {
> > > > >  #define KVM_PPC_GET_SMMU_INFO	  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
> > > > >  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
> > > > >  #define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
> > > > > +/* Available with KVM_CAP_EOIFD */
> > > > > +#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
> > > > >  
> > > > >  /*
> > > > >   * ioctls for vcpu fds
> > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > index ae3b426..83472eb 100644
> > > > > --- a/include/linux/kvm_host.h
> > > > > +++ b/include/linux/kvm_host.h
> > > > > @@ -285,6 +285,10 @@ struct kvm {
> > > > >  		struct list_head  items;
> > > > >  	} irqfds;
> > > > >  	struct list_head ioeventfds;
> > > > > +	struct {
> > > > > +		spinlock_t        lock;
> > > > 
> > > > Can't this be a mutex? It seems that code will be much simpler
> > > > if you can just take the mutex, do everything and unlock.
> > > 
> > > Yeah, looks like it could.
> > > 
> > > > For example you could prevent changing eoifds while irqfds
> > > > are changed.
> > > 
> > > I'm not sure we need to be that strict, but we could be sure to get a
> > > reference to the _source_id using the irqfd lock that way... actually we
> > > could do that in the current locking scheme too.
> > 
> > Just a suggestion.
> > 
> > > > > +		struct list_head  items;
> > > > > +	} eoifds;
> > > > >  #endif
> > > > >  	struct kvm_vm_stat stat;
> > > > >  	struct kvm_arch arch;
> > > > > @@ -828,6 +832,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
> > > > >  void kvm_irqfd_release(struct kvm *kvm);
> > > > >  void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
> > > > >  int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
> > > > > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
> > > > > +void kvm_eoifd_release(struct kvm *kvm);
> > > > >  
> > > > >  #else
> > > > >  
> > > > > @@ -853,6 +859,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> > > > >  	return -ENOSYS;
> > > > >  }
> > > > >  
> > > > > +static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > > > +{
> > > > > +	return -ENOSYS;
> > > > > +}
> > > > > +
> > > > > +static inline void kvm_eoifd_release(struct kvm *kvm) {}
> > > > > +
> > > > >  #endif /* CONFIG_HAVE_KVM_EVENTFD */
> > > > >  
> > > > >  #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> > > > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > > > index 92aa5ba..2bc9768 100644
> > > > > --- a/virt/kvm/eventfd.c
> > > > > +++ b/virt/kvm/eventfd.c
> > > > > @@ -62,8 +62,7 @@ static void _irq_source_put(struct _irq_source *source)
> > > > >  		kref_put(&source->kref, _irq_source_release);
> > > > >  }
> > > > >  
> > > > > -static struct _irq_source *__attribute__ ((used)) /* white lie for now */
> > > > > -_irq_source_get(struct _irq_source *source)
> > > > > +static struct _irq_source *_irq_source_get(struct _irq_source *source)
> > > > >  {
> > > > >  	if (source)
> > > > >  		kref_get(&source->kref);
> > > > > @@ -119,6 +118,39 @@ struct _irqfd {
> > > > >  	struct work_struct shutdown;
> > > > >  };
> > > > >  
> > > > > +static struct _irqfd *_irqfd_fdget(struct kvm *kvm, int fd)
> > > > > +{
> > > > > +	struct eventfd_ctx *eventfd;
> > > > > +	struct _irqfd *tmp, *irqfd = NULL;
> > > > > +
> > > > > +	eventfd = eventfd_ctx_fdget(fd);
> > > > > +	if (IS_ERR(eventfd))
> > > > > +		return (struct _irqfd *)eventfd;
> > > > > +
> > > > > +	spin_lock_irq(&kvm->irqfds.lock);
> > > > > +
> > > > > +	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
> > > > > +		if (tmp->eventfd == eventfd) {
> > > > > +			irqfd = tmp;
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	spin_unlock_irq(&kvm->irqfds.lock);
> > > > > +
> > > > > +	if (!irqfd) {
> > > > > +		eventfd_ctx_put(eventfd);
> > > > > +		return ERR_PTR(-ENODEV);
> > > > > +	}
> > > > 
> > > > Can irqfd get deassigned at this point?
> > > > Could one way to fix this be to take eoifd.lock in kvm_irqfd?
> > > 
> > > Yeah, I'm concerned this isn't a sufficient reference to the _irqfd.
> > > I'll play with locking; I think we actually want the reverse, holding
> > > irqfd.lock for the search through getting a reference to the
> > > _irq_source.
> > > 
> > > > > +
> > > > > +	return irqfd;
> > > > > +}
> > > > > +
> > > > > +static void _irqfd_put(struct _irqfd *irqfd)
> > > > > +{
> > > > > +	eventfd_ctx_put(irqfd->eventfd);
> > > > > +}
> > > > > +
> > > > >  static struct workqueue_struct *irqfd_cleanup_wq;
> > > > >  
> > > > >  static void
> > > > > @@ -387,6 +419,8 @@ kvm_eventfd_init(struct kvm *kvm)
> > > > >  	spin_lock_init(&kvm->irqfds.lock);
> > > > >  	INIT_LIST_HEAD(&kvm->irqfds.items);
> > > > >  	INIT_LIST_HEAD(&kvm->ioeventfds);
> > > > > +	spin_lock_init(&kvm->eoifds.lock);
> > > > > +	INIT_LIST_HEAD(&kvm->eoifds.items);
> > > > >  }
> > > > >  
> > > > >  /*
> > > > > @@ -753,3 +787,173 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> > > > >  
> > > > >  	return kvm_assign_ioeventfd(kvm, args);
> > > > >  }
> > > > > +
> > > > > +/*
> > > > > + * --------------------------------------------------------------------
> > > > > + *  eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
> > > > > + *
> > > > > + *  userspace can register GSIs with an eventfd for receiving
> > > > > + *  notification when an EOI occurs.
> > > > > + * --------------------------------------------------------------------
> > > > > + */
> > > > > +
> > > > > +struct _eoifd {
> > > > > +	struct eventfd_ctx *eventfd;
> > > > > +	struct _irq_source *source; /* for de-asserting level irqfd */
> > > > > +	struct kvm *kvm;
> > > > > +	struct kvm_irq_ack_notifier notifier;
> > > > > +	struct list_head list;
> > > > > +};
> > > > > +
> > > > > +static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
> > > > > +{
> > > > > +	struct _eoifd *eoifd;
> > > > > +
> > > > > +	eoifd = container_of(notifier, struct _eoifd, notifier);
> > > > > +
> > > > > +	/*
> > > > > +	 * If the eoifd is tied to a level irqfd we de-assert it here.
> > > > > +	 * The user is responsible for re-asserting it if their device
> > > > > +	 * still needs attention.  For notification-only, skip this.
> > > > > +	 */
> > > > > +	if (eoifd->source)
> > > > > +		kvm_set_irq(eoifd->kvm, eoifd->source->id,
> > > > > +			    eoifd->notifier.gsi, 0);
> > > > > +
> > > > 
> > > > 
> > > > This will fire even if the source in question did not assert an
> > > > interrupt in the first place.  This won't scale well if
> > > > the interrupt is shared. Don't we need to filter by source state?
> > > 
> > > Extra spurious EOIs are better than dropped EOIs.  However, it does seem
> > > we could add an atomic_t to _irq_source tracking the assertion state of
> > > our source id.  I'll experiment with that.
> > 
>> I don't see why we need more state. kvm_irq_line_state already
> > uses atomics for bits in irq_state.
> 
> kvm_irq_line_state:
> (a) static inline in irq_comm.c
> (b) changes bits
> (c) requires irq_state from ???
> 

Sorry, I failed to parse the above. I was merely saying each source
bit is changed using atomics like set_bit etc., so there's no need to
add more atomics.
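
For reference, the helper being referred to is roughly the following
(paraphrased, not copied verbatim; see the static inline in irq_comm.c for
the exact form): each irq_source_id owns one bit of irq_state, flipped with
atomic bit ops, and the resulting line level is the OR of all source bits.

/* Paraphrase of kvm_irq_line_state() as it lives in irq_comm.c at the
 * time -- check the tree for the exact form. */
static inline int kvm_irq_line_state(unsigned long *irq_state,
				     int irq_source_id, int level)
{
	/* logical OR of all sources for a level-triggered line */
	if (level)
		set_bit(irq_source_id, irq_state);
	else
		clear_bit(irq_source_id, irq_state);

	return !!(*irq_state);
}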

> > > > > +	eventfd_signal(eoifd->eventfd, 1);
> > > > > +}
> > > > > +
> > > > > +static int kvm_assign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > > > +{
> > > > > +	struct eventfd_ctx *eventfd;
> > > > > +	struct _eoifd *eoifd, *tmp;
> > > > > +	struct _irq_source *source = NULL;
> > > > > +
> > > > > +	if (args->flags & KVM_EOIFD_FLAG_LEVEL_IRQFD) {
> > > > > +		struct _irqfd *irqfd = _irqfd_fdget(kvm, args->irqfd);
> > > > > +		if (IS_ERR(irqfd))
> > > > > +			return PTR_ERR(irqfd);
> > > > > +
> > > > > +		if (irqfd->gsi != args->gsi) {
> > > > > +			_irqfd_put(irqfd);
> > > > > +			return -EINVAL;
> > > > > +		}
> > > > > +
> > > > 
> > > > This requirement does not seem to be documented.
> > > > If we require this, do we need to pass in gsi at all?
> > > 
> > > It seemed obvious to me that when using KVM_EOIFD_FLAG_LEVEL_IRQFD the
> > > gsi needs to match that used for the kvm_irqfd.  Suppose I should never
> > > assume anything is obvious in an API spec.  I'll add a comment.  I think
> > > it would make the user interface more confusing to specify that gsi is
> > > not required when using a level irqfd,
> > 
> > Maybe just make it a union: one or the other?
> > 
> > 
> > > besides it gives us an easy sanity check.
> > 
> > Well, pass fewer parameters and there will be fewer bugs and less stuff
> > to check :)
> > 
> > > > > +		source = _irq_source_get(irqfd->source);
> > > > > +		_irqfd_put(irqfd);
> > > > > +		if (!source)
> > > > > +			return -EINVAL;
> > > > > +	}
> > > > > +
> > > > > +	eventfd = eventfd_ctx_fdget(args->fd);
> > > > > +	if (IS_ERR(eventfd)) {
> > > > > +		_irq_source_put(source);
> > > > > +		return PTR_ERR(eventfd);
> > > > > +	}
> > > > > +
> > > > > +	eoifd = kzalloc(sizeof(*eoifd), GFP_KERNEL);
> > > > > +	if (!eoifd) {
> > > > > +		_irq_source_put(source);
> > > > > +		eventfd_ctx_put(eventfd);
> > > > > +		return -ENOMEM;
> > > > > +	}
> > > > > +
> > > > > +	INIT_LIST_HEAD(&eoifd->list);
> > > > > +	eoifd->kvm = kvm;
> > > > > +	eoifd->eventfd = eventfd;
> > > > > +	eoifd->source = source;
> > > > > +	eoifd->notifier.gsi = args->gsi;
> > > > > +	eoifd->notifier.irq_acked = eoifd_event;
> > > > > +
> > > > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > > > +
> > > > > +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> > > > > +		if (eoifd->eventfd != tmp->eventfd)
> > > > > +			continue;
> > > > > +
> > > > > +		spin_unlock_irq(&kvm->eoifds.lock);
> > > > > +		_irq_source_put(source);
> > > > > +		eventfd_ctx_put(eventfd);
> > > > > +		kfree(eoifd);
> > > > > +		return -EBUSY;
> > > > 
> > > > 
> > > > Please rewrite this function using labels for error handling.
> > > 
> > > Ok
> > > 
> > > > > +	}
> > > > > +
> > > > > +	list_add_tail(&eoifd->list, &kvm->eoifds.items);
> > > > > +
> > > > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > > > +
> > > > > +	kvm_register_irq_ack_notifier(kvm, &eoifd->notifier);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +static void eoifd_deactivate(struct kvm *kvm, struct _eoifd *eoifd)
> > > > 
> > > > Pls find a less confusing name for this.
> > > > It just frees the eoifd. No tricky deactivation here (unlike irqfd).
> > > 
> > > Uh, confusing?  It does actually deactivate it by unregistering the irq
> > > ack notifier.
> > 
> > It also destroys it completely. To me deactivate implies "make inactive
> > but do not destroy".
> 
> Umm, but as you note irqfd_deactivate does result in destroying it.

It doesn't result in this directly. It merely deactivates and queues it
so it will be destroyed later.

> > >  I was attempting to be consistent with the function
> > > naming by following irqfd_deactivate, I didn't realize there was a
> > > complexity requirement associated with the name.
> > 
> > irqfd_deactivate is named like this because it does not free the object:
> > it merely deactivates it and queues up the free job.
> > And that in turn is because of tricky locking issues specific to irqfd
> > that have to do with the fact kvm reads it. eoifd is written
> > from kvm so no such problem.
> 
> I fail to see the logic that deactivating and *queuing* a free job is
> significantly different from deactivating and actually freeing.  In
> either case, the object is gone at the end as far as the user of the
> foo_deactivate function is concerned.  Maybe the irqfd function is
> poorly named too.

Maybe. We have a two-stage destructor for irqfd, so instead of the usual
_release we had to invent two different names for the two stages, in an
attempt to clarify that neither is the complete destructor. Current code
calls them deactivate and shutdown.

eoifd is more like ioeventfd: it does not need two stages, so it can have
a straightforward name; call it eoi_release.

> > > > > +{
> > > > > +	kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
> > > > > +	_irq_source_put(eoifd->source);
> > > > > +	eventfd_ctx_put(eoifd->eventfd);
> > > > > +	kfree(eoifd);
> > > > > +}
> > > > > +
> > > > > +void kvm_eoifd_release(struct kvm *kvm)
> > > > > +{
> > > > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > > > +
> > > > > +	while (!list_empty(&kvm->eoifds.items)) {
> > > > > +		struct _eoifd *eoifd;
> > > > > +
> > > > > +		eoifd = list_first_entry(&kvm->eoifds.items,
> > > > > +					 struct _eoifd, list);
> > > > > +		list_del(&eoifd->list);
> > > > > +
> > > > > +		/* Drop spinlocks since eoifd_deactivate can sleep */
> > > > > +		spin_unlock_irq(&kvm->eoifds.lock);
> > > > > +		eoifd_deactivate(kvm, eoifd);
> > > > > +		spin_lock_irq(&kvm->eoifds.lock);
> > > > > +	}
> > > > > +
> > > > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > > > +}
> > > > > +
> > > > > +static int kvm_deassign_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > > > +{
> > > > > +	struct eventfd_ctx *eventfd;
> > > > > +	struct _eoifd *tmp, *eoifd = NULL;
> > > > > +
> > > > > +	eventfd = eventfd_ctx_fdget(args->fd);
> > > > > +	if (IS_ERR(eventfd))
> > > > > +		return PTR_ERR(eventfd);
> > > > > +
> > > > > +	spin_lock_irq(&kvm->eoifds.lock);
> > > > > +
> > > > > +	list_for_each_entry(tmp, &kvm->eoifds.items, list) {
> > > > > +		if (tmp->eventfd == eventfd && tmp->notifier.gsi == args->gsi) {
> > > > 
> > > > This is repeated several times.
> > > > Looks like an eoifd_collision function is called for.
> > > 
> > > The tests aren't quite identical.  Here we test eventfd and gsi, the
> > > latter primarily to validate kvm_eoifd parameters.  Further above we
> > > test that eventfd is unique on addition without comparing gsi.  Both of
> > > these match the existing precedent set by irqfd.  With _irqfd_fdget we
> > > do have two searches of the irqfds.items list comparing only eventfd...
> > > I'll see about making a trivial helper there.  Thanks,
> > > 
> > > Alex
> > 
> > I hope I clarified that irqfd is not a good match. Pls make it
> > behave like ioeventfd instead.
> > 
> > > > > +			eoifd = tmp;
> > > > > +			list_del(&eoifd->list);
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	spin_unlock_irq(&kvm->eoifds.lock);
> > > > > +
> > > > > +	eventfd_ctx_put(eventfd);
> > > > > +
> > > > > +	if (!eoifd)
> > > > > +		return -ENOENT;
> > > > > +
> > > > > +	eoifd_deactivate(kvm, eoifd);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > > > +{
> > > > > +	if (args->flags & ~(KVM_EOIFD_FLAG_DEASSIGN |
> > > > > +			    KVM_EOIFD_FLAG_LEVEL_IRQFD))
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (args->flags & KVM_EOIFD_FLAG_DEASSIGN)
> > > > > +		return kvm_deassign_eoifd(kvm, args);
> > > > > +
> > > > > +	return kvm_assign_eoifd(kvm, args);
> > > > > +}
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > index b4ad14cc..5b41df1 100644
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > > @@ -620,6 +620,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
> > > > >  
> > > > >  	kvm_irqfd_release(kvm);
> > > > >  
> > > > > +	kvm_eoifd_release(kvm);
> > > > > +
> > > > >  	kvm_put_kvm(kvm);
> > > > >  	return 0;
> > > > >  }
> > > > > @@ -2093,6 +2095,15 @@ static long kvm_vm_ioctl(struct file *filp,
> > > > >  		break;
> > > > >  	}
> > > > >  #endif
> > > > > +	case KVM_EOIFD: {
> > > > > +		struct kvm_eoifd data;
> > > > > +
> > > > > +		r = -EFAULT;
> > > > > +		if (copy_from_user(&data, argp, sizeof data))
> > > > > +			goto out;
> > > > > +		r = kvm_eoifd(kvm, &data);
> > > > > +		break;
> > > > > +	}
> > > > >  	default:
> > > > >  		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
> > > > >  		if (r == -ENOTTY)
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-12 16:19               ` Alex Williamson
  2012-07-12 17:38                 ` Alex Williamson
@ 2012-07-15  8:33                 ` Avi Kivity
  2012-07-16 14:03                   ` Alex Williamson
  1 sibling, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2012-07-15  8:33 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On 07/12/2012 07:19 PM, Alex Williamson wrote:
> On Thu, 2012-07-12 at 12:35 +0300, Avi Kivity wrote:
>> On 07/11/2012 10:57 PM, Alex Williamson wrote:
>> >> 
>> >> > We still have classic KVM device assignment to provide fast-path INTx.
>> >> > But if we want to replace it midterm, I think it's necessary for VFIO to
>> >> > be able to provide such a path as well.
>> >> 
>> >> I would like VFIO to have no regressions vs. kvm device assignment,
>> >> except perhaps in uncommon corner cases.  So I agree.
>> > 
>> > I ran a few TCP_RR netperf tests forcing a 1Gb tg3 nic to use INTx.
>> > Without irqchip support vfio gets a bit more than 60% of KVM device
>> > assignment.  That's a little bit of an unfair comparison since it's more
>> > than just the I/O path.  With the proposed interfaces here, enabling
>> > irqchip, vfio is within 10% of KVM device assignment for INTx.  For MSI,
>> > I can actually make vfio come out more than 30% better than KVM device
>> > assignment if I send the eventfd from the hard irq handler.  Using a
>> > threaded handler as the code currently does, vfio is still behind KVM.
>> > It's hard to beat a direct call chain.
>> 
>> We can have a direct call chain with vfio too, using a custom eventfd
>> poll function, no?  Assuming we set up a fast path for unicast msi.
> 
> You'll have to help me out a little, eventfd_signal walks the wait_queue
> and calls each function.  On the injection path that includes
> irqfd_wakeup.  

This is what I meant, except I forgot that we already do direct path for
MSI.

> For an MSI that seems to already provide direct
> injection.  

Ugh, even for a broadcast MSI into 254 vcpu guests.  That's going to be
one slow interrupt.

> For level we'll schedule_work, so that explains the overhead
> in that path, but it's not too dissimilar to a threaded irq.  vfio
> does something very similar, so there's a schedule_work both on inject
> and on eoi.  I'll have to check whether anything prevents the unmask
> from the wait_queue function in vfio, that could be a significant chunk
> of the gap.  Where's the custom poll function come into play?  Thanks,

So I don't understand where the gap comes from.  The number of context
switches for kvm and vfio is the same, as long as both use MSI (and
either both use threaded irq or both don't).

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-12 17:38                 ` Alex Williamson
@ 2012-07-15 10:09                   ` Avi Kivity
  2012-07-16 14:08                     ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2012-07-15 10:09 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On 07/12/2012 08:38 PM, Alex Williamson wrote:
> On Thu, 2012-07-12 at 10:19 -0600, Alex Williamson wrote:
>> On Thu, 2012-07-12 at 12:35 +0300, Avi Kivity wrote:
>> > On 07/11/2012 10:57 PM, Alex Williamson wrote:
>> > >> 
>> > >> > We still have classic KVM device assignment to provide fast-path INTx.
>> > >> > But if we want to replace it midterm, I think it's necessary for VFIO to
>> > >> > be able to provide such a path as well.
>> > >> 
>> > >> I would like VFIO to have no regressions vs. kvm device assignment,
>> > >> except perhaps in uncommon corner cases.  So I agree.
>> > > 
>> > > I ran a few TCP_RR netperf tests forcing a 1Gb tg3 nic to use INTx.
>> > > Without irqchip support vfio gets a bit more than 60% of KVM device
>> > > assignment.  That's a little bit of an unfair comparison since it's more
>> > > than just the I/O path.  With the proposed interfaces here, enabling
>> > > irqchip, vfio is within 10% of KVM device assignment for INTx.  For MSI,
>> > > I can actually make vfio come out more than 30% better than KVM device
>> > > assignment if I send the eventfd from the hard irq handler.  Using a
>> > > threaded handler as the code currently does, vfio is still behind KVM.
>> > > It's hard to beat a direct call chain.
>> > 
>> > We can have a direct call chain with vfio too, using a custom eventfd
>> > poll function, no?  Assuming we set up a fast path for unicast msi.
>> 
>> You'll have to help me out a little, eventfd_signal walks the wait_queue
>> and calls each function.  On the injection path that includes
>> irqfd_wakeup.  For an MSI that seems to already provide direct
>> injection.  For level we'll schedule_work, so that explains the overhead
>> in that path, but it's not too dissimilar to a threaded irq.  vfio
>> does something very similar, so there's a schedule_work both on inject
>> and on eoi.  I'll have to check whether anything prevents the unmask
>> from the wait_queue function in vfio, that could be a significant chunk
>> of the gap.
> 
> Yep, the schedule_work in the eoi is the culprit.  A direct unmask from
> the wait queue function gives me better results than kvm for INTx.
> We'll have to see how the leapfrogging goes once KVM switches to
> injection from the hard handler.  I'm still curious what this custom
> poll function would give us though.  Thanks,
> 

btw, why is the overhead so large?  A context switch should be on the
order of 1 microsecond or less.  Given that, every 5000 context switches
per second cost a 1% cpu load on one core.  You would need a very heavy
interrupt load to see  a large degradation.  Or is the extra latency the
problem?

-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-15  8:33                 ` Avi Kivity
@ 2012-07-16 14:03                   ` Alex Williamson
  2012-07-16 14:35                     ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2012-07-16 14:03 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On Sun, 2012-07-15 at 11:33 +0300, Avi Kivity wrote:
> On 07/12/2012 07:19 PM, Alex Williamson wrote:
> > On Thu, 2012-07-12 at 12:35 +0300, Avi Kivity wrote:
> >> On 07/11/2012 10:57 PM, Alex Williamson wrote:
> >> >> 
> >> >> > We still have classic KVM device assignment to provide fast-path INTx.
> >> >> > But if we want to replace it midterm, I think it's necessary for VFIO to
> >> >> > be able to provide such a path as well.
> >> >> 
> >> >> I would like VFIO to have no regressions vs. kvm device assignment,
> >> >> except perhaps in uncommon corner cases.  So I agree.
> >> > 
> >> > I ran a few TCP_RR netperf tests forcing a 1Gb tg3 nic to use INTx.
> >> > Without irqchip support vfio gets a bit more than 60% of KVM device
> >> > assignment.  That's a little bit of an unfair comparison since it's more
> >> > than just the I/O path.  With the proposed interfaces here, enabling
> >> > irqchip, vfio is within 10% of KVM device assignment for INTx.  For MSI,
> >> > I can actually make vfio come out more than 30% better than KVM device
> >> > assignment if I send the eventfd from the hard irq handler.  Using a
> >> > threaded handler as the code currently does, vfio is still behind KVM.
> >> > It's hard to beat a direct call chain.
> >> 
> >> We can have a direct call chain with vfio too, using a custom eventfd
> >> poll function, no?  Assuming we set up a fast path for unicast msi.
> > 
> > You'll have to help me out a little, eventfd_signal walks the wait_queue
> > and calls each function.  On the injection path that includes
> > irqfd_wakeup.  
> 
> This is what I meant, except I forgot that we already do direct path for
> MSI.

Ok, vfio now does it for the unmask irqfd-line interface too.  Except
when we re-inject from eoifd we have to do the eventfd_signal from a
work queue as we can't have nested eventfd_signals.  We probably need to
do some benchmarks to see if that re-injection path saves us anything vs
letting hardware fire again.
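
A rough sketch of the re-injection workaround being described (hypothetical
my_* names, not the actual vfio code): since one eventfd's wake-up callback
can't call eventfd_signal() on another eventfd as noted above, the signal is
bounced through a work item, which is exactly the extra hop whose cost is in
question.

/* Hypothetical sketch of bouncing a re-injection through a work item. */
#include <linux/eventfd.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct my_reinject {
	struct eventfd_ctx *trigger;	/* the level irqfd to re-assert */
	struct work_struct inject;
};

static void my_reinject_fn(struct work_struct *work)
{
	struct my_reinject *r = container_of(work, struct my_reinject, inject);

	eventfd_signal(r->trigger, 1);	/* safe here: process context */
}

static void my_reinject_init(struct my_reinject *r, struct eventfd_ctx *trigger)
{
	r->trigger = trigger;
	INIT_WORK(&r->inject, my_reinject_fn);
}

/* called from the EOI wake-up callback instead of signaling directly */
static void my_reinject_queue(struct my_reinject *r)
{
	schedule_work(&r->inject);
}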

> > For an MSI that seems to already provide direct
> > injection.  
> 
> Ugh, even for a broadcast MSI into 254 vcpu guests.  That's going to be
> one slow interrupt.
> 
> > For level we'll schedule_work, so that explains the overhead
> > in that path, but it's not too dissimilar to a threaded irq.  vfio
> > does something very similar, so there's a schedule_work both on inject
> > and on eoi.  I'll have to check whether anything prevents the unmask
> > from the wait_queue function in vfio, that could be a significant chunk
> > of the gap.  Where's the custom poll function come into play?  Thanks,
> 
> So I don't understand where the gap comes from.  The number of context
> switches for kvm and vfio is the same, as long as both use MSI (and
> either both use threaded irq or both don't).

Right, we're not exactly apples to apples yet.  Using threaded
interrupts and work queue injection, vfio is a little slower.  There's
an extra work queue in that path vs kvm though.  Using non-threaded
interrupts and direct injection, vfio is faster.  Once kvm moves to
non-threaded interrupt handling, I expect we'll be pretty similar.  My
benchmarks are just rough estimates at this point as I'm both trying to
work out lockdep and get some ballpark performance comparison.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-15 10:09                   ` Avi Kivity
@ 2012-07-16 14:08                     ` Alex Williamson
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Williamson @ 2012-07-16 14:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On Sun, 2012-07-15 at 13:09 +0300, Avi Kivity wrote:
> On 07/12/2012 08:38 PM, Alex Williamson wrote:
> > On Thu, 2012-07-12 at 10:19 -0600, Alex Williamson wrote:
> >> On Thu, 2012-07-12 at 12:35 +0300, Avi Kivity wrote:
> >> > On 07/11/2012 10:57 PM, Alex Williamson wrote:
> >> > >> 
> >> > >> > We still have classic KVM device assignment to provide fast-path INTx.
> >> > >> > But if we want to replace it midterm, I think it's necessary for VFIO to
> >> > >> > be able to provide such a path as well.
> >> > >> 
> >> > >> I would like VFIO to have no regressions vs. kvm device assignment,
> >> > >> except perhaps in uncommon corner cases.  So I agree.
> >> > > 
> >> > > I ran a few TCP_RR netperf tests forcing a 1Gb tg3 nic to use INTx.
> >> > > Without irqchip support vfio gets a bit more than 60% of KVM device
> >> > > assignment.  That's a little bit of an unfair comparison since it's more
> >> > > than just the I/O path.  With the proposed interfaces here, enabling
> >> > > irqchip, vfio is within 10% of KVM device assignment for INTx.  For MSI,
> >> > > I can actually make vfio come out more than 30% better than KVM device
> >> > > assignment if I send the eventfd from the hard irq handler.  Using a
> >> > > threaded handler as the code currently does, vfio is still behind KVM.
> >> > > It's hard to beat a direct call chain.
> >> > 
> >> > We can have a direct call chain with vfio too, using a custom eventfd
> >> > poll function, no?  Assuming we set up a fast path for unicast msi.
> >> 
> >> You'll have to help me out a little, eventfd_signal walks the wait_queue
> >> and calls each function.  On the injection path that includes
> >> irqfd_wakeup.  For an MSI that seems to already provide direct
> >> injection.  For level we'll schedule_work, so that explains the overhead
> >> in that path, but it's not too dissimilar to a threaded irq.  vfio
> >> does something very similar, so there's a schedule_work both on inject
> >> and on eoi.  I'll have to check whether anything prevents the unmask
> >> from the wait_queue function in vfio, that could be a significant chunk
> >> of the gap.
> > 
> > Yep, the schedule_work in the eoi is the culprit.  A direct unmask from
> > the wait queue function gives me better results than kvm for INTx.
> > We'll have to see how the leapfrogging goes once KVM switches to
> > injection from the hard handler.  I'm still curious what this custom
> > poll function would give us though.  Thanks,
> > 
> 
> btw, why is the overhead so large?  A context switch should be on the
> order of 1 microsecond or less.  Given that, every 5000 context switches
> per second cost a 1% cpu load on one core.  You would need a very heavy
> interrupt load to see  a large degradation.  Or is the extra latency the
> problem?

I'm using TCP_RR, so latency is the factor.  As I mentioned though, I
have way too much kernel debugging enabled to take these as anything
more than rough estimates.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/2] kvm: level irqfd and new eoifd
  2012-07-16 14:03                   ` Alex Williamson
@ 2012-07-16 14:35                     ` Avi Kivity
  0 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2012-07-16 14:35 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jan Kiszka, mst, gleb, kvm, linux-kernel

On 07/16/2012 05:03 PM, Alex Williamson wrote:
>> 
>> This is what I meant, except I forgot that we already do direct path for
>> MSI.
> 
> Ok, vfio now does it for the unmask irqfd-line interface too.  Except
> when we re-inject from eoifd we have to do the eventfd_signal from a
> work queue as we can't have nested eventfd_signals.  We probably need to
> do some benchmarks to see if that re-injection path saves us anything vs
> letting hardware fire again.

If you do that you might as well proxy it in userspace.  Yes the big
qemu lock will get in the way but we shouldn't code the kernel with the
expectation that userspace will be forever broken.

> 
>> > For an MSI that seems to already provide direct
>> > injection.  
>> 
>> Ugh, even for a broadcast MSI into 254 vcpu guests.  That's going to be
>> one slow interrupt.
>> 
>> > For level we'll schedule_work, so that explains the overhead
>> > in that path, but it's not too dissimilar to a threaded irq.  vfio
>> > does something very similar, so there's a schedule_work both on inject
>> > and on eoi.  I'll have to check whether anything prevents the unmask
>> > from the wait_queue function in vfio, that could be a significant chunk
>> > of the gap.  Where's the custom poll function come into play?  Thanks,
>> 
>> So I don't understand where the gap comes from.  The number of context
>> switches for kvm and vfio is the same, as long as both use MSI (and
>> either both use threaded irq or both don't).
> 
> Right, we're not exactly apples to apples yet.  Using threaded
> interrupts and work queue injection, vfio is a little slower.  There's
> an extra work queue in that path vs kvm though.  Using non-threaded
> interrupts and direct injection, vfio is faster.  Once kvm moves to
> non-threaded interrupt handling, I expect we'll be pretty similar.  My
> benchmarks are just rough estimates at this point as I'm both trying to
> work out lockdep and get some ballpark performance comparison.  Thanks,

Okay.  I'm not really interested in how the code compares today, but
whether there is something in vfio that prevents it achieving kvm
performance once it's completely optimized.  Given the above, I don't
think there is.


-- 
error compiling committee.c: too many arguments to function



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2012-07-16 14:35 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-03 19:21 [PATCH v3 0/2] kvm: level irqfd and new eoifd Alex Williamson
2012-07-03 19:21 ` [PATCH v3 1/2] kvm: Extend irqfd to support level interrupts Alex Williamson
2012-07-03 19:21 ` [PATCH v3 2/2] kvm: KVM_EOIFD, an eventfd for EOIs Alex Williamson
2012-07-04 14:00   ` Michael S. Tsirkin
2012-07-05  4:24     ` Alex Williamson
2012-07-05 15:53       ` Michael S. Tsirkin
2012-07-09 20:35         ` Alex Williamson
2012-07-13 13:36           ` Michael S. Tsirkin
2012-07-11  9:53 ` [PATCH v3 0/2] kvm: level irqfd and new eoifd Avi Kivity
2012-07-11 10:18   ` Jan Kiszka
2012-07-11 10:49     ` Avi Kivity
2012-07-11 11:23       ` Jan Kiszka
2012-07-11 11:51         ` Avi Kivity
2012-07-11 19:57           ` Alex Williamson
2012-07-12  9:35             ` Avi Kivity
2012-07-12 16:19               ` Alex Williamson
2012-07-12 17:38                 ` Alex Williamson
2012-07-15 10:09                   ` Avi Kivity
2012-07-16 14:08                     ` Alex Williamson
2012-07-15  8:33                 ` Avi Kivity
2012-07-16 14:03                   ` Alex Williamson
2012-07-16 14:35                     ` Avi Kivity
