* [KVM PATCH v5 0/4] irqfd fixes and enhancements
@ 2009-06-25 13:28 Gregory Haskins
  2009-06-25 13:28 ` [KVM PATCH v5 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-25 13:28 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, mst, avi, paulmck, davidel, rusty

(Applies to kvm.git/master:4631e094)

The following is the latest attempt to fix the races in irqfd/eventfd, as
well as restore DEASSIGN support.  For more details, please read the patch
headers.

This series has been tested against the kvm-eventfd unit test, and
appears to be functioning properly.  You can download this test here:

ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2

I've included version 4 of Davide's eventfd patch (ported to kvm.git) so
that it's a complete, reviewable series.  Note, however, that there may be
later versions of his patch to consider for merging, so we should
coordinate with him.

-Greg

---

Davide Libenzi (1):
      eventfd - revised interface and cleanups (4th rev)

Gregory Haskins (3):
      KVM: add irqfd DEASSIGN feature
      KVM: Fix races in irqfd using new eventfd_kref_get interface
      kvm: prepare irqfd for having interrupts disabled during eventfd->release


 drivers/lguest/lg.h          |    2 
 drivers/lguest/lguest_user.c |    4 -
 fs/aio.c                     |   24 +---
 fs/eventfd.c                 |  126 ++++++++++++++++---
 include/linux/aio.h          |    4 -
 include/linux/eventfd.h      |   35 ++++-
 include/linux/kvm.h          |    2 
 include/linux/kvm_host.h     |    7 +
 virt/kvm/Kconfig             |    1 
 virt/kvm/eventfd.c           |  284 +++++++++++++++++++++++++++++++++---------
 10 files changed, 379 insertions(+), 110 deletions(-)

-- 
Signature


* [KVM PATCH v5 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release
  2009-06-25 13:28 [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
@ 2009-06-25 13:28 ` Gregory Haskins
  2009-06-25 13:28 ` [KVM PATCH v5 2/4] eventfd - revised interface and cleanups (4th rev) Gregory Haskins
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-25 13:28 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, mst, avi, paulmck, davidel, rusty

We need to plug some race conditions on eventfd shutdown.  In order to
do this, we need to change the context in which the release notification
is delivered so that the wqh lock is now held.  However, there is currently
code in the release callback that assumes it can sleep.

We have a slight chicken-and-egg problem: we can't fix the race
without adding the lock, and we can't add the lock without breaking
the sleepy code.  Normally we could deal with this by making both
changes in one atomic changeset.  However, we want to keep the eventfd
and KVM-specific changes isolated to ease the reviewer burden on upstream
eventfd (at the specific request of upstream).  Therefore, we have this
intermediate patch.

This intermediate patch allows the release() method to work in an atomic
context, at the expense of correctness w.r.t. memory leaks.  Today we have
a race condition; with this patch applied, we leak.  Both issues will be
resolved later in the series.  It is the author's opinion that a leak is
better for bisectability than the hang we would get if the sleepy code
were left in place after the locking changeover.
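
To make the constraint concrete, consider this sketch (illustration only,
not part of the patch): once the POLLHUP notification is delivered under
the wqh lock, the wakeup callback runs in atomic context, so any teardown
that might sleep has to be deferred:

	static int
	example_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
	{
		unsigned long flags = (unsigned long)key;

		if (flags & POLLHUP) {
			/*
			 * We are called under wqh->lock with interrupts
			 * disabled: mutex_lock(), flush_work(),
			 * synchronize_srcu(), etc. are all illegal here.
			 * Push the sleepy teardown to a work item instead.
			 */
			schedule_work(&example_teardown);
		}

		return 0;
	}

(example_wakeup and example_teardown are hypothetical names used purely
for illustration.)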

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 virt/kvm/eventfd.c |   89 ++++++++++++++++------------------------------------
 1 files changed, 27 insertions(+), 62 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index a9e7de7..9656027 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -28,7 +28,6 @@
 #include <linux/file.h>
 #include <linux/list.h>
 #include <linux/eventfd.h>
-#include <linux/srcu.h>
 
 /*
  * --------------------------------------------------------------------
@@ -38,8 +37,6 @@
  * --------------------------------------------------------------------
  */
 struct _irqfd {
-	struct mutex              lock;
-	struct srcu_struct        srcu;
 	struct kvm               *kvm;
 	int                       gsi;
 	struct list_head          list;
@@ -53,48 +50,12 @@ static void
 irqfd_inject(struct work_struct *work)
 {
 	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
-	struct kvm *kvm;
-	int idx;
+	struct kvm *kvm = irqfd->kvm;
 
-	idx = srcu_read_lock(&irqfd->srcu);
-
-	kvm = rcu_dereference(irqfd->kvm);
-	if (kvm) {
-		mutex_lock(&kvm->irq_lock);
-		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
-		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
-		mutex_unlock(&kvm->irq_lock);
-	}
-
-	srcu_read_unlock(&irqfd->srcu, idx);
-}
-
-static void
-irqfd_disconnect(struct _irqfd *irqfd)
-{
-	struct kvm *kvm;
-
-	mutex_lock(&irqfd->lock);
-
-	kvm = rcu_dereference(irqfd->kvm);
-	rcu_assign_pointer(irqfd->kvm, NULL);
-
-	mutex_unlock(&irqfd->lock);
-
-	if (!kvm)
-		return;
-
-	mutex_lock(&kvm->lock);
-	list_del(&irqfd->list);
-	mutex_unlock(&kvm->lock);
-
-	/*
-	 * It is important to not drop the kvm reference until the next grace
-	 * period because there might be lockless references in flight up
-	 * until then
-	 */
-	synchronize_srcu(&irqfd->srcu);
-	kvm_put_kvm(kvm);
+	mutex_lock(&kvm->irq_lock);
+	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
+	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+	mutex_unlock(&kvm->irq_lock);
 }
 
 static int
@@ -103,26 +64,24 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
 	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
 	unsigned long flags = (unsigned long)key;
 
+	/*
+	 * Assume we will be called with interrupts disabled
+	 */
 	if (flags & POLLIN)
 		/*
-		 * The POLLIN wake_up is called with interrupts disabled.
-		 * Therefore we need to defer the IRQ injection until later
-		 * since we need to acquire the kvm->lock to do so.
+		 * Defer the IRQ injection until later since we need to
+		 * acquire the kvm->lock to do so.
 		 */
 		schedule_work(&irqfd->inject);
 
 	if (flags & POLLHUP) {
 		/*
-		 * The POLLHUP is called unlocked, so it theoretically should
-		 * be safe to remove ourselves from the wqh using the locked
-		 * variant of remove_wait_queue()
+		 * for now, just remove ourselves from the list and let
+		 * the rest dangle.  We will fix this up later once
+		 * the races in eventfd are fixed
 		 */
-		remove_wait_queue(irqfd->wqh, &irqfd->wait);
-		flush_work(&irqfd->inject);
-		irqfd_disconnect(irqfd);
-
-		cleanup_srcu_struct(&irqfd->srcu);
-		kfree(irqfd);
+		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
+		irqfd->wqh = NULL;
 	}
 
 	return 0;
@@ -150,8 +109,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
 	if (!irqfd)
 		return -ENOMEM;
 
-	mutex_init(&irqfd->lock);
-	init_srcu_struct(&irqfd->srcu);
 	irqfd->kvm = kvm;
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
@@ -172,8 +129,6 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
 
 	events = file->f_op->poll(file, &irqfd->pt);
 
-	kvm_get_kvm(kvm);
-
 	mutex_lock(&kvm->lock);
 	list_add_tail(&irqfd->list, &kvm->irqfds);
 	mutex_unlock(&kvm->lock);
@@ -211,6 +166,16 @@ kvm_irqfd_release(struct kvm *kvm)
 {
 	struct _irqfd *irqfd, *tmp;
 
-	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
-		irqfd_disconnect(irqfd);
+	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
+		if (irqfd->wqh)
+			remove_wait_queue(irqfd->wqh, &irqfd->wait);
+
+		flush_work(&irqfd->inject);
+
+		mutex_lock(&kvm->lock);
+		list_del(&irqfd->list);
+		mutex_unlock(&kvm->lock);
+
+		kfree(irqfd);
+	}
 }



* [KVM PATCH v5 2/4] eventfd - revised interface and cleanups (4th rev)
  2009-06-25 13:28 [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
  2009-06-25 13:28 ` [KVM PATCH v5 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
@ 2009-06-25 13:28 ` Gregory Haskins
  2009-06-25 13:28 ` [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-25 13:28 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, mst, avi, paulmck, davidel, rusty

From: Davide Libenzi <davidel@xmailserver.org>

The following patch changes the eventfd interface to decouple the eventfd
memory context from the file pointer instance.
Without such a change, there is no clean, race-free way to handle the
POLLHUP event sent when the last instance of the file* goes away.
Also, the internal eventfd APIs now use the eventfd context instead
of the file*.
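
For kernel-side users, the resulting usage pattern looks roughly like
this (a sketch against the interfaces added below; error handling
trimmed):

	struct eventfd_ctx *ctx;

	ctx = eventfd_ctx_fdget(fd);	/* pins the context, not the file */
	if (IS_ERR(ctx))
		return PTR_ERR(ctx);

	eventfd_signal(ctx, 1);		/* kick the counter from the kernel */
	/* ... */
	eventfd_ctx_put(ctx);		/* drop the context reference */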

Gregory, none of the APIs changed, so your patches do not need to be
rebased over this one. The last three revisions just updated comments.

Andrew, I'm not posting the "AIO select EVENTFD" patch at this time. Will
eventually try another push in your skull once I come back from vacation ;)

[gmh: resolved merge conflict against prior POLLHUP patch]

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 drivers/lguest/lg.h          |    2 -
 drivers/lguest/lguest_user.c |    4 +
 fs/aio.c                     |   24 ++------
 fs/eventfd.c                 |  126 ++++++++++++++++++++++++++++++++++++------
 include/linux/aio.h          |    4 +
 include/linux/eventfd.h      |   35 +++++++++---
 6 files changed, 147 insertions(+), 48 deletions(-)

diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
index d4e8979..9c31382 100644
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -82,7 +82,7 @@ struct lg_cpu {
 
 struct lg_eventfd {
 	unsigned long addr;
-	struct file *event;
+	struct eventfd_ctx *event;
 };
 
 struct lg_eventfd_map {
diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
index 32e2971..9f9a295 100644
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -50,7 +50,7 @@ static int add_eventfd(struct lguest *lg, unsigned long addr, int fd)
 
 	/* Now append new entry. */
 	new->map[new->num].addr = addr;
-	new->map[new->num].event = eventfd_fget(fd);
+	new->map[new->num].event = eventfd_ctx_fdget(fd);
 	if (IS_ERR(new->map[new->num].event)) {
 		kfree(new);
 		return PTR_ERR(new->map[new->num].event);
@@ -357,7 +357,7 @@ static int close(struct inode *inode, struct file *file)
 
 	/* Release any eventfds they registered. */
 	for (i = 0; i < lg->eventfds->num; i++)
-		fput(lg->eventfds->map[i].event);
+		eventfd_ctx_put(lg->eventfds->map[i].event);
 	kfree(lg->eventfds);
 
 	/* If lg->dead doesn't contain an error code it will be NULL or a
diff --git a/fs/aio.c b/fs/aio.c
index 76da125..d065b2c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -485,6 +485,8 @@ static inline void really_put_req(struct kioctx *ctx, struct kiocb *req)
 {
 	assert_spin_locked(&ctx->ctx_lock);
 
+	if (req->ki_eventfd != NULL)
+		eventfd_ctx_put(req->ki_eventfd);
 	if (req->ki_dtor)
 		req->ki_dtor(req);
 	if (req->ki_iovec != &req->ki_inline_vec)
@@ -509,8 +511,6 @@ static void aio_fput_routine(struct work_struct *data)
 		/* Complete the fput(s) */
 		if (req->ki_filp != NULL)
 			__fput(req->ki_filp);
-		if (req->ki_eventfd != NULL)
-			__fput(req->ki_eventfd);
 
 		/* Link the iocb into the context's free list */
 		spin_lock_irq(&ctx->ctx_lock);
@@ -528,8 +528,6 @@ static void aio_fput_routine(struct work_struct *data)
  */
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
-	int schedule_putreq = 0;
-
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
 		req, atomic_long_read(&req->ki_filp->f_count));
 
@@ -549,24 +547,16 @@ static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 	 * we would not be holding the last reference to the file*, so
 	 * this function will be executed w/out any aio kthread wakeup.
 	 */
-	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
-		schedule_putreq++;
-	else
-		req->ki_filp = NULL;
-	if (req->ki_eventfd != NULL) {
-		if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
-			schedule_putreq++;
-		else
-			req->ki_eventfd = NULL;
-	}
-	if (unlikely(schedule_putreq)) {
+	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
 		spin_unlock(&fput_lock);
 		queue_work(aio_wq, &fput_work);
-	} else
+	} else {
+		req->ki_filp = NULL;
 		really_put_req(ctx, req);
+	}
 	return 1;
 }
 
@@ -1622,7 +1612,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		 * an eventfd() fd, and will be signaled for each completed
 		 * event using the eventfd_signal() function.
 		 */
-		req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
+		req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
 		if (IS_ERR(req->ki_eventfd)) {
 			ret = PTR_ERR(req->ki_eventfd);
 			req->ki_eventfd = NULL;
diff --git a/fs/eventfd.c b/fs/eventfd.c
index 72f5f8d..31d12de 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -14,35 +14,44 @@
 #include <linux/list.h>
 #include <linux/spinlock.h>
 #include <linux/anon_inodes.h>
-#include <linux/eventfd.h>
 #include <linux/syscalls.h>
 #include <linux/module.h>
+#include <linux/kref.h>
+#include <linux/eventfd.h>
 
 struct eventfd_ctx {
+	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
 	 * Every time that a write(2) is performed on an eventfd, the
 	 * value of the __u64 being written is added to "count" and a
 	 * wakeup is performed on "wqh". A read(2) will return the "count"
 	 * value to userspace, and will reset "count" to zero. The kernel
-	 * size eventfd_signal() also, adds to the "count" counter and
+	 * side eventfd_signal() also adds to the "count" counter and
 	 * issue a wakeup.
 	 */
 	__u64 count;
 	unsigned int flags;
 };
 
-/*
- * Adds "n" to the eventfd counter "count". Returns "n" in case of
- * success, or a value lower then "n" in case of coutner overflow.
- * This function is supposed to be called by the kernel in paths
- * that do not allow sleeping. In this function we allow the counter
- * to reach the ULLONG_MAX value, and we signal this as overflow
- * condition by returining a POLLERR to poll(2).
+/**
+ * eventfd_signal - Adds @n to the eventfd counter.
+ * @ctx: [in] Pointer to the eventfd context.
+ * @n: [in] Value of the counter to be added to the eventfd internal counter.
+ *          The value cannot be negative.
+ *
+ * This function is supposed to be called by the kernel in paths that do not
+ * allow sleeping. In this function we allow the counter to reach the ULLONG_MAX
+ * value, and we signal this as an overflow condition by returning a POLLERR
+ * to poll(2).
+ *
+ * Returns @n in case of success, a non-negative number lower than @n in case
+ * of overflow, or the following error codes:
+ *
+ * -EINVAL    : The value of @n is negative.
  */
-int eventfd_signal(struct file *file, int n)
+int eventfd_signal(struct eventfd_ctx *ctx, int n)
 {
-	struct eventfd_ctx *ctx = file->private_data;
 	unsigned long flags;
 
 	if (n < 0)
@@ -59,17 +68,45 @@ int eventfd_signal(struct file *file, int n)
 }
 EXPORT_SYMBOL_GPL(eventfd_signal);
 
+static void eventfd_free(struct kref *kref)
+{
+	struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+	kfree(ctx);
+}
+
+/**
+ * eventfd_ctx_get - Acquires a reference to the internal eventfd context.
+ * @ctx: [in] Pointer to the eventfd context.
+ *
+ * Returns: In case of success, returns a pointer to the eventfd context.
+ */
+struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx)
+{
+	kref_get(&ctx->kref);
+	return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_get);
+
+/**
+ * eventfd_ctx_put - Releases a reference to the internal eventfd context.
+ * @ctx: [in] Pointer to eventfd context.
+ *
+ * The eventfd context reference must have been previously acquired either
+ * with eventfd_ctx_get() or eventfd_ctx_fdget().
+ */
+void eventfd_ctx_put(struct eventfd_ctx *ctx)
+{
+	kref_put(&ctx->kref, eventfd_free);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_put);
+
 static int eventfd_release(struct inode *inode, struct file *file)
 {
 	struct eventfd_ctx *ctx = file->private_data;
 
-	/*
-	 * No need to hold the lock here, since we are on the file cleanup
-	 * path and the ones still attached to the wait queue will be
-	 * serialized by wake_up_locked_poll().
-	 */
-	wake_up_locked_poll(&ctx->wqh, POLLHUP);
-	kfree(ctx);
+	wake_up_poll(&ctx->wqh, POLLHUP);
+	eventfd_ctx_put(ctx);
 	return 0;
 }
 
@@ -193,6 +230,16 @@ static const struct file_operations eventfd_fops = {
 	.write		= eventfd_write,
 };
 
+/**
+ * eventfd_fget - Acquire a reference of an eventfd file descriptor.
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns a pointer to the eventfd file structure in case of success, or the
+ * following error pointer:
+ *
+ * -EBADF    : Invalid @fd file descriptor.
+ * -EINVAL   : The @fd file descriptor is not an eventfd file.
+ */
 struct file *eventfd_fget(int fd)
 {
 	struct file *file;
@@ -209,6 +256,48 @@ struct file *eventfd_fget(int fd)
 }
 EXPORT_SYMBOL_GPL(eventfd_fget);
 
+/**
+ * eventfd_ctx_fdget - Acquires a reference to the internal eventfd context.
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns a pointer to the internal eventfd context, otherwise the error
+ * pointers returned by the following functions:
+ *
+ * eventfd_fget
+ */
+struct eventfd_ctx *eventfd_ctx_fdget(int fd)
+{
+	struct file *file;
+	struct eventfd_ctx *ctx;
+
+	file = eventfd_fget(fd);
+	if (IS_ERR(file))
+		return (struct eventfd_ctx *) file;
+	ctx = eventfd_ctx_get(file->private_data);
+	fput(file);
+
+	return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_fdget);
+
+/**
+ * eventfd_ctx_fileget - Acquires a reference to the internal eventfd context.
+ * @file: [in] Eventfd file pointer.
+ *
+ * Returns a pointer to the internal eventfd context, otherwise the error
+ * pointer:
+ *
+ * -EINVAL   : The @file pointer is not an eventfd file.
+ */
+struct eventfd_ctx *eventfd_ctx_fileget(struct file *file)
+{
+	if (file->f_op != &eventfd_fops)
+		return ERR_PTR(-EINVAL);
+
+	return eventfd_ctx_get(file->private_data);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
+
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
 	int fd;
@@ -225,6 +314,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 	if (!ctx)
 		return -ENOMEM;
 
+	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
 	ctx->count = count;
 	ctx->flags = flags;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index b16a957..47f7d93 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -121,9 +121,9 @@ struct kiocb {
 
 	/*
 	 * If the aio_resfd field of the userspace iocb is not zero,
-	 * this is the underlying file* to deliver event to.
+	 * this is the underlying eventfd context to deliver events to.
 	 */
-	struct file		*ki_eventfd;
+	struct eventfd_ctx	*ki_eventfd;
 };
 
 #define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..3b85ba6 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,10 +8,8 @@
 #ifndef _LINUX_EVENTFD_H
 #define _LINUX_EVENTFD_H
 
-#ifdef CONFIG_EVENTFD
-
-/* For O_CLOEXEC and O_NONBLOCK */
 #include <linux/fcntl.h>
+#include <linux/file.h>
 
 /*
  * CAREFUL: Check include/asm-generic/fcntl.h when defining
@@ -27,16 +25,37 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+#ifdef CONFIG_EVENTFD
+
+struct eventfd_ctx *eventfd_ctx_get(struct eventfd_ctx *ctx);
+void eventfd_ctx_put(struct eventfd_ctx *ctx);
 struct file *eventfd_fget(int fd);
-int eventfd_signal(struct file *file, int n);
+struct eventfd_ctx *eventfd_ctx_fdget(int fd);
+struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
+int eventfd_signal(struct eventfd_ctx *ctx, int n);
 
 #else /* CONFIG_EVENTFD */
 
-#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
-static inline int eventfd_signal(struct file *file, int n)
-{ return 0; }
+/*
+ * Ugly ugly ugly error layer to support modules that use eventfd but
+ * pretend to work in !CONFIG_EVENTFD configurations. Namely, AIO.
+ */
+static inline struct eventfd_ctx *eventfd_ctx_fdget(int fd)
+{
+	return ERR_PTR(-ENOSYS);
+}
+
+static inline int eventfd_signal(struct eventfd_ctx *ctx, int n)
+{
+	return -ENOSYS;
+}
+
+static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)
+{
+
+}
 
-#endif /* CONFIG_EVENTFD */
+#endif
 
 #endif /* _LINUX_EVENTFD_H */
 



* [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-25 13:28 [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
  2009-06-25 13:28 ` [KVM PATCH v5 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
  2009-06-25 13:28 ` [KVM PATCH v5 2/4] eventfd - revised interface and cleanups (4th rev) Gregory Haskins
@ 2009-06-25 13:28 ` Gregory Haskins
  2009-06-26 14:05   ` Gregory Haskins
                     ` (2 more replies)
  2009-06-25 13:28 ` [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature Gregory Haskins
  2009-06-25 13:59 ` [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
  4 siblings, 3 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-25 13:28 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, mst, avi, paulmck, davidel, rusty

eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback.  This lets eventfd clients know that the eventfd is about
to go away, which is particularly useful for in-kernel clients.  However,
until recently it was not possible to use this feature of eventfd in a
race-free way.  This patch utilizes a new eventfd interface to rectify
the problem.
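
In condensed form, the flow implemented below is (a sketch only; see the
diff for the real code):

	/* at assign time: pin the eventfd context, not the file */
	irqfd->eventfd = eventfd_ctx_fileget(file);

	/* in the wakeup callback, on POLLHUP (wqh->lock already held): */
	__remove_wait_queue(irqfd->wqh, &irqfd->wait);	/* no more events */
	slow_work_enqueue(&irqfd->shutdown);		/* defer teardown */

	/* in the deferred shutdown context: */
	flush_work(&irqfd->inject);		/* sync outstanding injects */
	eventfd_ctx_put(irqfd->eventfd);	/* drop the ctx reference */
	kfree(irqfd);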

Note that one final race is known to exist: the slow-work thread may race
with module removal.  We are currently working with slow-work upstream
to fix this issue as well.  Since the code prior to this patch also
races with module_put(), we are not making anything worse, but rather
shifting the cause of the race.  Once the slow-work code is patched we
will be fixing the last remaining issue.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 include/linux/kvm_host.h |    7 +-
 virt/kvm/Kconfig         |    1 
 virt/kvm/eventfd.c       |  199 ++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 179 insertions(+), 28 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2451f48..d94ee72 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -141,7 +141,12 @@ struct kvm {
 	struct kvm_io_bus mmio_bus;
 	struct kvm_io_bus pio_bus;
 #ifdef CONFIG_HAVE_KVM_EVENTFD
-	struct list_head irqfds;
+	struct {
+		spinlock_t        lock;
+		struct list_head  items;
+		atomic_t          outstanding;
+		wait_queue_head_t wqh;
+	} irqfds;
 #endif
 	struct kvm_vm_stat stat;
 	struct kvm_arch arch;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index daece36..ab7848a 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -9,6 +9,7 @@ config HAVE_KVM_IRQCHIP
 config HAVE_KVM_EVENTFD
        bool
        select EVENTFD
+       select SLOW_WORK
 
 config KVM_APIC_ARCHITECTURE
        bool
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9656027..ca21e8a 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -28,6 +28,7 @@
 #include <linux/file.h>
 #include <linux/list.h>
 #include <linux/eventfd.h>
+#include <linux/slow-work.h>
 
 /*
  * --------------------------------------------------------------------
@@ -36,17 +37,36 @@
  * Credit goes to Avi Kivity for the original idea.
  * --------------------------------------------------------------------
  */
+
 struct _irqfd {
 	struct kvm               *kvm;
+	struct eventfd_ctx       *eventfd;
 	int                       gsi;
 	struct list_head          list;
 	poll_table                pt;
 	wait_queue_head_t        *wqh;
 	wait_queue_t              wait;
 	struct work_struct        inject;
+	struct slow_work          shutdown;
+	int                       active:1;
 };
 
 static void
+irqfd_release(struct _irqfd *irqfd)
+{
+	eventfd_ctx_put(irqfd->eventfd);
+	kfree(irqfd);
+}
+
+/* assumes kvm->irqfds.lock is held */
+static void
+irqfd_deactivate(struct _irqfd *irqfd)
+{
+	irqfd->active = false;
+	list_del(&irqfd->list);
+}
+
+static void
 irqfd_inject(struct work_struct *work)
 {
 	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
@@ -58,6 +78,55 @@ irqfd_inject(struct work_struct *work)
 	mutex_unlock(&kvm->irq_lock);
 }
 
+static struct _irqfd *
+work_to_irqfd(struct slow_work *work)
+{
+	return container_of(work, struct _irqfd, shutdown);
+}
+
+static int
+irqfd_shutdown_get_ref(struct slow_work *work)
+{
+	struct _irqfd *irqfd = work_to_irqfd(work);
+	struct kvm *kvm = irqfd->kvm;
+
+	atomic_inc(&kvm->irqfds.outstanding);
+
+	return 0;
+}
+
+static void
+irqfd_shutdown_put_ref(struct slow_work *work)
+{
+	struct _irqfd *irqfd = work_to_irqfd(work);
+	struct kvm *kvm = irqfd->kvm;
+
+	irqfd_release(irqfd);
+
+	if (atomic_dec_and_test(&kvm->irqfds.outstanding))
+		wake_up(&kvm->irqfds.wqh);
+}
+
+static void
+irqfd_shutdown_execute(struct slow_work *work)
+{
+	struct _irqfd *irqfd = work_to_irqfd(work);
+
+	/*
+	 * Ensure there are no outstanding "inject" work-items before we blow
+	 * away our state.  Once this job completes, the slow_work
+	 * infrastructure will drop the irqfd object completely via put_ref
+	 */
+	flush_work(&irqfd->inject);
+}
+
+const static struct slow_work_ops irqfd_shutdown_work_ops = {
+	.get_ref = irqfd_shutdown_get_ref,
+	.put_ref = irqfd_shutdown_put_ref,
+	.execute = irqfd_shutdown_execute,
+};
+
+
 static int
 irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
@@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
 	unsigned long flags = (unsigned long)key;
 
 	/*
-	 * Assume we will be called with interrupts disabled
+	 * Called with interrupts disabled
 	 */
 	if (flags & POLLIN)
-		/*
-		 * Defer the IRQ injection until later since we need to
-		 * acquire the kvm->lock to do so.
-		 */
+		/* An event has been signaled, inject an interrupt */
 		schedule_work(&irqfd->inject);
 
 	if (flags & POLLHUP) {
-		/*
-		 * for now, just remove ourselves from the list and let
-		 * the rest dangle.  We will fix this up later once
-		 * the races in eventfd are fixed
-		 */
+		/* The eventfd is closing, detach from KVM */
+		struct kvm *kvm = irqfd->kvm;
+		unsigned long flags;
+
 		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
-		irqfd->wqh = NULL;
+
+		spin_lock_irqsave(&kvm->irqfds.lock, flags);
+
+		if (irqfd->active) {
+			/*
+			 * If the item is still active we can be sure that
+			 * no-one else is trying to shutdown this object at
+			 * the same time.
+			 *
+			 * Defer the shutdown to a thread so we can flush
+			 * all remaining inject jobs.  We use a slow-work
+			 * item to prevent a deadlock against the work-queue
+			 */
+			irqfd_deactivate(irqfd);
+			slow_work_enqueue(&irqfd->shutdown);
+		}
+
+		spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
 	}
 
+
 	return 0;
 }
 
@@ -102,6 +185,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
 {
 	struct _irqfd *irqfd;
 	struct file *file = NULL;
+	struct eventfd_ctx *eventfd = NULL;
 	int ret;
 	unsigned int events;
 
@@ -113,6 +197,8 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
 	INIT_WORK(&irqfd->inject, irqfd_inject);
+	slow_work_init(&irqfd->shutdown, &irqfd_shutdown_work_ops);
+	irqfd->active = true;
 
 	file = eventfd_fget(fd);
 	if (IS_ERR(file)) {
@@ -129,12 +215,21 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
 
 	events = file->f_op->poll(file, &irqfd->pt);
 
-	mutex_lock(&kvm->lock);
-	list_add_tail(&irqfd->list, &kvm->irqfds);
-	mutex_unlock(&kvm->lock);
+	spin_lock_irq(&kvm->irqfds.lock);
+	list_add_tail(&irqfd->list, &kvm->irqfds.items);
+	spin_unlock_irq(&kvm->irqfds.lock);
+
+	eventfd = eventfd_ctx_fileget(file);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto fail;
+	}
+
+	irqfd->eventfd = eventfd;
 
 	/*
-	 * Check if there was an event already queued
+	 * Check if there was an event already pending on the eventfd
+	 * before we registered, and trigger it as if we didn't miss it.
 	 */
 	if (events & POLLIN)
 		schedule_work(&irqfd->inject);
@@ -148,6 +243,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
 	return 0;
 
 fail:
+	if (eventfd && !IS_ERR(eventfd))
+		eventfd_ctx_put(eventfd);
+
 	if (file && !IS_ERR(file))
 		fput(file);
 
@@ -158,24 +256,71 @@ fail:
 void
 kvm_irqfd_init(struct kvm *kvm)
 {
-	INIT_LIST_HEAD(&kvm->irqfds);
+	slow_work_register_user();
+
+	spin_lock_init(&kvm->irqfds.lock);
+	INIT_LIST_HEAD(&kvm->irqfds.items);
+	atomic_set(&kvm->irqfds.outstanding, 0);
+	init_waitqueue_head(&kvm->irqfds.wqh);
+}
+
+static struct _irqfd *
+irqfd_pop(struct kvm *kvm)
+{
+	struct _irqfd *irqfd = NULL;
+
+	spin_lock_irq(&kvm->irqfds.lock);
+
+	if (!list_empty(&kvm->irqfds.items)) {
+		irqfd = list_first_entry(&kvm->irqfds.items,
+					 struct _irqfd, list);
+		irqfd_deactivate(irqfd);
+	}
+
+	spin_unlock_irq(&kvm->irqfds.lock);
+
+	return irqfd;
+}
+
+/*
+ * locally releases the irqfd
+ *
+ * This function is called when KVM won the race with eventfd (signalled by
+ * finding the item active on the kvm->irqfds.items list). We are now guaranteed
+ * that we will never schedule a deferred shutdown task against this object,
+ * so we take steps to perform the shutdown ourselves.
+ *
+ * 1) We must remove ourselves from the wait-queue to prevent further events,
+ *    which will simultaneously act to sync us with eventfd (via wqh->lock)
+ * 2) Flush any outstanding inject-tasks to ensure it's safe to free memory
+ * 3) Delete the object
+ */
+static void
+irqfd_shutdown(struct _irqfd *irqfd)
+{
+	remove_wait_queue(irqfd->wqh, &irqfd->wait);
+	flush_work(&irqfd->inject);
+	irqfd_release(irqfd);
 }
 
 void
 kvm_irqfd_release(struct kvm *kvm)
 {
-	struct _irqfd *irqfd, *tmp;
-
-	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
-		if (irqfd->wqh)
-			remove_wait_queue(irqfd->wqh, &irqfd->wait);
+	struct _irqfd *irqfd;
 
-		flush_work(&irqfd->inject);
+	/*
+	 * Shutdown all irqfds that still remain
+	 */
+	while ((irqfd = irqfd_pop(kvm)))
+		irqfd_shutdown(irqfd);
 
-		mutex_lock(&kvm->lock);
-		list_del(&irqfd->list);
-		mutex_unlock(&kvm->lock);
+	/*
+	 * irqfds.outstanding tracks the number of outstanding "shutdown"
+	 * jobs pending at any given time.  Once we get here, we know that
+	 * no more jobs will get scheduled, so go ahead and block until all
+	 * of them complete
+	 */
+	wait_event(kvm->irqfds.wqh, (!atomic_read(&kvm->irqfds.outstanding)));
 
-		kfree(irqfd);
-	}
+	slow_work_unregister_user();
 }



* [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature
  2009-06-25 13:28 [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
                   ` (2 preceding siblings ...)
  2009-06-25 13:28 ` [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
@ 2009-06-25 13:28 ` Gregory Haskins
  2009-06-28 10:46   ` Michael S. Tsirkin
  2009-06-25 13:59 ` [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
  4 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2009-06-25 13:28 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, mst, avi, paulmck, davidel, rusty

DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
eventfd without destroying the eventfd in the process.  This is useful
for conditions like live migration, which may have an eventfd associated
with a device and an IRQFD.  We need to be able to decouple the guest
from the event source without perturbing the event source itself.
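
From userspace, both directions go through the same ioctl.  A sketch
(assuming the KVM_IRQFD ioctl wiring that accompanies this series, a VM
fd obtained via KVM_CREATE_VM, and an eventfd in 'efd'):

	struct kvm_irqfd data = {
		.fd    = efd,
		.gsi   = gsi,
		.flags = 0,			/* assign */
	};

	ioctl(vm_fd, KVM_IRQFD, &data);		/* attach eventfd -> gsi */
	/* ... later, e.g. at migration time ... */
	data.flags = KVM_IRQFD_FLAG_DEASSIGN;
	ioctl(vm_fd, KVM_IRQFD, &data);		/* detach; eventfd survives */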

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Michael S. Tsirkin <mst@redhat.com>
---

 include/linux/kvm.h |    2 ++
 virt/kvm/eventfd.c  |   56 +++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 38ff31e..6710518 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -490,6 +490,8 @@ struct kvm_x86_mce {
 };
 #endif
 
+#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+
 struct kvm_irqfd {
 	__u32 fd;
 	__u32 gsi;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index ca21e8a..2d4549c 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -180,8 +180,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 	add_wait_queue(wqh, &irqfd->wait);
 }
 
-int
-kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+static int
+kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 {
 	struct _irqfd *irqfd;
 	struct file *file = NULL;
@@ -303,6 +303,58 @@ irqfd_shutdown(struct _irqfd *irqfd)
 	irqfd_release(irqfd);
 }
 
+/*
+ * acquires and releases kvm->irqfds.lock internally
+ */
+static struct _irqfd *
+irqfd_find(struct kvm *kvm, int fd, int gsi)
+{
+	struct _irqfd *irqfd, *tmp, *ret = ERR_PTR(-ENOENT);
+	struct eventfd_ctx *eventfd;
+
+	eventfd = eventfd_ctx_fdget(fd);
+	if (IS_ERR(eventfd))
+		return ERR_PTR(PTR_ERR(eventfd));
+
+	spin_lock_irq(&kvm->irqfds.lock);
+
+	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
+		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi) {
+			irqfd_deactivate(irqfd);
+			ret = irqfd;
+			break;
+		}
+	}
+
+	spin_unlock_irq(&kvm->irqfds.lock);
+	eventfd_ctx_put(eventfd);
+
+	return ret;
+}
+
+static int
+kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
+{
+	struct _irqfd *irqfd;
+
+	irqfd = irqfd_find(kvm, fd, gsi);
+	if (IS_ERR(irqfd))
+		return PTR_ERR(irqfd);
+
+	irqfd_shutdown(irqfd);
+
+	return 0;
+}
+
+int
+kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
+{
+	if (flags & KVM_IRQFD_FLAG_DEASSIGN)
+		return kvm_irqfd_deassign(kvm, fd, gsi);
+
+	return kvm_irqfd_assign(kvm, fd, gsi);
+}
+
 void
 kvm_irqfd_release(struct kvm *kvm)
 {



* Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements
  2009-06-25 13:28 [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
                   ` (3 preceding siblings ...)
  2009-06-25 13:28 ` [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature Gregory Haskins
@ 2009-06-25 13:59 ` Gregory Haskins
  2009-06-25 16:44   ` Davide Libenzi
  2009-06-28 11:03   ` Avi Kivity
  4 siblings, 2 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-25 13:59 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, mst, avi, paulmck, davidel, rusty


Gregory Haskins wrote:
> (Applies to kvm.git/master:4631e094)
>
> The following is the latest attempt to fix the races in irqfd/eventfd, as
> well as restore DEASSIGN support.  For more details, please read the patch
> headers.
>
> This series has been tested against the kvm-eventfd unit test, and
> appears to be functioning properly.  You can download this test here:
>
> ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2
>
> I've included version 4 of Davide's eventfd patch (ported to kvm.git) so
> that it's a complete, reviewable series.  Note, however, that there may be
> later versions of his patch to consider for merging, so we should
> coordinate with him.
>   

So I know we talked yesterday in the review session about whether it was
actually worth all this complexity to deal with the POLLHUP or if we
should just revert to the prior "two syscall" model and be done with
it.  Rusty reflected these same sentiments this morning in response to
Davide's patch in a different thread.

I am a bit torn myself, tbh.  I do feel as though I have a good handle
on the issue and that it is indeed now fixed (at least, if this series
is applied and the slow-work issue is fixed, still pending upstream
ACK).  I have a lot invested in going the POLLHUP direction, having spent
so much time thinking about the problem and working on the patches, so I
have a bit of a biased opinion, I know.

The reason why I am pushing this series out now is at least partly so we
can tie up these loose ends.  We have both solutions in front of us and
can make a decision either way.  At least the solution is formally
documented in the internet archives forever this way ;)

I took the review comments to heart that the shutdown code was
substantially larger and more complex than the actual fast-path code.  I
went through last night and simplified and clarified it.  I think the
latest result is leaner and clearer, so please give it another review
(particularly for races) before dismissing it.

Ultimately, I think the concept of a release notification for eventfd is
a good thing for all eventfd users, so I don't think this thing should
go away per se even if irqfd decides to not use it. 

-Greg




* Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements
  2009-06-25 13:59 ` [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
@ 2009-06-25 16:44   ` Davide Libenzi
  2009-06-28 11:03   ` Avi Kivity
  1 sibling, 0 replies; 28+ messages in thread
From: Davide Libenzi @ 2009-06-25 16:44 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: kvm, Linux Kernel Mailing List, mst, avi, paulmck, Rusty Russell

On Thu, 25 Jun 2009, Gregory Haskins wrote:

> So I know we talked yesterday in the review session about whether it was
> actually worth all this complexity to deal with the POLLHUP or if we
> should just revert to the prior "two syscall" model and be done with
> it.  Rusty reflected these same sentiments this morning in response to
> Davide's patch in a different thread.
> 
> I am a bit torn myself, tbh.  I do feel as though I have a good handle
> on the issue and that it is indeed now fixed (at least, if this series
> is applied and the slow-work issue is fixed, still pending upstream
> ACK).  I have a lot invested in going the POLLHUP direction, having spent
> so much time thinking about the problem and working on the patches, so I
> have a bit of a biased opinion, I know.
> 
> The reason why I am pushing this series out now is at least partly so we
> can tie up these loose ends.  We have both solutions in front of us and
> can make a decision either way.  At least the solution is formally
> documented in the internet archives forever this way ;)
> 
> I took the review comments to heart that the shutdown code was
> substantially larger and more complex than the actual fast-path code.  I
> went through last night and simplified and clarified it.  I think the
> latest result is leaner and clearer, so please give it another review
> (particularly for races) before dismissing it.
> 
> Ultimately, I think the concept of a release notification for eventfd is
> a good thing for all eventfd users, so I don't think this thing should
> go away per se even if irqfd decides to not use it. 

Whatever you guys decide is fine for me.  Next time, though, I think I'll
wait a month or so before taking any action :)


- Davide




* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-25 13:28 ` [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
@ 2009-06-26 14:05   ` Gregory Haskins
  2009-06-28 11:06   ` Michael S. Tsirkin
  2009-06-28 11:48   ` Michael S. Tsirkin
  2 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-26 14:05 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, mst, avi, paulmck, davidel, rusty


Gregory Haskins wrote:
> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> "release" callback.  This lets eventfd clients know that the eventfd is about
> to go away, which is particularly useful for in-kernel clients.  However,
> until recently it was not possible to use this feature of eventfd in a
> race-free way.  This patch utilizes a new eventfd interface to rectify
> the problem.
>
> Note that one final race is known to exist: the slow-work thread may race
> with module removal.  We are currently working with slow-work upstream
> to fix this issue as well.  Since the code prior to this patch also
> races with module_put(), we are not making anything worse, but rather
> shifting the cause of the race.  Once the slow-work code is patched we
> will be fixing the last remaining issue.
>
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> ---
>
>  include/linux/kvm_host.h |    7 +-
>  virt/kvm/Kconfig         |    1 
>  virt/kvm/eventfd.c       |  199 ++++++++++++++++++++++++++++++++++++++++------
>  3 files changed, 179 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2451f48..d94ee72 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -141,7 +141,12 @@ struct kvm {
>  	struct kvm_io_bus mmio_bus;
>  	struct kvm_io_bus pio_bus;
>  #ifdef CONFIG_HAVE_KVM_EVENTFD
> -	struct list_head irqfds;
> +	struct {
> +		spinlock_t        lock;
> +		struct list_head  items;
> +		atomic_t          outstanding;
> +		wait_queue_head_t wqh;
> +	} irqfds;
>  #endif
>  	struct kvm_vm_stat stat;
>  	struct kvm_arch arch;
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index daece36..ab7848a 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -9,6 +9,7 @@ config HAVE_KVM_IRQCHIP
>  config HAVE_KVM_EVENTFD
>         bool
>         select EVENTFD
> +       select SLOW_WORK
>  
>  config KVM_APIC_ARCHITECTURE
>         bool
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 9656027..ca21e8a 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -28,6 +28,7 @@
>  #include <linux/file.h>
>  #include <linux/list.h>
>  #include <linux/eventfd.h>
> +#include <linux/slow-work.h>
>  
>  /*
>   * --------------------------------------------------------------------
> @@ -36,17 +37,36 @@
>   * Credit goes to Avi Kivity for the original idea.
>   * --------------------------------------------------------------------
>   */
> +
>  struct _irqfd {
>  	struct kvm               *kvm;
> +	struct eventfd_ctx       *eventfd;
>  	int                       gsi;
>  	struct list_head          list;
>  	poll_table                pt;
>  	wait_queue_head_t        *wqh;
>  	wait_queue_t              wait;
>  	struct work_struct        inject;
> +	struct slow_work          shutdown;
> +	int                       active:1;
>  };
>  
>  static void
> +irqfd_release(struct _irqfd *irqfd)
> +{
> +	eventfd_ctx_put(irqfd->eventfd);
> +	kfree(irqfd);
> +}
> +
> +/* assumes kvm->irqfds.lock is held */
> +static void
> +irqfd_deactivate(struct _irqfd *irqfd)
> +{
> +	irqfd->active = false;
> +	list_del(&irqfd->list);
> +}
> +
> +static void
>  irqfd_inject(struct work_struct *work)
>  {
>  	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
> @@ -58,6 +78,55 @@ irqfd_inject(struct work_struct *work)
>  	mutex_unlock(&kvm->irq_lock);
>  }
>  
> +static struct _irqfd *
> +work_to_irqfd(struct slow_work *work)
> +{
> +	return container_of(work, struct _irqfd, shutdown);
> +}
> +
> +static int
> +irqfd_shutdown_get_ref(struct slow_work *work)
> +{
> +	struct _irqfd *irqfd = work_to_irqfd(work);
> +	struct kvm *kvm = irqfd->kvm;
> +
> +	atomic_inc(&kvm->irqfds.outstanding);
> +
> +	return 0;
> +}
> +
> +static void
> +irqfd_shutdown_put_ref(struct slow_work *work)
> +{
> +	struct _irqfd *irqfd = work_to_irqfd(work);
> +	struct kvm *kvm = irqfd->kvm;
> +
> +	irqfd_release(irqfd);
> +
> +	if (atomic_dec_and_test(&kvm->irqfds.outstanding))
> +		wake_up(&kvm->irqfds.wqh);
> +}
> +
> +static void
> +irqfd_shutdown_execute(struct slow_work *work)
> +{
> +	struct _irqfd *irqfd = work_to_irqfd(work);
> +
> +	/*
> +	 * Ensure there are no outstanding "inject" work-items before we blow
> +	 * away our state.  Once this job completes, the slow_work
> +	 * infrastructure will drop the irqfd object completely via put_ref
> +	 */
> +	flush_work(&irqfd->inject);
> +}
> +
> +const static struct slow_work_ops irqfd_shutdown_work_ops = {
> +	.get_ref = irqfd_shutdown_get_ref,
> +	.put_ref = irqfd_shutdown_put_ref,
> +	.execute = irqfd_shutdown_execute,
> +};
> +
> +
>  static int
>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>  {
> @@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>  	unsigned long flags = (unsigned long)key;
>  
>  	/*
> -	 * Assume we will be called with interrupts disabled
> +	 * Called with interrupts disabled
>  	 */
>  	if (flags & POLLIN)
> -		/*
> -		 * Defer the IRQ injection until later since we need to
> -		 * acquire the kvm->lock to do so.
> -		 */
> +		/* An event has been signaled, inject an interrupt */
>  		schedule_work(&irqfd->inject);
>  
>  	if (flags & POLLHUP) {
> -		/*
> -		 * for now, just remove ourselves from the list and let
> -		 * the rest dangle.  We will fix this up later once
> -		 * the races in eventfd are fixed
> -		 */
> +		/* The eventfd is closing, detach from KVM */
> +		struct kvm *kvm = irqfd->kvm;
> +		unsigned long flags;
> +
>  		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
> -		irqfd->wqh = NULL;
> +
> +		spin_lock_irqsave(&kvm->irqfds.lock, flags);
> +
> +		if (irqfd->active) {
> +			/*
> +			 * If the item is still active we can be sure that
> +			 * no-one else is trying to shutdown this object at
> +			 * the same time.
> +			 *
> +			 * Defer the shutdown to a thread so we can flush
> +			 * all remaining inject jobs.  We use a slow-work
> +			 * item to prevent a deadlock against the work-queue
> +			 */
> +			irqfd_deactivate(irqfd);
> +			slow_work_enqueue(&irqfd->shutdown);
> +		}
> +
> +		spin_unlock_irqrestore(&kvm->irqfds.lock, flags);
>  	}
>  
> +
>  	return 0;
>  }
>  
> @@ -102,6 +185,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>  {
>  	struct _irqfd *irqfd;
>  	struct file *file = NULL;
> +	struct eventfd_ctx *eventfd = NULL;
>  	int ret;
>  	unsigned int events;
>  
> @@ -113,6 +197,8 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>  	irqfd->gsi = gsi;
>  	INIT_LIST_HEAD(&irqfd->list);
>  	INIT_WORK(&irqfd->inject, irqfd_inject);
> +	slow_work_init(&irqfd->shutdown, &irqfd_shutdown_work_ops);
> +	irqfd->active = true;
>  
>  	file = eventfd_fget(fd);
>  	if (IS_ERR(file)) {
> @@ -129,12 +215,21 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>  
>  	events = file->f_op->poll(file, &irqfd->pt);
>  
> -	mutex_lock(&kvm->lock);
> -	list_add_tail(&irqfd->list, &kvm->irqfds);
> -	mutex_unlock(&kvm->lock);
> +	spin_lock_irq(&kvm->irqfds.lock);
> +	list_add_tail(&irqfd->list, &kvm->irqfds.items);
> +	spin_unlock_irq(&kvm->irqfds.lock);
> +
> +	eventfd = eventfd_ctx_fileget(file);
> +	if (IS_ERR(file)) {
> +		ret = PTR_ERR(file);
> +		goto fail;
> +	}
> +
> +	irqfd->eventfd = eventfd;
>   

<sigh> I just noticed this while doing a self-review:  I need to assign
the eventfd context *before* putting the item on the list.  Not sure why
I even did that.  I suspect I re-arranged the code at the last minute
and didn't notice what a dumb thing I was doing.
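
Presumably v6 would swap the order, something like this (sketch; note it
also checks the right variable after eventfd_ctx_fileget()):

	eventfd = eventfd_ctx_fileget(file);
	if (IS_ERR(eventfd)) {
		ret = PTR_ERR(eventfd);
		goto fail;
	}
	irqfd->eventfd = eventfd;

	spin_lock_irq(&kvm->irqfds.lock);
	list_add_tail(&irqfd->list, &kvm->irqfds.items);
	spin_unlock_irq(&kvm->irqfds.lock);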

So this will need at least a v6, but I will wait to hear if there are
any other comments from Michael et. al.

-Greg

>  
>  	/*
> -	 * Check if there was an event already queued
> +	 * Check if there was an event already pending on the eventfd
> +	 * before we registered, and trigger it as if we didn't miss it.
>  	 */
>  	if (events & POLLIN)
>  		schedule_work(&irqfd->inject);
> @@ -148,6 +243,9 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>  	return 0;
>  
>  fail:
> +	if (eventfd && !IS_ERR(eventfd))
> +		eventfd_ctx_put(eventfd);
> +
>  	if (file && !IS_ERR(file))
>  		fput(file);
>  
> @@ -158,24 +256,71 @@ fail:
>  void
>  kvm_irqfd_init(struct kvm *kvm)
>  {
> -	INIT_LIST_HEAD(&kvm->irqfds);
> +	slow_work_register_user();
> +
> +	spin_lock_init(&kvm->irqfds.lock);
> +	INIT_LIST_HEAD(&kvm->irqfds.items);
> +	atomic_set(&kvm->irqfds.outstanding, 0);
> +	init_waitqueue_head(&kvm->irqfds.wqh);
> +}
> +
> +static struct _irqfd *
> +irqfd_pop(struct kvm *kvm)
> +{
> +	struct _irqfd *irqfd = NULL;
> +
> +	spin_lock_irq(&kvm->irqfds.lock);
> +
> +	if (!list_empty(&kvm->irqfds.items)) {
> +		irqfd = list_first_entry(&kvm->irqfds.items,
> +					 struct _irqfd, list);
> +		irqfd_deactivate(irqfd);
> +	}
> +
> +	spin_unlock_irq(&kvm->irqfds.lock);
> +
> +	return irqfd;
> +}
> +
> +/*
> + * locally releases the irqfd
> + *
> + * This function is called when KVM won the race with eventfd (signalled by
> + * finding the item active on the kvm->irqfds.items list). We are now guaranteed
> + * that we will never schedule a deferred shutdown task against this object,
> + * so we take steps to perform the shutdown ourselves.
> + *
> + * 1) We must remove ourselves from the wait-queue to prevent further events,
> + *    which will simultaneously act to sync us with eventfd (via wqh->lock)
> + * 2) Flush any outstanding inject-tasks to ensure it's safe to free memory
> + * 3) Delete the object
> + */
> +static void
> +irqfd_shutdown(struct _irqfd *irqfd)
> +{
> +	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +	flush_work(&irqfd->inject);
> +	irqfd_release(irqfd);
>  }
>  
>  void
>  kvm_irqfd_release(struct kvm *kvm)
>  {
> -	struct _irqfd *irqfd, *tmp;
> -
> -	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
> -		if (irqfd->wqh)
> -			remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +	struct _irqfd *irqfd;
>  
> -		flush_work(&irqfd->inject);
> +	/*
> +	 * Shutdown all irqfds that still remain
> +	 */
> +	while ((irqfd = irqfd_pop(kvm)))
> +		irqfd_shutdown(irqfd);
>  
> -		mutex_lock(&kvm->lock);
> -		list_del(&irqfd->list);
> -		mutex_unlock(&kvm->lock);
> +	/*
> +	 * irqfds.outstanding tracks the number of outstanding "shutdown"
> +	 * jobs pending at any given time.  Once we get here, we know that
> +	 * no more jobs will get scheduled, so go ahead and block until all
> +	 * of them complete
> +	 */
> +	wait_event(kvm->irqfds.wqh, (!atomic_read(&kvm->irqfds.outstanding)));
>  
> -		kfree(irqfd);
> -	}
> +	slow_work_unregister_user();
>  }
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>   





* Re: [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature
  2009-06-25 13:28 ` [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature Gregory Haskins
@ 2009-06-28 10:46   ` Michael S. Tsirkin
  2009-06-28 12:39     ` Gregory Haskins
  0 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 10:46 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, paulmck, davidel, rusty

On Thu, Jun 25, 2009 at 09:28:32AM -0400, Gregory Haskins wrote:
> DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
> eventfd without destroying the eventfd in the process.  This is useful
> for conditions like live migration, which may have an eventfd associated
> with a device and an IRQFD.  We need to be able to decouple the guest
> from the event source without perturbing the event source itself.
> 
> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> CC: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
>  include/linux/kvm.h |    2 ++
>  virt/kvm/eventfd.c  |   56 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 56 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 38ff31e..6710518 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -490,6 +490,8 @@ struct kvm_x86_mce {
>  };
>  #endif
>  
> +#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> +
>  struct kvm_irqfd {
>  	__u32 fd;
>  	__u32 gsi;
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index ca21e8a..2d4549c 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -180,8 +180,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>  	add_wait_queue(wqh, &irqfd->wait);
>  }
>  
> -int
> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> +static int
> +kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>  {
>  	struct _irqfd *irqfd;
>  	struct file *file = NULL;
> @@ -303,6 +303,58 @@ irqfd_shutdown(struct _irqfd *irqfd)
>  	irqfd_release(irqfd);
>  }
>  
> +/*
> + * acquires and releases kvm->irqfds.lock internally
> + */
> +static struct _irqfd *
> +irqfd_find(struct kvm *kvm, int fd, int gsi)
> +{
> +	struct _irqfd *irqfd, *tmp, *ret = ERR_PTR(-ENOENT);
> +	struct eventfd_ctx *eventfd;
> +
> +	eventfd = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(eventfd))
> +		return ERR_PTR(PTR_ERR(eventfd));
> +
> +	spin_lock_irq(&kvm->irqfds.lock);
> +
> +	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
> +		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi) {
> +			irqfd_deactivate(irqfd);
> +			ret = irqfd;
> +			break;
> +		}
> +	}
> +
> +	spin_unlock_irq(&kvm->irqfds.lock);
> +	eventfd_ctx_put(eventfd);
> +
> +	return ret;
> +}
> +
> +static int
> +kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
> +{
> +	struct _irqfd *irqfd;
> +
> +	irqfd = irqfd_find(kvm, fd, gsi);
> +	if (IS_ERR(irqfd))
> +		return PTR_ERR(irqfd);
> +
> +	irqfd_shutdown(irqfd);
> +
> +	return 0;
> +}


I think that, to make this work properly, you must make adding the
irqfd to the list the last thing you do.
As it is, when you assign an irqfd, the last thing you do is

        irqfd->eventfd = eventfd;

I think you should move this under the spinlock.

> +
> +int
> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
> +{
> +	if (flags & KVM_IRQFD_FLAG_DEASSIGN)
> +		return kvm_irqfd_deassign(kvm, fd, gsi);
> +
> +	return kvm_irqfd_assign(kvm, fd, gsi);
> +}
> +


At some point we discussed limiting the number of
irqfds that can be created in some way, so that userspace
cannot consume an unlimited amount of kernel memory.

What happened to that idea?

This will happen naturally if
- we keep the fget on the file until the irqfd goes away
- we allow the same file to be bound to only one irqfd
but there might be other ways to do this.
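
A sketch of the second option (untested, reusing names from this series)
would be to reject a duplicate binding during assign, under
kvm->irqfds.lock:

	struct _irqfd *tmp;

	spin_lock_irq(&kvm->irqfds.lock);
	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
		if (tmp->eventfd == eventfd) {
			/* this eventfd is already bound to an irqfd */
			spin_unlock_irq(&kvm->irqfds.lock);
			ret = -EBUSY;
			goto fail;
		}
	}
	list_add_tail(&irqfd->list, &kvm->irqfds.items);
	spin_unlock_irq(&kvm->irqfds.lock);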

-- 
MST


* Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements
  2009-06-25 13:59 ` [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
  2009-06-25 16:44   ` Davide Libenzi
@ 2009-06-28 11:03   ` Avi Kivity
  2009-06-28 12:59     ` Gregory Haskins
  1 sibling, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2009-06-28 11:03 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, mst, paulmck, davidel, rusty

On 06/25/2009 04:59 PM, Gregory Haskins wrote:
> Gregory Haskins wrote:
>    
>> (Applies to kvm.git/master:4631e094)
>>
>> The following is the latest attempt to fix the races in irqfd/eventfd, as
>> well as restore DEASSIGN support.  For more details, please read the patch
>> headers.
>>
>> This series has been tested against the kvm-eventfd unit test, and
>> appears to be functioning properly.  You can download this test here:
>>
>> ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2
>>
>> I've included version 4 of Davide's eventfd patch (ported to kvm.git) so
>> that it's a complete, reviewable series.  Note, however, that there may be
>> later versions of his patch to consider for merging, so we should
>> coordinate with him.
>>
>>      
>
> So I know we talked yesterday in the review session about whether it was
> actually worth all this complexity to deal with the POLLHUP or if we
> should just revert to the prior "two syscall" model and be done with
> it.  Rusty reflected these same sentiments this morning in response to
> Davide's patch in a different thread.
>
> I am a bit torn myself, tbh.  I do feel as though I have a good handle
> on the issue and that it is indeed now fixed (at least, if this series
> is applied and the slow-work issue is fixed, still pending upstream
> ACK).  I have a lot invested in going the POLLHUP direction having spent
> so much time thinking about the problem and working on the patches, so I
> have a bit of a biased opinion, I know.
>
> The reason why I am pushing this series out now is at least partly so we
> can tie up these loose ends.  We have both solutions in front of us and
> can make a decision either way.  At least the solution is formally
> documented in the internet archives forever this way ;)
>
> I took the review comments to heart that the shutdown code was
> substantially larger and more complex than the actual fast-path code.  I
> went through last night and simplified and clarified it.  I think the
> latest result is leaner and clearer, so please give it another review
> (particularly for races) before dismissing it.
>    

Yes, it's much nicer.  I can't say I'm certain it's race-free, but it's a
lot more digestible.

> Ultimately, I think the concept of a release notification for eventfd is
> a good thing for all eventfd users, so I don't think this thing should
> go away per se even if irqfd decides to not use it.
>    

I agree that we want POLLHUP support, it's better than holding on to the 
eventfd.  But I think we can make it even cleaner by merging it with 
deassign.  Basically, when we get POLLHUP, we launch a slow_work (or 
something) that does a regular deassign.  That slow_work can grab a ref 
to the vm, so we don't race with the VM disappearing.

But given that the current slow_work does almost nothing, I'm not sure 
it's worth it.
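A sketch of what that could look like (irqfd_shutdown_execute is an
invented name; kvm_get_kvm()/kvm_put_kvm() are the existing VM refcount
helpers):

	/* hypothetical slow-work handler: the POLLHUP path would take a VM
	 * reference with kvm_get_kvm() before calling slow_work_enqueue(),
	 * so the deferred deassign cannot race with VM destruction */
	static void irqfd_shutdown_execute(struct slow_work *work)
	{
		struct _irqfd *irqfd = container_of(work, struct _irqfd,
						    shutdown);
		struct kvm *kvm = irqfd->kvm;

		irqfd_shutdown(irqfd);	/* reuse the regular deassign teardown */
		kvm_put_kvm(kvm);	/* drop the ref; the VM may now go away */
	}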

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-25 13:28 ` [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
  2009-06-26 14:05   ` Gregory Haskins
@ 2009-06-28 11:06   ` Michael S. Tsirkin
  2009-06-28 12:50     ` Gregory Haskins
  2009-06-28 11:48   ` Michael S. Tsirkin
  2 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 11:06 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, paulmck, davidel, rusty

On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> @@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>  	unsigned long flags = (unsigned long)key;
>  
>  	/*
> -	 * Assume we will be called with interrupts disabled
> +	 * Called with interrupts disabled
>  	 */
>  	if (flags & POLLIN)
> -		/*
> -		 * Defer the IRQ injection until later since we need to
> -		 * acquire the kvm->lock to do so.
> -		 */
> +		/* An event has been signaled, inject an interrupt */
>  		schedule_work(&irqfd->inject);
>  
>  	if (flags & POLLHUP) {
> -		/*
> -		 * for now, just remove ourselves from the list and let
> -		 * the rest dangle.  We will fix this up later once
> -		 * the races in eventfd are fixed
> -		 */
> +		/* The eventfd is closing, detach from KVM */
> +		struct kvm *kvm = irqfd->kvm;
> +		unsigned long flags;
> +
>  		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
> -		irqfd->wqh = NULL;
> +
> +		spin_lock_irqsave(&kvm->irqfds.lock, flags);
> +
> +		if (irqfd->active) {
> +			/*
> +			 * If the item is still active we can be sure that
> +			 * no-one else is trying to shutdown this object at
> +			 * the same time.
> +			 *
> +			 * Defer the shutdown to a thread so we can flush
> +			 * all remaining inject jobs.  We use a slow-work
> +			 * item to prevent a deadlock against the work-queue
> +			 */
> +			irqfd_deactivate(irqfd);
> +			slow_work_enqueue(&irqfd->shutdown);

Greg, in your patch for slow-work module removal, you write:
  "Callers must ensure that their module has at least
  one reference held while the work is enqueued."
  Where does this guarantee come from, in this case?

-- 
MST

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-25 13:28 ` [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
  2009-06-26 14:05   ` Gregory Haskins
  2009-06-28 11:06   ` Michael S. Tsirkin
@ 2009-06-28 11:48   ` Michael S. Tsirkin
  2009-06-28 12:53     ` Gregory Haskins
  2 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 11:48 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, avi

On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> "release" callback.  This lets eventfd clients know if the eventfd is about
> to go away and is very useful particularly for in-kernel clients.  However,
> until recently it was not possible to use this feature of eventfd in a
> race-free way.  This patch utilizes a new eventfd interface to rectify
> the problem.
> 
> Note that one final race is known to exist: the slow-work thread may race
> with module removal.  We are currently working with slow-work upstream
> to fix this issue as well.  Since the code prior to this patch also
> races with module_put(), we are not making anything worse, but rather
> shifting the cause of the race.  Once the slow-work code is patched we
> will be fixing the last remaining issue.

By the way, why are we using slow-work here? Wouldn't a regular
workqueue do just as well, with less code, and avoid the race?

-- 
MST

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature
  2009-06-28 10:46   ` Michael S. Tsirkin
@ 2009-06-28 12:39     ` Gregory Haskins
  0 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-28 12:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, avi, paulmck, davidel, rusty

[-- Attachment #1: Type: text/plain, Size: 4071 bytes --]

Michael S. Tsirkin wrote:
> On Thu, Jun 25, 2009 at 09:28:32AM -0400, Gregory Haskins wrote:
>   
>> DEASSIGN allows us to optionally disassociate an IRQFD from its underlying
>> eventfd without destroying the eventfd in the process.  This is useful
>> for conditions like live-migration which may have an eventfd associated
>> with a device and an IRQFD.  We need to be able to decouple the guest
>> from the event source while not perturbing the event source itself.
>>
>> Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>> CC: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>
>>  include/linux/kvm.h |    2 ++
>>  virt/kvm/eventfd.c  |   56 +++++++++++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 56 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
>> index 38ff31e..6710518 100644
>> --- a/include/linux/kvm.h
>> +++ b/include/linux/kvm.h
>> @@ -490,6 +490,8 @@ struct kvm_x86_mce {
>>  };
>>  #endif
>>  
>> +#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
>> +
>>  struct kvm_irqfd {
>>  	__u32 fd;
>>  	__u32 gsi;
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index ca21e8a..2d4549c 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -180,8 +180,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>>  	add_wait_queue(wqh, &irqfd->wait);
>>  }
>>  
>> -int
>> -kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> +static int
>> +kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
>>  {
>>  	struct _irqfd *irqfd;
>>  	struct file *file = NULL;
>> @@ -303,6 +303,58 @@ irqfd_shutdown(struct _irqfd *irqfd)
>>  	irqfd_release(irqfd);
>>  }
>>  
>> +/*
>> + * assumes kvm->irqfds.lock is held
>> + */
>> +static struct _irqfd *
>> +irqfd_find(struct kvm *kvm, int fd, int gsi)
>> +{
>> +	struct _irqfd *irqfd, *tmp, *ret = ERR_PTR(-ENOENT);
>> +	struct eventfd_ctx *eventfd;
>> +
>> +	eventfd = eventfd_ctx_fdget(fd);
>> +	if (IS_ERR(eventfd))
>> +		return ERR_PTR(PTR_ERR(eventfd));
>> +
>> +	spin_lock_irq(&kvm->irqfds.lock);
>> +
>> +	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds.items, list) {
>> +		if (irqfd->eventfd == eventfd && irqfd->gsi == gsi) {
>> +			irqfd_deactivate(irqfd);
>> +			ret = irqfd;
>> +			break;
>> +		}
>> +	}
>> +
>> +	spin_unlock_irq(&kvm->irqfds.lock);
>> +	eventfd_ctx_put(eventfd);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
>> +{
>> +	struct _irqfd *irqfd;
>> +
>> +	irqfd = irqfd_find(kvm, fd, gsi);
>> +	if (IS_ERR(irqfd))
>> +		return PTR_ERR(irqfd);
>> +
>> +	irqfd_shutdown(irqfd);
>> +
>> +	return 0;
>> +}
>>     
>
>
> I think that, to make this work properly, you must
> add the irqfd to the list as the last thing you do.
> As it is, when you assign an irqfd, the last thing you do is
>
>         irqfd->eventfd = eventfd;
>   

Yeah, I agree.  I actually already replied to this effect on the thread
for 3/4. ;)

> I think you should move this to within a spinlock.
>   

I think if I fix the ordering, the list spinlock should be sufficient. 
Am I missing something?

>   
>> +
>> +int
>> +kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>> +{
>> +	if (flags & KVM_IRQFD_FLAG_DEASSIGN)
>> +		return kvm_irqfd_deassign(kvm, fd, gsi);
>> +
>> +	return kvm_irqfd_assign(kvm, fd, gsi);
>> +}
>> +
>>     
>
>
> At some point we discussed limiting the number of
> irqfds that can be created in some way, so that userspace
> cannot consume an unlimited amount of kernel memory.
>
> What happened to that idea?
>   

Yeah, that is a good question.  I thought I had already done that, too,
but now I don't know what happened to the logic.  Perhaps it got lost on
a respin somewhere.  I will look into this and add the feature.

> This will happen naturally if
> - we keep fget on the file until irqfd goes away
> - we allow the same file to be bound to only one irqfd
> but there might be other ways to do this
>
>   



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 11:06   ` Michael S. Tsirkin
@ 2009-06-28 12:50     ` Gregory Haskins
  2009-06-28 13:18       ` Michael S. Tsirkin
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2009-06-28 12:50 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, avi, paulmck, davidel, rusty

[-- Attachment #1: Type: text/plain, Size: 2393 bytes --]

Michael S. Tsirkin wrote:
> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
>   
>> @@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>  	unsigned long flags = (unsigned long)key;
>>  
>>  	/*
>> -	 * Assume we will be called with interrupts disabled
>> +	 * Called with interrupts disabled
>>  	 */
>>  	if (flags & POLLIN)
>> -		/*
>> -		 * Defer the IRQ injection until later since we need to
>> -		 * acquire the kvm->lock to do so.
>> -		 */
>> +		/* An event has been signaled, inject an interrupt */
>>  		schedule_work(&irqfd->inject);
>>  
>>  	if (flags & POLLHUP) {
>> -		/*
>> -		 * for now, just remove ourselves from the list and let
>> -		 * the rest dangle.  We will fix this up later once
>> -		 * the races in eventfd are fixed
>> -		 */
>> +		/* The eventfd is closing, detach from KVM */
>> +		struct kvm *kvm = irqfd->kvm;
>> +		unsigned long flags;
>> +
>>  		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> -		irqfd->wqh = NULL;
>> +
>> +		spin_lock_irqsave(&kvm->irqfds.lock, flags);
>> +
>> +		if (irqfd->active) {
>> +			/*
>> +			 * If the item is still active we can be sure that
>> +			 * no-one else is trying to shutdown this object at
>> +			 * the same time.
>> +			 *
>> +			 * Defer the shutdown to a thread so we can flush
>> +			 * all remaining inject jobs.  We use a slow-work
>> +			 * item to prevent a deadlock against the work-queue
>> +			 */
>> +			irqfd_deactivate(irqfd);
>> +			slow_work_enqueue(&irqfd->shutdown);
>>     
>
> Greg, in your patch for slow-work module removal, you write:
>   "Callers must ensure that their module has at least
>   one reference held while the work is enqueued."
>   Where does this guarantee come from, in this case?
>   
The general guarantee comes from the fact that modules naturally have to
have a reference to be able to call the enqueue function to begin with,
or the calling function was already racy.  In this particular case, we
can guarantee that the kvm vm fd is held while our slow-work is active,
and all slow work is flushed before it is released.  (I guess I am
assuming that the VFS takes a module reference when an fd is opened, but I
have not verified that it actually does.  If it doesn't, I suppose KVM
is already racy w.r.t. unloading, independent of my patches.)
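(For what it's worth, the VFS open path does appear to take such a
reference through the fops owner field; paraphrasing the fops_get()
helper from include/linux/fs.h, and note that it pins nothing if .owner
is left unset:)

	/* paraphrase of fops_get(): opening a file takes a module reference
	 * on file_operations::owner, keeping the module loaded while fds
	 * remain open */
	#define fops_get(fops) \
		(((fops) && try_module_get((fops)->owner)) ? (fops) : NULL)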

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 11:48   ` Michael S. Tsirkin
@ 2009-06-28 12:53     ` Gregory Haskins
  2009-06-28 12:56       ` Michael S. Tsirkin
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2009-06-28 12:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, avi

[-- Attachment #1: Type: text/plain, Size: 1253 bytes --]

Michael S. Tsirkin wrote:
> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
>   
>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>> "release" callback.  This lets eventfd clients know if the eventfd is about
>> to go away and is very useful particularly for in-kernel clients.  However,
>> until recently it is not possible to use this feature of eventfd in a
>> race-free way.  This patch utilizes a new eventfd interface to rectify
>> the problem.
>>
>> Note that one final race is known to exist: the slow-work thread may race
>> with module removal.  We are currently working with slow-work upstream
>> to fix this issue as well.  Since the code prior to this patch also
>> races with module_put(), we are not making anything worse, but rather
>> shifting the cause of the race.  Once the slow-work code is patched we
>> will be fixing the last remaining issue.
>>     
>
> By the way, why are we using slow-work here? Wouldn't a regular
> workqueue do just as well, with less code, and avoid the race?
>
>   
I believe it will cause a problem if you do a "flush_work()" from inside
a work-item.  I could be wrong, of course, but it looks like a recipe for
deadlock.
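The scenario would be something like this (illustrative only; it assumes
the shutdown were queued on the same shared workqueue as the inject):

	/* illustrative deadlock: a work item running on the shared worker
	 * thread flushes another item queued behind it on that same thread */
	static void irqfd_shutdown_fn(struct work_struct *work)
	{
		struct _irqfd *irqfd = container_of(work, struct _irqfd,
						    shutdown);

		flush_work(&irqfd->inject);	/* may wait for a job that can
						 * only run after we return */
	}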

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 12:53     ` Gregory Haskins
@ 2009-06-28 12:56       ` Michael S. Tsirkin
  2009-06-28 12:57         ` Michael S. Tsirkin
  2009-06-28 16:25         ` Gregory Haskins
  0 siblings, 2 replies; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 12:56 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, avi

On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> >   
> >> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> >> "release" callback.  This lets eventfd clients know if the eventfd is about
> >> to go away and is very useful particularly for in-kernel clients.  However,
> >> until recently it is not possible to use this feature of eventfd in a
> >> race-free way.  This patch utilizes a new eventfd interface to rectify
> >> the problem.
> >>
> >> Note that one final race is known to exist: the slow-work thread may race
> >> with module removal.  We are currently working with slow-work upstream
> >> to fix this issue as well.  Since the code prior to this patch also
> >> races with module_put(), we are not making anything worse, but rather
> >> shifting the cause of the race.  Once the slow-work code is patched we
> >> will be fixing the last remaining issue.
> >>     
> >
> > By the way, why are we using slow-work here? Wouldn't a regular
> > workqueue do just as well, with less code, and avoid the race?
> >
> >   
> I believe it will cause a problem if you do a "flush_work()" from inside
> a work-item.  I could be wrong, of course, but it looks like a recipe to
> deadlock.
> 
> -Greg
> 

Sure, but the idea is to only flush on kvm close, never from a work item.

-- 
MST

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 12:56       ` Michael S. Tsirkin
@ 2009-06-28 12:57         ` Michael S. Tsirkin
  2009-06-28 13:20           ` Michael S. Tsirkin
  2009-06-28 16:25         ` Gregory Haskins
  1 sibling, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 12:57 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, avi

On Sun, Jun 28, 2009 at 03:56:12PM +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
> > Michael S. Tsirkin wrote:
> > > On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> > >   
> > >> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> > >> "release" callback.  This lets eventfd clients know if the eventfd is about
> > >> to go away and is very useful particularly for in-kernel clients.  However,
> > >> until recently it is not possible to use this feature of eventfd in a
> > >> race-free way.  This patch utilizes a new eventfd interface to rectify
> > >> the problem.
> > >>
> > >> Note that one final race is known to exist: the slow-work thread may race
> > >> with module removal.  We are currently working with slow-work upstream
> > >> to fix this issue as well.  Since the code prior to this patch also
> > >> races with module_put(), we are not making anything worse, but rather
> > >> shifting the cause of the race.  Once the slow-work code is patched we
> > >> will be fixing the last remaining issue.
> > >>     
> > >
> > > By the way, why are we using slow-work here? Wouldn't a regular
> > > workqueue do just as well, with less code, and avoid the race?
> > >
> > >   
> > I believe it will cause a problem if you do a "flush_work()" from inside
> > a work-item.  I could be wrong, of course, but it looks like a recipe to
> > deadlock.
> > 
> > -Greg
> > 
> 
> Sure, but the idea is to only flush on kvm close, never from work item.

To clarify, you don't flush slow-work items from a work-item,
so you shouldn't need to flush the workqueue either.

-- 
MST

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements
  2009-06-28 11:03   ` Avi Kivity
@ 2009-06-28 12:59     ` Gregory Haskins
  2009-06-28 13:40       ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2009-06-28 12:59 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm, linux-kernel, mst, paulmck, davidel, rusty

[-- Attachment #1: Type: text/plain, Size: 3528 bytes --]

Avi Kivity wrote:
> On 06/25/2009 04:59 PM, Gregory Haskins wrote:
>> Gregory Haskins wrote:
>>   
>>> (Applies to kvm.git/master:4631e094)
>>>
>>> The following is the latest attempt to fix the races in
>>> irqfd/eventfd, as
>>> well as restore DEASSIGN support.  For more details, please read the
>>> patch
>>> headers.
>>>
>>> This series has been tested against the kvm-eventfd unit test, and
>>> appears to be functioning properly.  You can download this test here:
>>>
>>> ftp://ftp.novell.com/dev/ghaskins/kvm-eventfd.tar.bz2
>>>
>>> I've included version 4 of Davide's eventfd patch (ported to
>>> kvm.git) so
>>> that its a complete reviewable series.  Note, however, that there
>>> may be
>>> later versions of his patch to consider for merging, so we should
>>> coordinate with him.
>>>
>>>      
>>
>> So I know we talked yesterday in the review session about whether it was
>> actually worth all this complexity to deal with the POLLHUP or if we
>> should just revert to the prior "two syscall" model and be done with
>> it.  Rusty reflected these same sentiments this morning in response to
>> Davide's patch in a different thread.
>>
>> I am a bit torn myself, tbh.  I do feel as though I have a good handle
>> on the issue and that it is indeed now fixed (at least, if this series
>> is applied and the slow-work issue is fixed, still pending upstream
>> ACK).  I have a lot invested in going the POLLHUP direction having spent
>> so much time thinking about the problem and working on the patches, so I
>> a bit of a biased opinion, I know.
>>
>> The reason why I am pushing this series out now is at least partly so we
>> can tie up these loose ends.  We have both solutions in front of us and
>> can make a decision either way.  At least the solution is formally
>> documented in the internet archives forever this way ;)
>>
>> I took the review comments to heart that the shutdown code was
>> substantially larger and more complex than the actual fast-path code.  I
>> went though last night and simplified and clarified it.  I think the
>> latest result is leaner and clearer, so please give it another review
>> (particularly for races) before dismissing it.
>>    
>
> Yes, it's much nicer.  I can't say I'm certain it's race free but it's
> a lot more digestible
>
>> Ultimately, I think the concept of a release notification for eventfd is
>> a good thing for all eventfd users, so I don't think this thing should
>> go away per se even if irqfd decides to not use it.
>>    
>
> I agree that we want POLLHUP support, it's better than holding on to
> the eventfd.  But I think we can make it even cleaner by merging it
> with deassign.  Basically, when we get POLLHUP, we launch a slow_work
> (or something) that does a regular deassign.  That slow_work can grab
> a ref to the vm, so we don't race with the VM disappearing.
>
> But given that the current slow_work does almost nothing, I'm not sure
> it's worth it.

Yeah, and also note that the algorithm to unhook each side is not quite
symmetrical.  I think I've captured all the common parts (in things like
irqfd_deactivate(), etc.).  A minor change in kvm_irqfd_release() could
technically use a deferred job to release instead of doing it inline,
but I do not think it buys us very much to do so (as you pointed out,
the deferred part is actually fairly simple).  The important parts of the
protocol lie outside of the work we can do in the work-item anyway.


-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 12:50     ` Gregory Haskins
@ 2009-06-28 13:18       ` Michael S. Tsirkin
  2009-06-28 13:25         ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 13:18 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, paulmck, davidel, rusty

On Sun, Jun 28, 2009 at 08:50:28AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> >   
> >> @@ -65,25 +134,39 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >>  	unsigned long flags = (unsigned long)key;
> >>  
> >>  	/*
> >> -	 * Assume we will be called with interrupts disabled
> >> +	 * Called with interrupts disabled
> >>  	 */
> >>  	if (flags & POLLIN)
> >> -		/*
> >> -		 * Defer the IRQ injection until later since we need to
> >> -		 * acquire the kvm->lock to do so.
> >> -		 */
> >> +		/* An event has been signaled, inject an interrupt */
> >>  		schedule_work(&irqfd->inject);
> >>  
> >>  	if (flags & POLLHUP) {
> >> -		/*
> >> -		 * for now, just remove ourselves from the list and let
> >> -		 * the rest dangle.  We will fix this up later once
> >> -		 * the races in eventfd are fixed
> >> -		 */
> >> +		/* The eventfd is closing, detach from KVM */
> >> +		struct kvm *kvm = irqfd->kvm;
> >> +		unsigned long flags;
> >> +
> >>  		__remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> -		irqfd->wqh = NULL;
> >> +
> >> +		spin_lock_irqsave(&kvm->irqfds.lock, flags);
> >> +
> >> +		if (irqfd->active) {
> >> +			/*
> >> +			 * If the item is still active we can be sure that
> >> +			 * no-one else is trying to shutdown this object at
> >> +			 * the same time.
> >> +			 *
> >> +			 * Defer the shutdown to a thread so we can flush
> >> +			 * all remaining inject jobs.  We use a slow-work
> >> +			 * item to prevent a deadlock against the work-queue
> >> +			 */
> >> +			irqfd_deactivate(irqfd);
> >> +			slow_work_enqueue(&irqfd->shutdown);
> >>     
> >
> > Greg, in your patch for slow-work module removal, you write:
> >   "Callers must ensure that their module has at least
> >   one reference held while the work is enqueued."
> >   Where does this guarantee come from, in this case?
> >   
> The general guarantee comes from the fact that modules naturally have to
> have a reference to be able to call the enqueue function to begin with,
> or the calling function was already racy.  In this particular case, we
> can guarantee that the kvm vm fd is held while our slow-work is active,
> and all slow work is flushed before it is released.  (I guess I am
> assuming that VFS takes a module reference when an fd is opened, but I
> have not verified that it actually does.  If it doesn't, I suppose KVM
> is already racy w.r.t. unloading, independent of my patches)
> 
> -Greg
> 

That could be the case, as we have, for example:

static struct file_operations kvm_vm_fops = {
        .release        = kvm_vm_release,
        .unlocked_ioctl = kvm_vm_ioctl,
        .compat_ioctl   = kvm_vm_ioctl,
        .mmap           = kvm_vm_mmap,
};

with no owner field.

Avi, shouldn't we initialize the owner field to prevent the
kvm module from going away while files are open?

-- 
MST

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 12:57         ` Michael S. Tsirkin
@ 2009-06-28 13:20           ` Michael S. Tsirkin
  2009-06-28 16:28             ` Gregory Haskins
  0 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 13:20 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, avi

On Sun, Jun 28, 2009 at 03:57:30PM +0300, Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2009 at 03:56:12PM +0300, Michael S. Tsirkin wrote:
> > On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
> > > Michael S. Tsirkin wrote:
> > > > On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> > > >   
> > > >> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> > > >> "release" callback.  This lets eventfd clients know if the eventfd is about
> > > >> to go away and is very useful particularly for in-kernel clients.  However,
> > > >> until recently it is not possible to use this feature of eventfd in a
> > > >> race-free way.  This patch utilizes a new eventfd interface to rectify
> > > >> the problem.
> > > >>
> > > >> Note that one final race is known to exist: the slow-work thread may race
> > > >> with module removal.  We are currently working with slow-work upstream
> > > >> to fix this issue as well.  Since the code prior to this patch also
> > > >> races with module_put(), we are not making anything worse, but rather
> > > >> shifting the cause of the race.  Once the slow-work code is patched we
> > > >> will be fixing the last remaining issue.
> > > >>     
> > > >
> > > > By the way, why are we using slow-work here? Wouldn't a regular
> > > > workqueue do just as well, with less code, and avoid the race?
> > > >
> > > >   
> > > I believe it will cause a problem if you do a "flush_work()" from inside
> > > a work-item.  I could be wrong, of course, but it looks like a recipe to
> > > deadlock.
> > > 
> > > -Greg
> > > 
> > 
> > Sure, but the idea is to only flush on kvm close, never from work item.
> 
> To clarify, you don't flush slow works from a work-item,
> so you shouldn't need to flush workqueue either.

I guess my question is - why is slow work different? It's still
a thread pool underneath ...

-- 
MST

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 13:18       ` Michael S. Tsirkin
@ 2009-06-28 13:25         ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2009-06-28 13:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, kvm, linux-kernel, paulmck, davidel, rusty

On 06/28/2009 04:18 PM, Michael S. Tsirkin wrote:
>
> that could be the case, as we have, for example:
>
> static struct file_operations kvm_vm_fops = {
>          .release        = kvm_vm_release,
>          .unlocked_ioctl = kvm_vm_ioctl,
>          .compat_ioctl   = kvm_vm_ioctl,
>          .mmap           = kvm_vm_mmap,
> };
>
> with no owner field.
>
> Avi, shouldn't we initialize the owner field to prevent
> kvm module from going away while files are open?
>    

We do initialize it:

kvm_chardev_ops.owner = module;
kvm_vm_fops.owner = module;
kvm_vcpu_fops.owner = module;

The reason it's not done through the initializer is that we set the 
owner to the vendor module (e.g. kvm-intel.ko) so that you can't remove 
the vendor module when a guest is running.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 0/4] irqfd fixes and enhancements
  2009-06-28 12:59     ` Gregory Haskins
@ 2009-06-28 13:40       ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2009-06-28 13:40 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, mst, paulmck, davidel, rusty

On 06/28/2009 03:59 PM, Gregory Haskins wrote:
>> I agree that we want POLLHUP support, it's better than holding on to
>> the eventfd.  But I think we can make it even cleaner by merging it
>> with deassign.  Basically, when we get POLLHUP, we launch a slow_work
>> (or something) that does a regular deassign.  That slow_work can grab
>> a ref to the vm, so we don't race with the VM disappearing.
>>
>> But given that the current slow_work does almost nothing, I'm not sure
>> it's worth it.
>>      
>
> Yeah, and also note that the algorithm to unhook each side is not quite
> symmetrical.  I think I've captured all the common parts (in things like
> irqfd_deactivate(), etc).  A minor change in kvm_irqfd_release() could
> technically use a deferred job to release instead of doing it inline,
> but I do not think it buys us very much to do so (as you pointed out,
> the defered part is actually fairly simple).  The important parts of the
> protocol lie outside of the work we can do in the work-item anyway.
>    

Is the case of deassign vs POLLHUP covered?

Reusing deassign in POLLHUP at least makes it easy to verify that it is.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 12:56       ` Michael S. Tsirkin
  2009-06-28 12:57         ` Michael S. Tsirkin
@ 2009-06-28 16:25         ` Gregory Haskins
  1 sibling, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-28 16:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, avi

[-- Attachment #1: Type: text/plain, Size: 1692 bytes --]

Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>>>> "release" callback.  This lets eventfd clients know if the eventfd is about
>>>> to go away and is very useful particularly for in-kernel clients.  However,
>>>> until recently it is not possible to use this feature of eventfd in a
>>>> race-free way.  This patch utilizes a new eventfd interface to rectify
>>>> the problem.
>>>>
>>>> Note that one final race is known to exist: the slow-work thread may race
>>>> with module removal.  We are currently working with slow-work upstream
>>>> to fix this issue as well.  Since the code prior to this patch also
>>>> races with module_put(), we are not making anything worse, but rather
>>>> shifting the cause of the race.  Once the slow-work code is patched we
>>>> will be fixing the last remaining issue.
>>>>     
>>>>         
>>> By the way, why are we using slow-work here? Wouldn't a regular
>>> workqueue do just as well, with less code, and avoid the race?
>>>
>>>   
>>>       
>> I believe it will cause a problem if you do a "flush_work()" from inside
>> a work-item.  I could be wrong, of course, but it looks like a recipe to
>> deadlock.
>>
>> -Greg
>>
>>     
>
> Sure, but the idea is to only flush on kvm close, never from work item.
>
>   
The point of the flush on the eventfd side is to make sure we
synchronize with outstanding injects before we free the irqfd.
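Concretely, the tail of the shutdown path would look roughly like this
(a fragment, with the surrounding function elided):

	/* wait for any still-queued injection work to finish before the
	 * irqfd structure is freed */
	flush_work(&irqfd->inject);
	kfree(irqfd);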

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 13:20           ` Michael S. Tsirkin
@ 2009-06-28 16:28             ` Gregory Haskins
  2009-06-28 19:07               ` Michael S. Tsirkin
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2009-06-28 16:28 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, avi

[-- Attachment #1: Type: text/plain, Size: 2377 bytes --]

Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2009 at 03:57:30PM +0300, Michael S. Tsirkin wrote:
>   
>> On Sun, Jun 28, 2009 at 03:56:12PM +0300, Michael S. Tsirkin wrote:
>>     
>>> On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
>>>       
>>>> Michael S. Tsirkin wrote:
>>>>         
>>>>> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
>>>>>   
>>>>>           
>>>>>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>>>>>> "release" callback.  This lets eventfd clients know if the eventfd is about
>>>>>> to go away and is very useful particularly for in-kernel clients.  However,
>>>>>> until recently it is not possible to use this feature of eventfd in a
>>>>>> race-free way.  This patch utilizes a new eventfd interface to rectify
>>>>>> the problem.
>>>>>>
>>>>>> Note that one final race is known to exist: the slow-work thread may race
>>>>>> with module removal.  We are currently working with slow-work upstream
>>>>>> to fix this issue as well.  Since the code prior to this patch also
>>>>>> races with module_put(), we are not making anything worse, but rather
>>>>>> shifting the cause of the race.  Once the slow-work code is patched we
>>>>>> will be fixing the last remaining issue.
>>>>>>     
>>>>>>             
>>>>> By the way, why are we using slow-work here? Wouldn't a regular
>>>>> workqueue do just as well, with less code, and avoid the race?
>>>>>
>>>>>   
>>>>>           
>>>> I believe it will cause a problem if you do a "flush_work()" from inside
>>>> a work-item.  I could be wrong, of course, but it looks like a recipe to
>>>> deadlock.
>>>>
>>>> -Greg
>>>>
>>>>         
>>> Sure, but the idea is to only flush on kvm close, never from work item.
>>>       
>> To clarify, you don't flush slow works from a work-item,
>> so you shouldn't need to flush workqueue either.
>>     
>
> I guess my question is - why is slow work different? It's still
> a thread pool underneath ...
>
>   
It's not interdependent.  Flush-work blocks the thread; if the thread
happens to be the work-queue thread, you may deadlock, preventing it from
processing further jobs like the inject.  In reality it shouldn't be
possible, but it's just a bad idea to assume it's ok.  Slow work, on the
other hand, will just make a new thread.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 16:28             ` Gregory Haskins
@ 2009-06-28 19:07               ` Michael S. Tsirkin
  2009-06-28 19:54                 ` Gregory Haskins
  0 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 19:07 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, avi

On Sun, Jun 28, 2009 at 12:28:19PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Sun, Jun 28, 2009 at 03:57:30PM +0300, Michael S. Tsirkin wrote:
> >   
> >> On Sun, Jun 28, 2009 at 03:56:12PM +0300, Michael S. Tsirkin wrote:
> >>     
> >>> On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
> >>>       
> >>>> Michael S. Tsirkin wrote:
> >>>>         
> >>>>> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> >>>>>   
> >>>>>           
> >>>>>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> >>>>>> "release" callback.  This lets eventfd clients know if the eventfd is about
> >>>>>> to go away and is very useful particularly for in-kernel clients.  However,
> >>>>>> until recently it is not possible to use this feature of eventfd in a
> >>>>>> race-free way.  This patch utilizes a new eventfd interface to rectify
> >>>>>> the problem.
> >>>>>>
> >>>>>> Note that one final race is known to exist: the slow-work thread may race
> >>>>>> with module removal.  We are currently working with slow-work upstream
> >>>>>> to fix this issue as well.  Since the code prior to this patch also
> >>>>>> races with module_put(), we are not making anything worse, but rather
> >>>>>> shifting the cause of the race.  Once the slow-work code is patched we
> >>>>>> will be fixing the last remaining issue.
> >>>>>>     
> >>>>>>             
> >>>>> By the way, why are we using slow-work here? Wouldn't a regular
> >>>>> workqueue do just as well, with less code, and avoid the race?
> >>>>>
> >>>>>   
> >>>>>           
> >>>> I believe it will cause a problem if you do a "flush_work()" from inside
> >>>> a work-item.  I could be wrong, of course, but it looks like a recipe to
> >>>> deadlock.
> >>>>
> >>>> -Greg
> >>>>
> >>>>         
> >>> Sure, but the idea is to only flush on kvm close, never from work item.
> >>>       
> >> To clarify, you don't flush slow works from a work-item,
> >> so you shouldn't need to flush workqueue either.
> >>     
> >
> > I guess my question is - why is slow work different? It's still
> > a thread pool underneath ...
> >
> >   
> Its not interdependent.  Flush-work blocks the thread..if the thread
> happens to be the work-queue thread you may deadlock preventing it from
> processing further jobs like the inject.  In reality it shouldnt be
> possible, but its just a bad idea to assume its ok.
> Slow work, on the
> other hand, will just make a new thread.
> 
> -Greg
> 

But if you create your own workqueue, and all you do there is destroy
irqfds, things are OK, I think.  Right?




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 19:07               ` Michael S. Tsirkin
@ 2009-06-28 19:54                 ` Gregory Haskins
  2009-06-28 20:07                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 28+ messages in thread
From: Gregory Haskins @ 2009-06-28 19:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, avi

[-- Attachment #1: Type: text/plain, Size: 3168 bytes --]

Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2009 at 12:28:19PM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Sun, Jun 28, 2009 at 03:57:30PM +0300, Michael S. Tsirkin wrote:
>>>   
>>>       
>>>> On Sun, Jun 28, 2009 at 03:56:12PM +0300, Michael S. Tsirkin wrote:
>>>>     
>>>>         
>>>>> On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
>>>>>       
>>>>>           
>>>>>> Michael S. Tsirkin wrote:
>>>>>>         
>>>>>>             
>>>>>>> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
>>>>>>>   
>>>>>>>           
>>>>>>>               
>>>>>>>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>>>>>>>> "release" callback.  This lets eventfd clients know if the eventfd is about
>>>>>>>> to go away and is very useful particularly for in-kernel clients.  However,
>>>>>>>> until recently it is not possible to use this feature of eventfd in a
>>>>>>>> race-free way.  This patch utilizes a new eventfd interface to rectify
>>>>>>>> the problem.
>>>>>>>>
>>>>>>>> Note that one final race is known to exist: the slow-work thread may race
>>>>>>>> with module removal.  We are currently working with slow-work upstream
>>>>>>>> to fix this issue as well.  Since the code prior to this patch also
>>>>>>>> races with module_put(), we are not making anything worse, but rather
>>>>>>>> shifting the cause of the race.  Once the slow-work code is patched we
>>>>>>>> will be fixing the last remaining issue.
>>>>>>>>     
>>>>>>>>             
>>>>>>>>                 
>>>>>>> By the way, why are we using slow-work here? Wouldn't a regular
>>>>>>> workqueue do just as well, with less code, and avoid the race?
>>>>>>>
>>>>>>>   
>>>>>>>           
>>>>>>>               
>>>>>> I believe it will cause a problem if you do a "flush_work()" from inside
>>>>>> a work-item.  I could be wrong, of course, but it looks like a recipe to
>>>>>> deadlock.
>>>>>>
>>>>>> -Greg
>>>>>>
>>>>>>         
>>>>>>             
>>>>> Sure, but the idea is to only flush on kvm close, never from work item.
>>>>>       
>>>>>           
>>>> To clarify, you don't flush slow works from a work-item,
>>>> so you shouldn't need to flush workqueue either.
>>>>     
>>>>         
>>> I guess my question is - why is slow work different? It's still
>>> a thread pool underneath ...
>>>
>>>   
>>>       
>> Its not interdependent.  Flush-work blocks the thread..if the thread
>> happens to be the work-queue thread you may deadlock preventing it from
>> processing further jobs like the inject.  In reality it shouldnt be
>> possible, but its just a bad idea to assume its ok.
>> Slow work, on the
>> other hand, will just make a new thread.
>>
>> -Greg
>>
>>     
>
> But if you create your own workqueue, and all you do there is destroy
> irqfds, things are ok I think. Right?
>   

Yep, creating your own queue works too.  I picked slow-work as an
alternative to generating a dedicated resource, but I agree either method
would work fine.  Do you have a preference?

Regards,
-Greg
>
>
>   



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 19:54                 ` Gregory Haskins
@ 2009-06-28 20:07                   ` Michael S. Tsirkin
  2009-06-28 20:17                     ` Gregory Haskins
  0 siblings, 1 reply; 28+ messages in thread
From: Michael S. Tsirkin @ 2009-06-28 20:07 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, avi

On Sun, Jun 28, 2009 at 03:54:42PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Sun, Jun 28, 2009 at 12:28:19PM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >>     
> >>> On Sun, Jun 28, 2009 at 03:57:30PM +0300, Michael S. Tsirkin wrote:
> >>>   
> >>>       
> >>>> On Sun, Jun 28, 2009 at 03:56:12PM +0300, Michael S. Tsirkin wrote:
> >>>>     
> >>>>         
> >>>>> On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
> >>>>>       
> >>>>>           
> >>>>>> Michael S. Tsirkin wrote:
> >>>>>>         
> >>>>>>             
> >>>>>>> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
> >>>>>>>   
> >>>>>>>           
> >>>>>>>               
> >>>>>>>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> >>>>>>>> "release" callback.  This lets eventfd clients know if the eventfd is about
> >>>>>>>> to go away and is very useful particularly for in-kernel clients.  However,
> >>>>>>>> until recently it is not possible to use this feature of eventfd in a
> >>>>>>>> race-free way.  This patch utilizes a new eventfd interface to rectify
> >>>>>>>> the problem.
> >>>>>>>>
> >>>>>>>> Note that one final race is known to exist: the slow-work thread may race
> >>>>>>>> with module removal.  We are currently working with slow-work upstream
> >>>>>>>> to fix this issue as well.  Since the code prior to this patch also
> >>>>>>>> races with module_put(), we are not making anything worse, but rather
> >>>>>>>> shifting the cause of the race.  Once the slow-work code is patched we
> >>>>>>>> will be fixing the last remaining issue.
> >>>>>>>>     
> >>>>>>>>             
> >>>>>>>>                 
> >>>>>>> By the way, why are we using slow-work here? Wouldn't a regular
> >>>>>>> workqueue do just as well, with less code, and avoid the race?
> >>>>>>>
> >>>>>>>   
> >>>>>>>           
> >>>>>>>               
> >>>>>> I believe it will cause a problem if you do a "flush_work()" from inside
> >>>>>> a work-item.  I could be wrong, of course, but it looks like a recipe to
> >>>>>> deadlock.
> >>>>>>
> >>>>>> -Greg
> >>>>>>
> >>>>>>         
> >>>>>>             
> >>>>> Sure, but the idea is to only flush on kvm close, never from work item.
> >>>>>       
> >>>>>           
> >>>> To clarify, you don't flush slow works from a work-item,
> >>>> so you shouldn't need to flush workqueue either.
> >>>>     
> >>>>         
> >>> I guess my question is - why is slow work different? It's still
> >>> a thread pool underneath ...
> >>>
> >>>   
> >>>       
> >> Its not interdependent.  Flush-work blocks the thread..if the thread
> >> happens to be the work-queue thread you may deadlock preventing it from
> >> processing further jobs like the inject.  In reality it shouldnt be
> >> possible, but its just a bad idea to assume its ok.
> >> Slow work, on the
> >> other hand, will just make a new thread.
> >>
> >> -Greg
> >>
> >>     
> >
> > But if you create your own workqueue, and all you do there is destroy
> > irqfds, things are ok I think. Right?
> >   
> 
> Yep, creating your own queue works too.  I picked slow-work as an
> alternate to generating a dedicated resource, but I agree either method
> would work fine.  Do you have a preference? 
> 
> Regards,
> -Greg

It's not something I lose sleep about, but I think a workqueue might be
less code: for example, you can just flush it instead of using your own
counter.  And possibly things can be further simplified by making the
workqueue single-threaded and always doing the deassign from there.
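As a sketch (the queue and its name are invented; the calls themselves
are the stock workqueue API):

	/* one private, single-threaded queue for all irqfd teardown */
	static struct workqueue_struct *irqfd_cleanup_wq;

	irqfd_cleanup_wq = create_singlethread_workqueue("kvm-irqfd-cleanup");

	/* POLLHUP and manual deassign would both funnel through it... */
	queue_work(irqfd_cleanup_wq, &irqfd->shutdown);

	/* ...and VM teardown just flushes it, with no hand-rolled counter */
	flush_workqueue(irqfd_cleanup_wq);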

-- 
MST

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface
  2009-06-28 20:07                   ` Michael S. Tsirkin
@ 2009-06-28 20:17                     ` Gregory Haskins
  0 siblings, 0 replies; 28+ messages in thread
From: Gregory Haskins @ 2009-06-28 20:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, avi

[-- Attachment #1: Type: text/plain, Size: 4323 bytes --]

Michael S. Tsirkin wrote:
> On Sun, Jun 28, 2009 at 03:54:42PM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Sun, Jun 28, 2009 at 12:28:19PM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> Michael S. Tsirkin wrote:
>>>>     
>>>>         
>>>>> On Sun, Jun 28, 2009 at 03:57:30PM +0300, Michael S. Tsirkin wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> On Sun, Jun 28, 2009 at 03:56:12PM +0300, Michael S. Tsirkin wrote:
>>>>>>     
>>>>>>         
>>>>>>             
>>>>>>> On Sun, Jun 28, 2009 at 08:53:22AM -0400, Gregory Haskins wrote:
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>>>> Michael S. Tsirkin wrote:
>>>>>>>>         
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>> On Thu, Jun 25, 2009 at 09:28:27AM -0400, Gregory Haskins wrote:
>>>>>>>>>   
>>>>>>>>>           
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>>>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>>>>>>>>>> "release" callback.  This lets eventfd clients know if the eventfd is about
>>>>>>>>>> to go away and is very useful particularly for in-kernel clients.  However,
>>>>>>>>>> until recently it is not possible to use this feature of eventfd in a
>>>>>>>>>> race-free way.  This patch utilizes a new eventfd interface to rectify
>>>>>>>>>> the problem.
>>>>>>>>>>
>>>>>>>>>> Note that one final race is known to exist: the slow-work thread may race
>>>>>>>>>> with module removal.  We are currently working with slow-work upstream
>>>>>>>>>> to fix this issue as well.  Since the code prior to this patch also
>>>>>>>>>> races with module_put(), we are not making anything worse, but rather
>>>>>>>>>> shifting the cause of the race.  Once the slow-work code is patched we
>>>>>>>>>> will be fixing the last remaining issue.
>>>>>>>>>>     
>>>>>>>>>>             
>>>>>>>>>>                 
>>>>>>>>>>                     
>>>>>>>>> By the way, why are we using slow-work here? Wouldn't a regular
>>>>>>>>> workqueue do just as well, with less code, and avoid the race?
>>>>>>>>>
>>>>>>>>>   
>>>>>>>>>           
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>> I believe it will cause a problem if you do a "flush_work()" from inside
>>>>>>>> a work-item.  I could be wrong, of course, but it looks like a recipe to
>>>>>>>> deadlock.
>>>>>>>>
>>>>>>>> -Greg
>>>>>>>>
>>>>>>>>         
>>>>>>>>             
>>>>>>>>                 
>>>>>>> Sure, but the idea is to only flush on kvm close, never from work item.
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> To clarify, you don't flush slow works from a work-item,
>>>>>> so you shouldn't need to flush workqueue either.
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> I guess my question is - why is slow work different? It's still
>>>>> a thread pool underneath ...
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>> Its not interdependent.  Flush-work blocks the thread..if the thread
>>>> happens to be the work-queue thread you may deadlock preventing it from
>>>> processing further jobs like the inject.  In reality it shouldnt be
>>>> possible, but its just a bad idea to assume its ok.
>>>> Slow work, on the
>>>> other hand, will just make a new thread.
>>>>
>>>> -Greg
>>>>
>>>>     
>>>>         
>>> But if you create your own workqueue, and all you do there is destroy
>>> irqfds, things are ok I think. Right?
>>>   
>>>       
>> Yep, creating your own queue works too.  I picked slow-work as an
>> alternate to generating a dedicated resource, but I agree either method
>> would work fine.  Do you have a preference? 
>>
>> Regards,
>> -Greg
>>     
>
> It's not something I lose sleep about, but I think workqueue might be
> less code: for example, you can just flush it instead of using your own
> counter. And possibly things can be further simplified by making the
> workqueue single-threaded and always doing deassign from there.
>
>   
Yeah, it's not a huge deal.  The logic is probably slightly simpler with
a dedicated single-thread queue for shutdown, at the expense of having
the work-queue hang around mostly idle.  I'll code it up both ways so we
can compare.

Thanks Michael,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2009-06-28 20:18 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-25 13:28 [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
2009-06-25 13:28 ` [KVM PATCH v5 1/4] kvm: prepare irqfd for having interrupts disabled during eventfd->release Gregory Haskins
2009-06-25 13:28 ` [KVM PATCH v5 2/4] eventfd - revised interface and cleanups (4th rev) Gregory Haskins
2009-06-25 13:28 ` [KVM PATCH v5 3/4] KVM: Fix races in irqfd using new eventfd_kref_get interface Gregory Haskins
2009-06-26 14:05   ` Gregory Haskins
2009-06-28 11:06   ` Michael S. Tsirkin
2009-06-28 12:50     ` Gregory Haskins
2009-06-28 13:18       ` Michael S. Tsirkin
2009-06-28 13:25         ` Avi Kivity
2009-06-28 11:48   ` Michael S. Tsirkin
2009-06-28 12:53     ` Gregory Haskins
2009-06-28 12:56       ` Michael S. Tsirkin
2009-06-28 12:57         ` Michael S. Tsirkin
2009-06-28 13:20           ` Michael S. Tsirkin
2009-06-28 16:28             ` Gregory Haskins
2009-06-28 19:07               ` Michael S. Tsirkin
2009-06-28 19:54                 ` Gregory Haskins
2009-06-28 20:07                   ` Michael S. Tsirkin
2009-06-28 20:17                     ` Gregory Haskins
2009-06-28 16:25         ` Gregory Haskins
2009-06-25 13:28 ` [KVM PATCH v5 4/4] KVM: add irqfd DEASSIGN feature Gregory Haskins
2009-06-28 10:46   ` Michael S. Tsirkin
2009-06-28 12:39     ` Gregory Haskins
2009-06-25 13:59 ` [KVM PATCH v5 0/4] irqfd fixes and enhancements Gregory Haskins
2009-06-25 16:44   ` Davide Libenzi
2009-06-28 11:03   ` Avi Kivity
2009-06-28 12:59     ` Gregory Haskins
2009-06-28 13:40       ` Avi Kivity
