All of lore.kernel.org
 help / color / mirror / Atom feed
* [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close()
@ 2009-06-04 12:48 Gregory Haskins
  2009-06-04 12:48 ` [KVM PATCH v2 1/2] Allow waiters to be notified about the eventfd file* going away, and give Gregory Haskins
                   ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Gregory Haskins @ 2009-06-04 12:48 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, avi, davdel, mst, paulmck, akpm

(Applies to kvm.git/master:25deed73)

Please see the header for 2/2 for a description.  This patch series has been
fully tested and appears to be working correctly.

[Review notes:
      *) Paul has looked at the SRCU design and, to my knowledge, didn't find
         any holes.
      *) Michael, Avi, and myself agree that while the removal of the DEASSIGN
         vector is not desirable, the fix on close() is more important in
         the short-term.  We can always add DEASSIGN support again in the
	 future with a CAP bit.
]

[Changelog:

      v2:
         *) Pulled in Davide's official patch for 1/2 from his submission
            accepted into -mmotm.
         *) Fixed patch 2/2 to use the "key" field as a bitmap in the wakeup
            logic, per Davide's feedback.

      v1:
         *) Initial release
]
  

---

Davide Libenzi (1):
      Allow waiters to be notified about the eventfd file* going away, and give

Gregory Haskins (1):
      kvm: use POLLHUP to close an irqfd instead of an explicit ioctl


 fs/eventfd.c        |   10 +++
 include/linux/kvm.h |    2 -
 virt/kvm/eventfd.c  |  177 +++++++++++++++++++++++----------------------------
 virt/kvm/kvm_main.c |    3 +
 4 files changed, 90 insertions(+), 102 deletions(-)

-- 
Signature

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [KVM PATCH v2 1/2] Allow waiters to be notified about the eventfd file* going away, and give
  2009-06-04 12:48 [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Gregory Haskins
@ 2009-06-04 12:48 ` Gregory Haskins
  2009-06-04 12:48 ` [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl Gregory Haskins
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Gregory Haskins @ 2009-06-04 12:48 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, avi, davdel, mst, paulmck, akpm

From: Davide Libenzi <davidel@xmailserver.org>

them a change to unregister from the wait queue.  This is turn allows
eventfd users to use the eventfd file* w/out holding a live reference to
it.

After the eventfd user callbacks returns, any usage of the eventfd file*
should be dropped.  The eventfd user callback can acquire sleepy locks
since it is invoked lockless.

This is a feature, needed by KVM to avoid an awkward workaround when using
eventdf.

[gmh: pulled from -mmotm for inclusion in kvm.git]

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Tested-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
---

 fs/eventfd.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 3f0e197..72f5f8d 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -61,7 +61,15 @@ EXPORT_SYMBOL_GPL(eventfd_signal);
 
 static int eventfd_release(struct inode *inode, struct file *file)
 {
-	kfree(file->private_data);
+	struct eventfd_ctx *ctx = file->private_data;
+
+	/*
+	 * No need to hold the lock here, since we are on the file cleanup
+	 * path and the ones still attached to the wait queue will be
+	 * serialized by wake_up_locked_poll().
+	 */
+	wake_up_locked_poll(&ctx->wqh, POLLHUP);
+	kfree(ctx);
 	return 0;
 }
 


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-04 12:48 [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Gregory Haskins
  2009-06-04 12:48 ` [KVM PATCH v2 1/2] Allow waiters to be notified about the eventfd file* going away, and give Gregory Haskins
@ 2009-06-04 12:48 ` Gregory Haskins
  2009-06-14 11:49   ` Michael S. Tsirkin
  2009-06-04 14:02 ` [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Avi Kivity
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2009-06-04 12:48 UTC (permalink / raw)
  To: kvm; +Cc: linux-kernel, avi, davdel, mst, paulmck, akpm

Assigning an irqfd object to a kvm object creates a relationship that we
currently manage by having the kvm oject acquire/hold a file* reference to
the underlying eventfd.  The lifetime of these objects is properly maintained
by decoupling the two objects whenever the irqfd is closed or kvm is closed,
whichever comes first.

However, the irqfd "close" method is less than ideal since it requires two
system calls to complete (one for ioctl(kvmfd, IRQFD_DEASSIGN), the other for
close(eventfd)).  This dual-call approach was utilized because there was no
notification mechanism on the eventfd side at the time irqfd was implemented.

Recently, Davide proposed a patch to send a POLLHUP wakeup whenever an
eventfd is about to close.  So we eliminate the IRQFD_DEASSIGN ioctl (*)
vector in favor of sensing the desassign automatically when the fd is closed.
The resulting code is slightly more complex as a result since we need to
allow either side to sever the relationship independently.  We utilize SRCU
to guarantee stable concurrent access to the KVM pointer without adding
additional atomic operations in the fast path.

At minimum, this design should be acked by both Davide and Paul (cc'd).

(*) The irqfd patch does not exist in any released tree, so the understanding
is that we can alter the irqfd specific ABI without taking the normal
precautions, such as CAP bits.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Michael S. Tsirkin <mst@redhat.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/kvm.h |    2 -
 virt/kvm/eventfd.c  |  177 +++++++++++++++++++++++----------------------------
 virt/kvm/kvm_main.c |    3 +
 3 files changed, 81 insertions(+), 101 deletions(-)

diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 632a856..29b62cc 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -482,8 +482,6 @@ struct kvm_x86_mce {
 };
 #endif
 
-#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
-
 struct kvm_irqfd {
 	__u32 fd;
 	__u32 gsi;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index f3f2ea1..004c660 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -37,39 +37,92 @@
  * --------------------------------------------------------------------
  */
 struct _irqfd {
+	struct mutex              lock;
+	struct srcu_struct        srcu;
 	struct kvm               *kvm;
 	int                       gsi;
-	struct file              *file;
 	struct list_head          list;
 	poll_table                pt;
 	wait_queue_head_t        *wqh;
 	wait_queue_t              wait;
-	struct work_struct        work;
+	struct work_struct        inject;
 };
 
 static void
 irqfd_inject(struct work_struct *work)
 {
-	struct _irqfd *irqfd = container_of(work, struct _irqfd, work);
-	struct kvm *kvm = irqfd->kvm;
+	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+	struct kvm *kvm;
+	int idx;
+
+	idx = srcu_read_lock(&irqfd->srcu);
+
+	kvm = rcu_dereference(irqfd->kvm);
+	if (kvm) {
+		mutex_lock(&kvm->lock);
+		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
+		kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+		mutex_unlock(&kvm->lock);
+	}
+
+	srcu_read_unlock(&irqfd->srcu, idx);
+}
+
+static void
+irqfd_disconnect(struct _irqfd *irqfd)
+{
+	struct kvm *kvm;
+
+	mutex_lock(&irqfd->lock);
+
+	kvm = rcu_dereference(irqfd->kvm);
+	rcu_assign_pointer(irqfd->kvm, NULL);
+
+	mutex_unlock(&irqfd->lock);
+
+	if (!kvm)
+		return;
 
 	mutex_lock(&kvm->lock);
-	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
-	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
+	list_del(&irqfd->list);
 	mutex_unlock(&kvm->lock);
+
+	/*
+	 * It is important to not drop the kvm reference until the next grace
+	 * period because there might be lockless references in flight up
+	 * until then
+	 */
+	synchronize_srcu(&irqfd->srcu);
+	kvm_put_kvm(kvm);
 }
 
 static int
 irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
 	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
+	unsigned long flags = (unsigned long)key;
 
-	/*
-	 * The wake_up is called with interrupts disabled.  Therefore we need
-	 * to defer the IRQ injection until later since we need to acquire the
-	 * kvm->lock to do so.
-	 */
-	schedule_work(&irqfd->work);
+	if (flags & POLLIN)
+		/*
+		 * The POLLIN wake_up is called with interrupts disabled.
+		 * Therefore we need to defer the IRQ injection until later
+		 * since we need to acquire the kvm->lock to do so.
+		 */
+		schedule_work(&irqfd->inject);
+
+	if (flags & POLLHUP) {
+		/*
+		 * The POLLHUP is called unlocked, so it theoretically should
+		 * be safe to remove ourselves from the wqh using the locked
+		 * variant of remove_wait_queue()
+		 */
+		remove_wait_queue(irqfd->wqh, &irqfd->wait);
+		flush_work(&irqfd->inject);
+		irqfd_disconnect(irqfd);
+
+		cleanup_srcu_struct(&irqfd->srcu);
+		kfree(irqfd);
+	}
 
 	return 0;
 }
@@ -84,8 +137,8 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 	add_wait_queue(wqh, &irqfd->wait);
 }
 
-static int
-kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
+int
+kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
 {
 	struct _irqfd *irqfd;
 	struct file *file = NULL;
@@ -95,10 +148,12 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
 	if (!irqfd)
 		return -ENOMEM;
 
+	mutex_init(&irqfd->lock);
+	init_srcu_struct(&irqfd->srcu);
 	irqfd->kvm = kvm;
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->work, irqfd_inject);
+	INIT_WORK(&irqfd->inject, irqfd_inject);
 
 	/*
 	 * Embed the file* lifetime in the irqfd.
@@ -120,12 +175,18 @@ kvm_assign_irqfd(struct kvm *kvm, int fd, int gsi)
 	if (ret < 0)
 		goto fail;
 
-	irqfd->file = file;
+	kvm_get_kvm(kvm);
 
 	mutex_lock(&kvm->lock);
 	list_add_tail(&irqfd->list, &kvm->irqfds);
 	mutex_unlock(&kvm->lock);
 
+	/*
+	 * do not drop the file until the irqfd is fully initialized, otherwise
+	 * we might race against the POLLHUP
+	 */
+	fput(file);
+
 	return 0;
 
 fail:
@@ -139,97 +200,17 @@ fail:
 	return ret;
 }
 
-static void
-irqfd_release(struct _irqfd *irqfd)
-{
-	/*
-	 * The ordering is important.  We must remove ourselves from the wqh
-	 * first to ensure no more event callbacks are issued, and then flush
-	 * any previously scheduled work prior to freeing the memory
-	 */
-	remove_wait_queue(irqfd->wqh, &irqfd->wait);
-
-	flush_work(&irqfd->work);
-
-	fput(irqfd->file);
-	kfree(irqfd);
-}
-
-static struct _irqfd *
-irqfd_remove(struct kvm *kvm, struct file *file, int gsi)
-{
-	struct _irqfd *irqfd;
-
-	mutex_lock(&kvm->lock);
-
-	/*
-	 * linear search isn't brilliant, but this should be an infrequent
-	 * slow-path operation, and the list should not grow very large
-	 */
-	list_for_each_entry(irqfd, &kvm->irqfds, list) {
-		if (irqfd->file != file || irqfd->gsi != gsi)
-			continue;
-
-		list_del(&irqfd->list);
-		mutex_unlock(&kvm->lock);
-
-		return irqfd;
-	}
-
-	mutex_unlock(&kvm->lock);
-
-	return NULL;
-}
-
-static int
-kvm_deassign_irqfd(struct kvm *kvm, int fd, int gsi)
-{
-	struct _irqfd *irqfd;
-	struct file *file;
-	int count = 0;
-
-	file = fget(fd);
-	if (IS_ERR(file))
-		return PTR_ERR(file);
-
-	while ((irqfd = irqfd_remove(kvm, file, gsi))) {
-		/*
-		 * We remove the item from the list under the lock, but we
-		 * free it outside the lock to avoid deadlocking with the
-		 * flush_work and the work_item taking the lock
-		 */
-		irqfd_release(irqfd);
-		count++;
-	}
-
-	fput(file);
-
-	return count ? count : -ENOENT;
-}
-
 void
 kvm_irqfd_init(struct kvm *kvm)
 {
 	INIT_LIST_HEAD(&kvm->irqfds);
 }
 
-int
-kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
-{
-	if (flags & KVM_IRQFD_FLAG_DEASSIGN)
-		return kvm_deassign_irqfd(kvm, fd, gsi);
-
-	return kvm_assign_irqfd(kvm, fd, gsi);
-}
-
 void
 kvm_irqfd_release(struct kvm *kvm)
 {
 	struct _irqfd *irqfd, *tmp;
 
-	/* don't bother with the lock..we are shutting down */
-	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list) {
-		list_del(&irqfd->list);
-		irqfd_release(irqfd);
-	}
+	list_for_each_entry_safe(irqfd, tmp, &kvm->irqfds, list)
+		irqfd_disconnect(irqfd);
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 902fed9..a9f62bb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1029,7 +1029,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	spin_lock(&kvm_lock);
 	list_del(&kvm->vm_list);
 	spin_unlock(&kvm_lock);
-	kvm_irqfd_release(kvm);
 	kvm_free_irq_routing(kvm);
 	kvm_io_bus_destroy(&kvm->pio_bus);
 	kvm_io_bus_destroy(&kvm->mmio_bus);
@@ -1064,6 +1063,8 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 {
 	struct kvm *kvm = filp->private_data;
 
+	kvm_irqfd_release(kvm);
+
 	kvm_put_kvm(kvm);
 	return 0;
 }


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close()
  2009-06-04 12:48 [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Gregory Haskins
  2009-06-04 12:48 ` [KVM PATCH v2 1/2] Allow waiters to be notified about the eventfd file* going away, and give Gregory Haskins
  2009-06-04 12:48 ` [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl Gregory Haskins
@ 2009-06-04 14:02 ` Avi Kivity
  2009-06-12  3:58 ` Michael S. Tsirkin
  2009-06-12  4:08 ` Michael S. Tsirkin
  4 siblings, 0 replies; 25+ messages in thread
From: Avi Kivity @ 2009-06-04 14:02 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, davdel, mst, paulmck, akpm

Gregory Haskins wrote:
> (Applies to kvm.git/master:25deed73)
>
> Please see the header for 2/2 for a description.  This patch series has been
> fully tested and appears to be working correctly.
>
>   

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close()
  2009-06-04 12:48 [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Gregory Haskins
                   ` (2 preceding siblings ...)
  2009-06-04 14:02 ` [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Avi Kivity
@ 2009-06-12  3:58 ` Michael S. Tsirkin
  2009-06-12  4:08 ` Michael S. Tsirkin
  4 siblings, 0 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-12  3:58 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, davdel, paulmck, akpm

On Thu, Jun 04, 2009 at 08:48:02AM -0400, Gregory Haskins wrote:
> (Applies to kvm.git/master:25deed73)
> 
> Please see the header for 2/2 for a description.  This patch series has been
> fully tested and appears to be working correctly.
> 
> [Review notes:
>       *) Paul has looked at the SRCU design and, to my knowledge, didn't find
>          any holes.
>       *) Michael, Avi, and myself agree that while the removal of the DEASSIGN
>          vector is not desirable, the fix on close() is more important in
>          the short-term.  We can always add DEASSIGN support again in the
> 	 future with a CAP bit.
> ]

So, I've been thinking about this, and this approach has another
problem: it depends on pollhup on close which is AFAIK an
eventfd-specific feature. This will prevent us from supporting polling
other useful file types, such as sockets and pipes, down the road, with
this interface.

And there's DEASSIGN issue which is needed for migration and MSI vector
remapping.

I didn't realise these implications when I suggested deassign on close.
To me, it now looks like we are better off reverting this patch.
We can later add 'deassign on close' support with CAP bit after all :)

Avi, Gregory, what's your take?

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close()
  2009-06-04 12:48 [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Gregory Haskins
                   ` (3 preceding siblings ...)
  2009-06-12  3:58 ` Michael S. Tsirkin
@ 2009-06-12  4:08 ` Michael S. Tsirkin
  2009-06-14 12:38   ` Gregory Haskins
  4 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-12  4:08 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, davidel, paulmck, akpm

[ Resending with correct address for Davide. Pls don't reply
  to the original one, you'll get bounces. ]

On Thu, Jun 04, 2009 at 08:48:02AM -0400, Gregory Haskins wrote:
> (Applies to kvm.git/master:25deed73)
> 
> Please see the header for 2/2 for a description.  This patch series has been
> fully tested and appears to be working correctly.
> 
> [Review notes:
>       *) Paul has looked at the SRCU design and, to my knowledge, didn't find
>          any holes.
>       *) Michael, Avi, and myself agree that while the removal of the DEASSIGN
>          vector is not desirable, the fix on close() is more important in
>          the short-term.  We can always add DEASSIGN support again in the
> 	 future with a CAP bit.
> ]

So, I've been thinking about this, and this approach has another
problem: it depends on pollhup on close which is AFAIK an
eventfd-specific feature. This will prevent us from supporting polling
other useful file types, such as sockets and pipes, down the road, with
this interface.

And there's DEASSIGN issue which is needed for migration and MSI vector
remapping.

I didn't realise these implications when I suggested deassign on close.
To me, it now looks like we are better off reverting this patch.
We can later add 'deassign on close' support with CAP bit after all :)

Avi, Gregory, what's your take?

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-04 12:48 ` [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl Gregory Haskins
@ 2009-06-14 11:49   ` Michael S. Tsirkin
  2009-06-14 12:53     ` Gregory Haskins
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-14 11:49 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, davdel, paulmck, akpm

On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
> +static void
> +irqfd_disconnect(struct _irqfd *irqfd)
> +{
> +	struct kvm *kvm;
> +
> +	mutex_lock(&irqfd->lock);
> +
> +	kvm = rcu_dereference(irqfd->kvm);
> +	rcu_assign_pointer(irqfd->kvm, NULL);
> +
> +	mutex_unlock(&irqfd->lock);
> +
> +	if (!kvm)
> +		return;
>  
>  	mutex_lock(&kvm->lock);
> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> +	list_del(&irqfd->list);
>  	mutex_unlock(&kvm->lock);
> +
> +	/*
> +	 * It is important to not drop the kvm reference until the next grace
> +	 * period because there might be lockless references in flight up
> +	 * until then
> +	 */
> +	synchronize_srcu(&irqfd->srcu);
> +	kvm_put_kvm(kvm);
>  }

So irqfd object will persist after kvm goes away, until eventfd is closed?

>  
>  static int
>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>  {
>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> +	unsigned long flags = (unsigned long)key;
>  
> -	/*
> -	 * The wake_up is called with interrupts disabled.  Therefore we need
> -	 * to defer the IRQ injection until later since we need to acquire the
> -	 * kvm->lock to do so.
> -	 */
> -	schedule_work(&irqfd->work);
> +	if (flags & POLLIN)
> +		/*
> +		 * The POLLIN wake_up is called with interrupts disabled.
> +		 * Therefore we need to defer the IRQ injection until later
> +		 * since we need to acquire the kvm->lock to do so.
> +		 */
> +		schedule_work(&irqfd->inject);
> +
> +	if (flags & POLLHUP) {
> +		/*
> +		 * The POLLHUP is called unlocked, so it theoretically should
> +		 * be safe to remove ourselves from the wqh using the locked
> +		 * variant of remove_wait_queue()
> +		 */
> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +		flush_work(&irqfd->inject);
> +		irqfd_disconnect(irqfd);
> +
> +		cleanup_srcu_struct(&irqfd->srcu);
> +		kfree(irqfd);
> +	}
>  
>  	return 0;
>  }

And it is removed by this function when eventfd is closed.
But what prevents the kvm module from going away, meanwhile?

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close()
  2009-06-12  4:08 ` Michael S. Tsirkin
@ 2009-06-14 12:38   ` Gregory Haskins
  2009-06-14 12:51     ` Avi Kivity
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2009-06-14 12:38 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, avi, davidel, paulmck, akpm

[-- Attachment #1: Type: text/plain, Size: 2939 bytes --]

Michael S. Tsirkin wrote:
> [ Resending with correct address for Davide. Pls don't reply
>   to the original one, you'll get bounces. ]
>
> On Thu, Jun 04, 2009 at 08:48:02AM -0400, Gregory Haskins wrote:
>   
>> (Applies to kvm.git/master:25deed73)
>>
>> Please see the header for 2/2 for a description.  This patch series has been
>> fully tested and appears to be working correctly.
>>
>> [Review notes:
>>       *) Paul has looked at the SRCU design and, to my knowledge, didn't find
>>          any holes.
>>       *) Michael, Avi, and myself agree that while the removal of the DEASSIGN
>>          vector is not desirable, the fix on close() is more important in
>>          the short-term.  We can always add DEASSIGN support again in the
>> 	 future with a CAP bit.
>> ]
>>     
>
> So, I've been thinking about this, and this approach has another
> problem: it depends on pollhup on close which is AFAIK an
> eventfd-specific feature.

Thats ok with me, as we are already married to eventfd for other reasons
(see eventfd_fget()).

>  This will prevent us from supporting polling
> other useful file types, such as sockets and pipes, down the road, with
> this interface.
>   
I am thinking that we would add explicit support in the future if there
are other fd types that might want to also inject interrupts.  For
instance, perhaps POLLHUP is added to pipes if/when they are patched as
a valid transport for irqfd.   Or perhaps irqfd is abstracted such that
eventfd_fget/POLLHUP are eventfd specific assign/deassign implementation
details.

Another option is that we s/irqfd/irq-eventfd to leave room for other
interfaces like irq-pollfd, irq-socketfd, etc.  IOW, there is no reason
to make the current irqfd code "one-fd-interface to rule them all" per
se.  The real abstraction is the kvm_set_irq() + gsi interface anyway. 
The current irqfd code is a thin shim in front of that.  Perhaps each fd
type would be better served with code to specifically handle each type,
for its hard to predict what the requirements for translating, say, a
pipe-write into a gsi-inject will be apriori.

> And there's DEASSIGN issue which is needed for migration

Can you elaborate?  I would think you would need a new fd in the
migration case anyway and therefore a close()/kvm_irqfd() sequence is
required.

>  and MSI vector
> remapping.
>   

I would think this can be worked around with close()/kvm_irqfd()
workaround unitl the DEASSIGN cap bit is in place, but I may be
misunderstanding.
> I didn't realise these implications when I suggested deassign on close.
> To me, it now looks like we are better off reverting this patch.
> We can later add 'deassign on close' support with CAP bit after all :)
>
> Avi, Gregory, what's your take?
>
>   
I like the design with the single-call close in place, so my vote is to
keep it as it is now.

Thanks Michael,
-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close()
  2009-06-14 12:38   ` Gregory Haskins
@ 2009-06-14 12:51     ` Avi Kivity
  0 siblings, 0 replies; 25+ messages in thread
From: Avi Kivity @ 2009-06-14 12:51 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Michael S. Tsirkin, kvm, linux-kernel, davidel, paulmck, akpm

Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
>   
>> [ Resending with correct address for Davide. Pls don't reply
>>   to the original one, you'll get bounces. ]
>>
>> On Thu, Jun 04, 2009 at 08:48:02AM -0400, Gregory Haskins wrote:
>>   
>>     
>>> (Applies to kvm.git/master:25deed73)
>>>
>>> Please see the header for 2/2 for a description.  This patch series has been
>>> fully tested and appears to be working correctly.
>>>
>>> [Review notes:
>>>       *) Paul has looked at the SRCU design and, to my knowledge, didn't find
>>>          any holes.
>>>       *) Michael, Avi, and myself agree that while the removal of the DEASSIGN
>>>          vector is not desirable, the fix on close() is more important in
>>>          the short-term.  We can always add DEASSIGN support again in the
>>> 	 future with a CAP bit.
>>> ]
>>>     
>>>       
>> So, I've been thinking about this, and this approach has another
>> problem: it depends on pollhup on close which is AFAIK an
>> eventfd-specific feature.
>>     
>
> Thats ok with me, as we are already married to eventfd for other reasons
> (see eventfd_fget()).
>
>   
>>  This will prevent us from supporting polling
>> other useful file types, such as sockets and pipes, down the road, with
>> this interface.
>>   
>>     
> I am thinking that we would add explicit support in the future if there
> are other fd types that might want to also inject interrupts.  For
> instance, perhaps POLLHUP is added to pipes if/when they are patched as
> a valid transport for irqfd.   Or perhaps irqfd is abstracted such that
> eventfd_fget/POLLHUP are eventfd specific assign/deassign implementation
> details.
>
> Another option is that we s/irqfd/irq-eventfd to leave room for other
> interfaces like irq-pollfd, irq-socketfd, etc.  IOW, there is no reason
> to make the current irqfd code "one-fd-interface to rule them all" per
> se.  The real abstraction is the kvm_set_irq() + gsi interface anyway. 
> The current irqfd code is a thin shim in front of that.  Perhaps each fd
> type would be better served with code to specifically handle each type,
> for its hard to predict what the requirements for translating, say, a
> pipe-write into a gsi-inject will be apriori.
>
>   

I don't see a reason to avoid a monogamous relationship with eventfd as 
it exactly captures the essence of an raising an interrupt: events are 
coalesced and it doesn't block.  Since irqfd will rarely work by itself 
(need a separate data channel), having things like a tcp socket inject 
an interrupt are, while exotic, fairly useless.

>> I didn't realise these implications when I suggested deassign on close.
>> To me, it now looks like we are better off reverting this patch.
>> We can later add 'deassign on close' support with CAP bit after all :)
>>
>> Avi, Gregory, what's your take?
>>
>>   
>>     
> I like the design with the single-call close in place, so my vote is to
> keep it as it is now.
>   

We could work around it by allocating a gsi private to the eventfd, and 
when we want to mask the gsi, simply drop all its routes.  Hacky.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-14 11:49   ` Michael S. Tsirkin
@ 2009-06-14 12:53     ` Gregory Haskins
  2009-06-14 13:28       ` Michael S. Tsirkin
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2009-06-14 12:53 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, avi, davdel, paulmck, akpm

[-- Attachment #1: Type: text/plain, Size: 3263 bytes --]

Michael S. Tsirkin wrote:
> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
>   
>> +static void
>> +irqfd_disconnect(struct _irqfd *irqfd)
>> +{
>> +	struct kvm *kvm;
>> +
>> +	mutex_lock(&irqfd->lock);
>> +
>> +	kvm = rcu_dereference(irqfd->kvm);
>> +	rcu_assign_pointer(irqfd->kvm, NULL);
>> +
>> +	mutex_unlock(&irqfd->lock);
>> +
>> +	if (!kvm)
>> +		return;
>>  
>>  	mutex_lock(&kvm->lock);
>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>> +	list_del(&irqfd->list);
>>  	mutex_unlock(&kvm->lock);
>> +
>> +	/*
>> +	 * It is important to not drop the kvm reference until the next grace
>> +	 * period because there might be lockless references in flight up
>> +	 * until then
>> +	 */
>> +	synchronize_srcu(&irqfd->srcu);
>> +	kvm_put_kvm(kvm);
>>  }
>>     
>
> So irqfd object will persist after kvm goes away, until eventfd is closed?
>   

Yep, by design.  It becomes part of the eventfd and is thus associated
with its lifetime.  Consider it as if we made our own anon-fd
implementation for irqfd and the lifetime looks similar.  The difference
is that we are reusing eventfd and its interface semantics.
>   
>>  
>>  static int
>>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>  {
>>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>> +	unsigned long flags = (unsigned long)key;
>>  
>> -	/*
>> -	 * The wake_up is called with interrupts disabled.  Therefore we need
>> -	 * to defer the IRQ injection until later since we need to acquire the
>> -	 * kvm->lock to do so.
>> -	 */
>> -	schedule_work(&irqfd->work);
>> +	if (flags & POLLIN)
>> +		/*
>> +		 * The POLLIN wake_up is called with interrupts disabled.
>> +		 * Therefore we need to defer the IRQ injection until later
>> +		 * since we need to acquire the kvm->lock to do so.
>> +		 */
>> +		schedule_work(&irqfd->inject);
>> +
>> +	if (flags & POLLHUP) {
>> +		/*
>> +		 * The POLLHUP is called unlocked, so it theoretically should
>> +		 * be safe to remove ourselves from the wqh using the locked
>> +		 * variant of remove_wait_queue()
>> +		 */
>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>> +		flush_work(&irqfd->inject);
>> +		irqfd_disconnect(irqfd);
>> +
>> +		cleanup_srcu_struct(&irqfd->srcu);
>> +		kfree(irqfd);
>> +	}
>>  
>>  	return 0;
>>  }
>>     
>
> And it is removed by this function when eventfd is closed.
> But what prevents the kvm module from going away, meanwhile?
>   

Well, we hold a reference to struct kvm until we call
irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
closes first, we disassociate with kvm with the above quoted logic.  In
either case, we are holding a kvm reference up until that "disconnect"
point.  Therefore kvm should not be able to disappear before that
disconnect, and after that point we do not care.  If that is not
sufficient to prevent kvm.ko from going away in the middle, then IMO
kvm_get_kvm() has a bug, not irqfd. ;)  However, I believe everything is
actually ok here.

-Greg


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-14 12:53     ` Gregory Haskins
@ 2009-06-14 13:28       ` Michael S. Tsirkin
  2009-06-15  3:39         ` Gregory Haskins
  2009-06-15  3:48         ` Gregory Haskins
  0 siblings, 2 replies; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-14 13:28 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, davdel, paulmck, akpm

On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
> >   
> >> +static void
> >> +irqfd_disconnect(struct _irqfd *irqfd)
> >> +{
> >> +	struct kvm *kvm;
> >> +
> >> +	mutex_lock(&irqfd->lock);
> >> +
> >> +	kvm = rcu_dereference(irqfd->kvm);
> >> +	rcu_assign_pointer(irqfd->kvm, NULL);
> >> +
> >> +	mutex_unlock(&irqfd->lock);
> >> +
> >> +	if (!kvm)
> >> +		return;
> >>  
> >>  	mutex_lock(&kvm->lock);
> >> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >> +	list_del(&irqfd->list);
> >>  	mutex_unlock(&kvm->lock);
> >> +
> >> +	/*
> >> +	 * It is important to not drop the kvm reference until the next grace
> >> +	 * period because there might be lockless references in flight up
> >> +	 * until then
> >> +	 */
> >> +	synchronize_srcu(&irqfd->srcu);
> >> +	kvm_put_kvm(kvm);
> >>  }
> >>     
> >
> > So irqfd object will persist after kvm goes away, until eventfd is closed?
> >   
> 
> Yep, by design.  It becomes part of the eventfd and is thus associated
> with its lifetime.  Consider it as if we made our own anon-fd
> implementation for irqfd and the lifetime looks similar.  The difference
> is that we are reusing eventfd and its interface semantics.
> >   
> >>  
> >>  static int
> >>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >>  {
> >>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> >> +	unsigned long flags = (unsigned long)key;
> >>  
> >> -	/*
> >> -	 * The wake_up is called with interrupts disabled.  Therefore we need
> >> -	 * to defer the IRQ injection until later since we need to acquire the
> >> -	 * kvm->lock to do so.
> >> -	 */
> >> -	schedule_work(&irqfd->work);
> >> +	if (flags & POLLIN)
> >> +		/*
> >> +		 * The POLLIN wake_up is called with interrupts disabled.
> >> +		 * Therefore we need to defer the IRQ injection until later
> >> +		 * since we need to acquire the kvm->lock to do so.
> >> +		 */
> >> +		schedule_work(&irqfd->inject);
> >> +
> >> +	if (flags & POLLHUP) {
> >> +		/*
> >> +		 * The POLLHUP is called unlocked, so it theoretically should
> >> +		 * be safe to remove ourselves from the wqh using the locked
> >> +		 * variant of remove_wait_queue()
> >> +		 */
> >> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >> +		flush_work(&irqfd->inject);
> >> +		irqfd_disconnect(irqfd);
> >> +
> >> +		cleanup_srcu_struct(&irqfd->srcu);
> >> +		kfree(irqfd);
> >> +	}
> >>  
> >>  	return 0;
> >>  }
> >>     
> >
> > And it is removed by this function when eventfd is closed.
> > But what prevents the kvm module from going away, meanwhile?
> >   
> 
> Well, we hold a reference to struct kvm until we call
> irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
> all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
> closes first, we disassociate with kvm with the above quoted logic.  In
> either case, we are holding a kvm reference up until that "disconnect"
> point.  Therefore kvm should not be able to disappear before that
> disconnect, and after that point we do not care.

Yes, we do care.

Here's the scenario in more detail:

- kvm is closed
- irq disconnect is called
- kvm is put
- kvm module is removed: all irqs are disconnected
- eventfd closes and triggers callback into removed kvm module
- crash

> If that is not sufficient to prevent kvm.ko from going away in the
> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I
> believe everything is actually ok here.
> 
> -Greg
> 


BTW, why can't we remove irqfds in kvm_release?

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-14 13:28       ` Michael S. Tsirkin
@ 2009-06-15  3:39         ` Gregory Haskins
  2009-06-15  9:46           ` Michael S. Tsirkin
  2009-06-15  3:48         ` Gregory Haskins
  1 sibling, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2009-06-15  3:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, avi, davdel, paulmck, akpm

[-- Attachment #1: Type: text/plain, Size: 6116 bytes --]

Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> +static void
>>>> +irqfd_disconnect(struct _irqfd *irqfd)
>>>> +{
>>>> +	struct kvm *kvm;
>>>> +
>>>> +	mutex_lock(&irqfd->lock);
>>>> +
>>>> +	kvm = rcu_dereference(irqfd->kvm);
>>>> +	rcu_assign_pointer(irqfd->kvm, NULL);
>>>> +
>>>> +	mutex_unlock(&irqfd->lock);
>>>> +
>>>> +	if (!kvm)
>>>> +		return;
>>>>  
>>>>  	mutex_lock(&kvm->lock);
>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>> +	list_del(&irqfd->list);
>>>>  	mutex_unlock(&kvm->lock);
>>>> +
>>>> +	/*
>>>> +	 * It is important to not drop the kvm reference until the next grace
>>>> +	 * period because there might be lockless references in flight up
>>>> +	 * until then
>>>> +	 */
>>>> +	synchronize_srcu(&irqfd->srcu);
>>>> +	kvm_put_kvm(kvm);
>>>>  }
>>>>     
>>>>         
>>> So irqfd object will persist after kvm goes away, until eventfd is closed?
>>>   
>>>       
>> Yep, by design.  It becomes part of the eventfd and is thus associated
>> with its lifetime.  Consider it as if we made our own anon-fd
>> implementation for irqfd and the lifetime looks similar.  The difference
>> is that we are reusing eventfd and its interface semantics.
>>     
>>>   
>>>       
>>>>  
>>>>  static int
>>>>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>>>  {
>>>>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>>>> +	unsigned long flags = (unsigned long)key;
>>>>  
>>>> -	/*
>>>> -	 * The wake_up is called with interrupts disabled.  Therefore we need
>>>> -	 * to defer the IRQ injection until later since we need to acquire the
>>>> -	 * kvm->lock to do so.
>>>> -	 */
>>>> -	schedule_work(&irqfd->work);
>>>> +	if (flags & POLLIN)
>>>> +		/*
>>>> +		 * The POLLIN wake_up is called with interrupts disabled.
>>>> +		 * Therefore we need to defer the IRQ injection until later
>>>> +		 * since we need to acquire the kvm->lock to do so.
>>>> +		 */
>>>> +		schedule_work(&irqfd->inject);
>>>> +
>>>> +	if (flags & POLLHUP) {
>>>> +		/*
>>>> +		 * The POLLHUP is called unlocked, so it theoretically should
>>>> +		 * be safe to remove ourselves from the wqh using the locked
>>>> +		 * variant of remove_wait_queue()
>>>> +		 */
>>>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>> +		flush_work(&irqfd->inject);
>>>> +		irqfd_disconnect(irqfd);
>>>> +
>>>> +		cleanup_srcu_struct(&irqfd->srcu);
>>>> +		kfree(irqfd);
>>>> +	}
>>>>  
>>>>  	return 0;
>>>>  }
>>>>     
>>>>         
>>> And it is removed by this function when eventfd is closed.
>>> But what prevents the kvm module from going away, meanwhile?
>>>   
>>>       
>> Well, we hold a reference to struct kvm until we call
>> irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
>> all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
>> closes first, we disassociate with kvm with the above quoted logic.  In
>> either case, we are holding a kvm reference up until that "disconnect"
>> point.  Therefore kvm should not be able to disappear before that
>> disconnect, and after that point we do not care.
>>     
>
> Yes, we do care.
>
> Here's the scenario in more detail:
>
> - kvm is closed
> - irq disconnect is called
> - kvm is put
> - kvm module is removed: all irqs are disconnected
> - eventfd closes and triggers callback into removed kvm module
> - crash
>   

[ lightbulb turns on]

Ah, now I see the point you were making.  I thought you were talking
about the .text in kvm_set_irq() (which would be protected by my
kvm_get_kvm() reference afaict).  But you are actually talking about the
irqfd .text itself.  Indeed, you are correct that is this currently a
race.  Good catch!

>   
>> If that is not sufficient to prevent kvm.ko from going away in the
>> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I
>> believe everything is actually ok here.
>>
>> -Greg
>>
>>     
>
>
> BTW, why can't we remove irqfds in kvm_release?
>   

Well, this would be ideal but we run into that bi-directional reference
thing that we talked about earlier and we both agree is non-trivial to
solve.  Solving this locking problem would incidentally also pave the
way for restoring the DEASSIGN feature, so patches welcome!  In the
meantime, I think we can close the hole you found with the following
patch (build-tested only):

commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3
Author: Gregory Haskins <ghaskins@novell.com>
Date:   Sun Jun 14 23:37:49 2009 -0400

    KVM: make irqfd take kvm.ko module reference
   
    Michael Tsirkin pointed out that we currently have a race between
someone
    holding an irqfd reference and an rmmod against kvm.ko.  This patch
closes
    that hole by making sure that irqfd holds a kvm.ko reference for its
lifetime.
   
    Found-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Gregory Haskins <ghaskins@novell.com>

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2c8028c..67e4eca 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -29,6 +29,7 @@
 #include <linux/list.h>
 #include <linux/eventfd.h>
 #include <linux/srcu.h>
+#include <linux/module.h>
 
 /*
  * --------------------------------------------------------------------
@@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
sync, void
 *key)
 
                cleanup_srcu_struct(&irqfd->srcu);
                kfree(irqfd);
+               module_put(THIS_MODULE);
        }
 
        return 0;
@@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
        if (ret < 0)
                goto fail;
 
+       __module_get(THIS_MODULE);
        kvm_get_kvm(kvm);
 
        mutex_lock(&kvm->lock);




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-14 13:28       ` Michael S. Tsirkin
  2009-06-15  3:39         ` Gregory Haskins
@ 2009-06-15  3:48         ` Gregory Haskins
  1 sibling, 0 replies; 25+ messages in thread
From: Gregory Haskins @ 2009-06-15  3:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, avi, Davide Libenzi, paulmck, akpm


[-- Attachment #1.1: Type: text/plain, Size: 6221 bytes --]

[ restoring poor Davide's proper email address.  Sorry for the constant
fat-fingering of your addr, Davide! ]

Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> +static void
>>>> +irqfd_disconnect(struct _irqfd *irqfd)
>>>> +{
>>>> +	struct kvm *kvm;
>>>> +
>>>> +	mutex_lock(&irqfd->lock);
>>>> +
>>>> +	kvm = rcu_dereference(irqfd->kvm);
>>>> +	rcu_assign_pointer(irqfd->kvm, NULL);
>>>> +
>>>> +	mutex_unlock(&irqfd->lock);
>>>> +
>>>> +	if (!kvm)
>>>> +		return;
>>>>  
>>>>  	mutex_lock(&kvm->lock);
>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>> +	list_del(&irqfd->list);
>>>>  	mutex_unlock(&kvm->lock);
>>>> +
>>>> +	/*
>>>> +	 * It is important to not drop the kvm reference until the next grace
>>>> +	 * period because there might be lockless references in flight up
>>>> +	 * until then
>>>> +	 */
>>>> +	synchronize_srcu(&irqfd->srcu);
>>>> +	kvm_put_kvm(kvm);
>>>>  }
>>>>     
>>>>         
>>> So irqfd object will persist after kvm goes away, until eventfd is closed?
>>>   
>>>       
>> Yep, by design.  It becomes part of the eventfd and is thus associated
>> with its lifetime.  Consider it as if we made our own anon-fd
>> implementation for irqfd and the lifetime looks similar.  The difference
>> is that we are reusing eventfd and its interface semantics.
>>     
>>>   
>>>       
>>>>  
>>>>  static int
>>>>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>>>  {
>>>>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>>>> +	unsigned long flags = (unsigned long)key;
>>>>  
>>>> -	/*
>>>> -	 * The wake_up is called with interrupts disabled.  Therefore we need
>>>> -	 * to defer the IRQ injection until later since we need to acquire the
>>>> -	 * kvm->lock to do so.
>>>> -	 */
>>>> -	schedule_work(&irqfd->work);
>>>> +	if (flags & POLLIN)
>>>> +		/*
>>>> +		 * The POLLIN wake_up is called with interrupts disabled.
>>>> +		 * Therefore we need to defer the IRQ injection until later
>>>> +		 * since we need to acquire the kvm->lock to do so.
>>>> +		 */
>>>> +		schedule_work(&irqfd->inject);
>>>> +
>>>> +	if (flags & POLLHUP) {
>>>> +		/*
>>>> +		 * The POLLHUP is called unlocked, so it theoretically should
>>>> +		 * be safe to remove ourselves from the wqh using the locked
>>>> +		 * variant of remove_wait_queue()
>>>> +		 */
>>>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>> +		flush_work(&irqfd->inject);
>>>> +		irqfd_disconnect(irqfd);
>>>> +
>>>> +		cleanup_srcu_struct(&irqfd->srcu);
>>>> +		kfree(irqfd);
>>>> +	}
>>>>  
>>>>  	return 0;
>>>>  }
>>>>     
>>>>         
>>> And it is removed by this function when eventfd is closed.
>>> But what prevents the kvm module from going away, meanwhile?
>>>   
>>>       
>> Well, we hold a reference to struct kvm until we call
>> irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
>> all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
>> closes first, we disassociate with kvm with the above quoted logic.  In
>> either case, we are holding a kvm reference up until that "disconnect"
>> point.  Therefore kvm should not be able to disappear before that
>> disconnect, and after that point we do not care.
>>     
>
> Yes, we do care.
>
> Here's the scenario in more detail:
>
> - kvm is closed
> - irq disconnect is called
> - kvm is put
> - kvm module is removed: all irqs are disconnected
> - eventfd closes and triggers callback into removed kvm module
> - crash
>   

[ lightbulb turns on]

Ah, now I see the point you were making.  I thought you were talking
about the .text in kvm_set_irq() (which would be protected by my
kvm_get_kvm() reference afaict).  But you are actually talking about the
irqfd .text itself.  Indeed, you are correct that is this currently a
race.  Good catch!

>   
>> If that is not sufficient to prevent kvm.ko from going away in the
>> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I
>> believe everything is actually ok here.
>>
>> -Greg
>>
>>     
>
>
> BTW, why can't we remove irqfds in kvm_release?
>   

Well, this would be ideal but we run into that bi-directional reference
thing that we talked about earlier and we both agree is non-trivial to
solve.  Solving this locking problem would incidentally also pave the
way for restoring the DEASSIGN feature, so patches welcome!  In the
meantime, I think we can close the hole you found with the following
patch (build-tested only):

commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3
Author: Gregory Haskins <ghaskins@novell.com>
Date:   Sun Jun 14 23:37:49 2009 -0400

    KVM: make irqfd take kvm.ko module reference

    Michael Tsirkin pointed out that we currently have a race between
someone
    holding an irqfd reference and an rmmod against kvm.ko.  This patch
closes
    that hole by making sure that irqfd holds a kvm.ko reference for its
lifetime.

    Found-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Gregory Haskins <ghaskins@novell.com>

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2c8028c..67e4eca 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -29,6 +29,7 @@
 #include <linux/list.h>
 #include <linux/eventfd.h>
 #include <linux/srcu.h>
+#include <linux/module.h>

 /*
  * --------------------------------------------------------------------
@@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
sync, void
 *key)

                cleanup_srcu_struct(&irqfd->srcu);
                kfree(irqfd);
+               module_put(THIS_MODULE);
        }

        return 0;
@@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
        if (ret < 0)
                goto fail;

+       __module_get(THIS_MODULE);
        kvm_get_kvm(kvm);

        mutex_lock(&kvm->lock);





[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 267 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-15  3:39         ` Gregory Haskins
@ 2009-06-15  9:46           ` Michael S. Tsirkin
  2009-06-15 12:08             ` Gregory Haskins
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-15  9:46 UTC (permalink / raw)
  To: Gregory Haskins; +Cc: kvm, linux-kernel, avi, davidel, paulmck, akpm

On Sun, Jun 14, 2009 at 11:39:21PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >>     
> >>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
> >>>   
> >>>       
> >>>> +static void
> >>>> +irqfd_disconnect(struct _irqfd *irqfd)
> >>>> +{
> >>>> +	struct kvm *kvm;
> >>>> +
> >>>> +	mutex_lock(&irqfd->lock);
> >>>> +
> >>>> +	kvm = rcu_dereference(irqfd->kvm);
> >>>> +	rcu_assign_pointer(irqfd->kvm, NULL);
> >>>> +
> >>>> +	mutex_unlock(&irqfd->lock);
> >>>> +
> >>>> +	if (!kvm)
> >>>> +		return;
> >>>>  
> >>>>  	mutex_lock(&kvm->lock);
> >>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
> >>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> >>>> +	list_del(&irqfd->list);
> >>>>  	mutex_unlock(&kvm->lock);
> >>>> +
> >>>> +	/*
> >>>> +	 * It is important to not drop the kvm reference until the next grace
> >>>> +	 * period because there might be lockless references in flight up
> >>>> +	 * until then
> >>>> +	 */
> >>>> +	synchronize_srcu(&irqfd->srcu);
> >>>> +	kvm_put_kvm(kvm);
> >>>>  }
> >>>>     
> >>>>         
> >>> So irqfd object will persist after kvm goes away, until eventfd is closed?
> >>>   
> >>>       
> >> Yep, by design.  It becomes part of the eventfd and is thus associated
> >> with its lifetime.  Consider it as if we made our own anon-fd
> >> implementation for irqfd and the lifetime looks similar.  The difference
> >> is that we are reusing eventfd and its interface semantics.
> >>     
> >>>   
> >>>       
> >>>>  
> >>>>  static int
> >>>>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> >>>>  {
> >>>>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
> >>>> +	unsigned long flags = (unsigned long)key;
> >>>>  
> >>>> -	/*
> >>>> -	 * The wake_up is called with interrupts disabled.  Therefore we need
> >>>> -	 * to defer the IRQ injection until later since we need to acquire the
> >>>> -	 * kvm->lock to do so.
> >>>> -	 */
> >>>> -	schedule_work(&irqfd->work);
> >>>> +	if (flags & POLLIN)
> >>>> +		/*
> >>>> +		 * The POLLIN wake_up is called with interrupts disabled.
> >>>> +		 * Therefore we need to defer the IRQ injection until later
> >>>> +		 * since we need to acquire the kvm->lock to do so.
> >>>> +		 */
> >>>> +		schedule_work(&irqfd->inject);
> >>>> +
> >>>> +	if (flags & POLLHUP) {
> >>>> +		/*
> >>>> +		 * The POLLHUP is called unlocked, so it theoretically should
> >>>> +		 * be safe to remove ourselves from the wqh using the locked
> >>>> +		 * variant of remove_wait_queue()
> >>>> +		 */
> >>>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
> >>>> +		flush_work(&irqfd->inject);
> >>>> +		irqfd_disconnect(irqfd);
> >>>> +
> >>>> +		cleanup_srcu_struct(&irqfd->srcu);
> >>>> +		kfree(irqfd);
> >>>> +	}
> >>>>  
> >>>>  	return 0;
> >>>>  }
> >>>>     
> >>>>         
> >>> And it is removed by this function when eventfd is closed.
> >>> But what prevents the kvm module from going away, meanwhile?
> >>>   
> >>>       
> >> Well, we hold a reference to struct kvm until we call
> >> irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
> >> all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
> >> closes first, we disassociate with kvm with the above quoted logic.  In
> >> either case, we are holding a kvm reference up until that "disconnect"
> >> point.  Therefore kvm should not be able to disappear before that
> >> disconnect, and after that point we do not care.
> >>     
> >
> > Yes, we do care.
> >
> > Here's the scenario in more detail:
> >
> > - kvm is closed
> > - irq disconnect is called
> > - kvm is put
> > - kvm module is removed: all irqs are disconnected
> > - eventfd closes and triggers callback into removed kvm module
> > - crash
> >   
> 
> [ lightbulb turns on]
> 
> Ah, now I see the point you were making.  I thought you were talking
> about the .text in kvm_set_irq() (which would be protected by my
> kvm_get_kvm() reference afaict).  But you are actually talking about the
> irqfd .text itself.  Indeed, you are correct that is this currently a
> race.  Good catch!
> 
> >   
> >> If that is not sufficient to prevent kvm.ko from going away in the
> >> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I
> >> believe everything is actually ok here.
> >>
> >> -Greg
> >>
> >>     
> >
> >
> > BTW, why can't we remove irqfds in kvm_release?
> >   
> 
> Well, this would be ideal but we run into that bi-directional reference
> thing that we talked about earlier and we both agree is non-trivial to
> solve.  Solving this locking problem would incidentally also pave the
> way for restoring the DEASSIGN feature, so patches welcome!

So far the only workable approach that I see is reverting the POLLHUP
patch. I agree it looks pretty, but DEASSIGN and closing the races is
more important IMO. And locking will definitely become much simpler.

> In the meantime, I think we can close the hole you found with the
> following patch (build-tested only):
> 
> commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3
> Author: Gregory Haskins <ghaskins@novell.com>
> Date:   Sun Jun 14 23:37:49 2009 -0400
> 
>     KVM: make irqfd take kvm.ko module reference
>    
>     Michael Tsirkin pointed out that we currently have a race between someone
>     holding an irqfd reference and an rmmod against kvm.ko.  This patch closes
>     that hole by making sure that irqfd holds a kvm.ko reference for its lifetime.
>    
>     Found-by: Michael S. Tsirkin <mst@redhat.com>
>     Signed-off-by: Gregory Haskins <ghaskins@novell.com>
> 
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 2c8028c..67e4eca 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -29,6 +29,7 @@
>  #include <linux/list.h>
>  #include <linux/eventfd.h>
>  #include <linux/srcu.h>
> +#include <linux/module.h>
>  
>  /*
>   * --------------------------------------------------------------------
> @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
> sync, void
>  *key)
>  
>                 cleanup_srcu_struct(&irqfd->srcu);
>                 kfree(irqfd);
> +               module_put(THIS_MODULE);
>         }
>  
>         return 0;

module_put(THIS_MODULE) is always a bug unless you know that someone has
a reference to the current module: the module could go away between this
call and returning from function.

> @@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>         if (ret < 0)
>                 goto fail;
>  
> +       __module_get(THIS_MODULE);
>         kvm_get_kvm(kvm);
>  
>         mutex_lock(&kvm->lock);
> 
> 
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-15  9:46           ` Michael S. Tsirkin
@ 2009-06-15 12:08             ` Gregory Haskins
  2009-06-15 12:54               ` Michael S. Tsirkin
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2009-06-15 12:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, linux-kernel, avi, davidel, paulmck, akpm

[-- Attachment #1: Type: text/plain, Size: 7997 bytes --]

Michael S. Tsirkin wrote:
> On Sun, Jun 14, 2009 at 11:39:21PM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Sun, Jun 14, 2009 at 08:53:11AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> Michael S. Tsirkin wrote:
>>>>     
>>>>         
>>>>> On Thu, Jun 04, 2009 at 08:48:12AM -0400, Gregory Haskins wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> +static void
>>>>>> +irqfd_disconnect(struct _irqfd *irqfd)
>>>>>> +{
>>>>>> +	struct kvm *kvm;
>>>>>> +
>>>>>> +	mutex_lock(&irqfd->lock);
>>>>>> +
>>>>>> +	kvm = rcu_dereference(irqfd->kvm);
>>>>>> +	rcu_assign_pointer(irqfd->kvm, NULL);
>>>>>> +
>>>>>> +	mutex_unlock(&irqfd->lock);
>>>>>> +
>>>>>> +	if (!kvm)
>>>>>> +		return;
>>>>>>  
>>>>>>  	mutex_lock(&kvm->lock);
>>>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
>>>>>> -	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
>>>>>> +	list_del(&irqfd->list);
>>>>>>  	mutex_unlock(&kvm->lock);
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * It is important to not drop the kvm reference until the next grace
>>>>>> +	 * period because there might be lockless references in flight up
>>>>>> +	 * until then
>>>>>> +	 */
>>>>>> +	synchronize_srcu(&irqfd->srcu);
>>>>>> +	kvm_put_kvm(kvm);
>>>>>>  }
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> So irqfd object will persist after kvm goes away, until eventfd is closed?
>>>>>   
>>>>>       
>>>>>           
>>>> Yep, by design.  It becomes part of the eventfd and is thus associated
>>>> with its lifetime.  Consider it as if we made our own anon-fd
>>>> implementation for irqfd and the lifetime looks similar.  The difference
>>>> is that we are reusing eventfd and its interface semantics.
>>>>     
>>>>         
>>>>>   
>>>>>       
>>>>>           
>>>>>>  
>>>>>>  static int
>>>>>>  irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>>>>>  {
>>>>>>  	struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait);
>>>>>> +	unsigned long flags = (unsigned long)key;
>>>>>>  
>>>>>> -	/*
>>>>>> -	 * The wake_up is called with interrupts disabled.  Therefore we need
>>>>>> -	 * to defer the IRQ injection until later since we need to acquire the
>>>>>> -	 * kvm->lock to do so.
>>>>>> -	 */
>>>>>> -	schedule_work(&irqfd->work);
>>>>>> +	if (flags & POLLIN)
>>>>>> +		/*
>>>>>> +		 * The POLLIN wake_up is called with interrupts disabled.
>>>>>> +		 * Therefore we need to defer the IRQ injection until later
>>>>>> +		 * since we need to acquire the kvm->lock to do so.
>>>>>> +		 */
>>>>>> +		schedule_work(&irqfd->inject);
>>>>>> +
>>>>>> +	if (flags & POLLHUP) {
>>>>>> +		/*
>>>>>> +		 * The POLLHUP is called unlocked, so it theoretically should
>>>>>> +		 * be safe to remove ourselves from the wqh using the locked
>>>>>> +		 * variant of remove_wait_queue()
>>>>>> +		 */
>>>>>> +		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>>>>>> +		flush_work(&irqfd->inject);
>>>>>> +		irqfd_disconnect(irqfd);
>>>>>> +
>>>>>> +		cleanup_srcu_struct(&irqfd->srcu);
>>>>>> +		kfree(irqfd);
>>>>>> +	}
>>>>>>  
>>>>>>  	return 0;
>>>>>>  }
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> And it is removed by this function when eventfd is closed.
>>>>> But what prevents the kvm module from going away, meanwhile?
>>>>>   
>>>>>       
>>>>>           
>>>> Well, we hold a reference to struct kvm until we call
>>>> irqfd_disconnect().  If kvm closes first, we disconnect and disassociate
>>>> all references to kvm leaving irqfd->kvm = NULL.  Likewise, if irqfd
>>>> closes first, we disassociate with kvm with the above quoted logic.  In
>>>> either case, we are holding a kvm reference up until that "disconnect"
>>>> point.  Therefore kvm should not be able to disappear before that
>>>> disconnect, and after that point we do not care.
>>>>     
>>>>         
>>> Yes, we do care.
>>>
>>> Here's the scenario in more detail:
>>>
>>> - kvm is closed
>>> - irq disconnect is called
>>> - kvm is put
>>> - kvm module is removed: all irqs are disconnected
>>> - eventfd closes and triggers callback into removed kvm module
>>> - crash
>>>   
>>>       
>> [ lightbulb turns on]
>>
>> Ah, now I see the point you were making.  I thought you were talking
>> about the .text in kvm_set_irq() (which would be protected by my
>> kvm_get_kvm() reference afaict).  But you are actually talking about the
>> irqfd .text itself.  Indeed, you are correct that is this currently a
>> race.  Good catch!
>>
>>     
>>>   
>>>       
>>>> If that is not sufficient to prevent kvm.ko from going away in the
>>>> middle, then IMO kvm_get_kvm() has a bug, not irqfd. ;) However, I
>>>> believe everything is actually ok here.
>>>>
>>>> -Greg
>>>>
>>>>     
>>>>         
>>> BTW, why can't we remove irqfds in kvm_release?
>>>   
>>>       
>> Well, this would be ideal but we run into that bi-directional reference
>> thing that we talked about earlier and we both agree is non-trivial to
>> solve.  Solving this locking problem would incidentally also pave the
>> way for restoring the DEASSIGN feature, so patches welcome!
>>     
>
> So far the only workable approach that I see is reverting the POLLHUP
> patch. I agree it looks pretty, but DEASSIGN and closing the races is
> more important IMO. And locking will definitely become much simpler.
>
>   
>> In the meantime, I think we can close the hole you found with the
>> following patch (build-tested only):
>>
>> commit f3a8dccc9e815599438e9feb0ea53e8eb10ad2b3
>> Author: Gregory Haskins <ghaskins@novell.com>
>> Date:   Sun Jun 14 23:37:49 2009 -0400
>>
>>     KVM: make irqfd take kvm.ko module reference
>>    
>>     Michael Tsirkin pointed out that we currently have a race between someone
>>     holding an irqfd reference and an rmmod against kvm.ko.  This patch closes
>>     that hole by making sure that irqfd holds a kvm.ko reference for its lifetime.
>>    
>>     Found-by: Michael S. Tsirkin <mst@redhat.com>
>>     Signed-off-by: Gregory Haskins <ghaskins@novell.com>
>>
>> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
>> index 2c8028c..67e4eca 100644
>> --- a/virt/kvm/eventfd.c
>> +++ b/virt/kvm/eventfd.c
>> @@ -29,6 +29,7 @@
>>  #include <linux/list.h>
>>  #include <linux/eventfd.h>
>>  #include <linux/srcu.h>
>> +#include <linux/module.h>
>>  
>>  /*
>>   * --------------------------------------------------------------------
>> @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
>> sync, void
>>  *key)
>>  
>>                 cleanup_srcu_struct(&irqfd->srcu);
>>                 kfree(irqfd);
>> +               module_put(THIS_MODULE);
>>         }
>>  
>>         return 0;
>>     
>
> module_put(THIS_MODULE) is always a bug unless you know that someone has
> a reference to the current module: the module could go away between this
> call and returning from function.
>   

Hmm.  I understand what you are saying conceptually (i.e. the .text
could get yanked before we hit the next line of code, in this case the
"return 0").  However, holding a reference when you _know_ someone else
holds a reference to me says that one of the references is redundant. 
In addition, there is certainly plenty of precedence for
module_put(THIS_MODULE) all throughout the kernel (including
module_put_and_exit()).  Are those broken as well?

In any case, one of the patches I have in queue to push to Davide for
eventfd may provide a good solution to this problem as well, so I will
get that polished up today.

Thanks Michael,
-Greg

>   
>> @@ -176,6 +178,7 @@ kvm_irqfd(struct kvm *kvm, int fd, int gsi, int flags)
>>         if (ret < 0)
>>                 goto fail;
>>  
>> +       __module_get(THIS_MODULE);
>>         kvm_get_kvm(kvm);
>>  
>>         mutex_lock(&kvm->lock);
>>
>>
>>
>>     
>
>
>   



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-15 12:08             ` Gregory Haskins
@ 2009-06-15 12:54               ` Michael S. Tsirkin
  2009-06-18  5:16                 ` Rusty Russell
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-15 12:54 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: kvm, linux-kernel, avi, davidel, paulmck, akpm, Rusty Russell

On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
> >> @@ -123,6 +124,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int
> >> sync, void
> >>  *key)
> >>  
> >>                 cleanup_srcu_struct(&irqfd->srcu);
> >>                 kfree(irqfd);
> >> +               module_put(THIS_MODULE);
> >>         }
> >>  
> >>         return 0;
> >>     
> >
> > module_put(THIS_MODULE) is always a bug unless you know that someone has
> > a reference to the current module: the module could go away between this
> > call and returning from function.
> >   
> 
> Hmm.  I understand what you are saying conceptually (i.e. the .text
> could get yanked before we hit the next line of code, in this case the
> "return 0").  However, holding a reference when you _know_ someone else
> holds a reference to me says that one of the references is redundant. 
> In addition, there is certainly plenty of precedence for
> module_put(THIS_MODULE) all throughout the kernel (including
> module_put_and_exit()).  Are those broken as well?

Maybe not, but I don't know why. It works fine as long as you don't
unload any modules though :) Rusty, could you enlighten us please?


-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-15 12:54               ` Michael S. Tsirkin
@ 2009-06-18  5:16                 ` Rusty Russell
  2009-06-18  6:49                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 25+ messages in thread
From: Rusty Russell @ 2009-06-18  5:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Haskins, kvm, linux-kernel, avi, davidel, paulmck, akpm

On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
> > Hmm.  I understand what you are saying conceptually (i.e. the .text
> > could get yanked before we hit the next line of code, in this case the
> > "return 0").  However, holding a reference when you _know_ someone else
> > holds a reference to me says that one of the references is redundant.
> > In addition, there is certainly plenty of precedence for
> > module_put(THIS_MODULE) all throughout the kernel (including
> > module_put_and_exit()).  Are those broken as well?
>
> Maybe not, but I don't know why. It works fine as long as you don't
> unload any modules though :) Rusty, could you enlighten us please?

Yep, they're almost all broken.  A few have comments indicating that someone 
else is holding a reference (eg. loopback).

But at some point you give up playing whack-a-mole for random drivers.

module_put_and_exit() does *not* have this problem, BTW.

Rusty.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-18  5:16                 ` Rusty Russell
@ 2009-06-18  6:49                   ` Michael S. Tsirkin
  2009-06-18 12:00                     ` Gregory Haskins
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-18  6:49 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Gregory Haskins, kvm, linux-kernel, avi, davidel, paulmck, akpm

On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote:
> On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote:
> > On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
> > > Hmm.  I understand what you are saying conceptually (i.e. the .text
> > > could get yanked before we hit the next line of code, in this case the
> > > "return 0").  However, holding a reference when you _know_ someone else
> > > holds a reference to me says that one of the references is redundant.
> > > In addition, there is certainly plenty of precedence for
> > > module_put(THIS_MODULE) all throughout the kernel (including
> > > module_put_and_exit()).  Are those broken as well?
> >
> > Maybe not, but I don't know why. It works fine as long as you don't
> > unload any modules though :) Rusty, could you enlighten us please?
> 
> Yep, they're almost all broken.  A few have comments indicating that someone 
> else is holding a reference (eg. loopback).
> 
> But at some point you give up playing whack-a-mole for random drivers.
> 
> module_put_and_exit() does *not* have this problem, BTW.
> 
> Rusty.

I see that, the .text for module_put_and_exit is never modular itself.
Thanks, Rusty!

BTW, Gregory, this can be used to fix the race in the design: create a
thread and let it drop the module reference with module_put_and_exit.
Which will work, but I guess at this point we should ask ourselves
whether all the hearburn with srcu, threads and module references is
better than just asking the user to call and ioctl.

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-18  6:49                   ` Michael S. Tsirkin
@ 2009-06-18 12:00                     ` Gregory Haskins
  2009-06-18 12:22                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2009-06-18 12:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, kvm, linux-kernel, avi, davidel, paulmck, akpm

[-- Attachment #1: Type: text/plain, Size: 2850 bytes --]

Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote:
>   
>> On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote:
>>     
>>> On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
>>>       
>>>> Hmm.  I understand what you are saying conceptually (i.e. the .text
>>>> could get yanked before we hit the next line of code, in this case the
>>>> "return 0").  However, holding a reference when you _know_ someone else
>>>> holds a reference to me says that one of the references is redundant.
>>>> In addition, there is certainly plenty of precedence for
>>>> module_put(THIS_MODULE) all throughout the kernel (including
>>>> module_put_and_exit()).  Are those broken as well?
>>>>         
>>> Maybe not, but I don't know why. It works fine as long as you don't
>>> unload any modules though :) Rusty, could you enlighten us please?
>>>       
>> Yep, they're almost all broken.  A few have comments indicating that someone 
>> else is holding a reference (eg. loopback).
>>
>> But at some point you give up playing whack-a-mole for random drivers.
>>
>> module_put_and_exit() does *not* have this problem, BTW.
>>
>> Rusty.
>>     
>
> I see that, the .text for module_put_and_exit is never modular itself.
> Thanks, Rusty!
>   

Ah!  That is the trick I wasn't understanding.
> BTW, Gregory, this can be used to fix the race in the design: create a
> thread and let it drop the module reference with module_put_and_exit.
>   

I had thought of doing something like this initially too, but I think
its racy as well.  Ultimately, you need to make sure the eventfd
callback is completely out before its safe to run, and deferring to a
thread would not change this race.  The only sane way I can see to do
that is to have the caller infrastructure annotate the event somehow
(either directly with a module_put(), or indirectly with some kind of
state transition that can be tracked with something like
synchronize_sched().
> Which will work, but I guess at this point we should ask ourselves
> whether all the hearburn with srcu, threads and module references is
> better than just asking the user to call and ioctl.
>   

I am starting to agree with you, here. :)

Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
conversation re: the module_put() races.  I only tied it into the
current thread because the eventfd_notifier_register() thread gave me a
convenient way to hook some other context to do the module_put().  In
the long term, the srcu changes are for the can_sleep() stuff.  So on
that note, lets see if I can convince Davide that the srcu stuff is not
so evil before we revert the POLLHUP patches, since the module_put() fix
is trivial once that is in place.

Thanks Michael (and Rusty),
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-18 12:00                     ` Gregory Haskins
@ 2009-06-18 12:22                       ` Michael S. Tsirkin
  2009-06-18 14:03                         ` Gregory Haskins
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-18 12:22 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Rusty Russell, kvm, linux-kernel, avi, davidel, paulmck, akpm

On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2009 at 02:46:30PM +0930, Rusty Russell wrote:
> >   
> >> On Mon, 15 Jun 2009 10:24:39 pm Michael S. Tsirkin wrote:
> >>     
> >>> On Mon, Jun 15, 2009 at 08:08:18AM -0400, Gregory Haskins wrote:
> >>>       
> >>>> Hmm.  I understand what you are saying conceptually (i.e. the .text
> >>>> could get yanked before we hit the next line of code, in this case the
> >>>> "return 0").  However, holding a reference when you _know_ someone else
> >>>> holds a reference to me says that one of the references is redundant.
> >>>> In addition, there is certainly plenty of precedence for
> >>>> module_put(THIS_MODULE) all throughout the kernel (including
> >>>> module_put_and_exit()).  Are those broken as well?
> >>>>         
> >>> Maybe not, but I don't know why. It works fine as long as you don't
> >>> unload any modules though :) Rusty, could you enlighten us please?
> >>>       
> >> Yep, they're almost all broken.  A few have comments indicating that someone 
> >> else is holding a reference (eg. loopback).
> >>
> >> But at some point you give up playing whack-a-mole for random drivers.
> >>
> >> module_put_and_exit() does *not* have this problem, BTW.
> >>
> >> Rusty.
> >>     
> >
> > I see that, the .text for module_put_and_exit is never modular itself.
> > Thanks, Rusty!
> >   
> 
> Ah!  That is the trick I wasn't understanding.
> > BTW, Gregory, this can be used to fix the race in the design: create a
> > thread and let it drop the module reference with module_put_and_exit.
> >   
> 
> I had thought of doing something like this initially too, but I think
> its racy as well.  Ultimately, you need to make sure the eventfd
> callback is completely out before its safe to run, and deferring to a
> thread would not change this race.  The only sane way I can see to do
> that is to have the caller infrastructure annotate the event somehow
> (either directly with a module_put(), or indirectly with some kind of
> state transition that can be tracked with something like
> synchronize_sched().

Here's what one could do: create a thread for each irqfd, and increment
module ref count, put that thread to sleep.  When done with
irqfd, don't delete it and don't decrement module refcount, wake thread
instead.  thread kills irqfd and calls module_put_and_exit.

I don't think it's racy but I won't call it sane either.

> > Which will work, but I guess at this point we should ask ourselves
> > whether all the hearburn with srcu, threads and module references is
> > better than just asking the user to call and ioctl.
> >   
> 
> I am starting to agree with you, here. :)
> 
> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
> conversation re: the module_put() races.  I only tied it into the
> current thread because the eventfd_notifier_register() thread gave me a
> convenient way to hook some other context to do the module_put().  In
> the long term, the srcu changes are for the can_sleep() stuff.  So on
> that note, lets see if I can convince Davide that the srcu stuff is not
> so evil before we revert the POLLHUP patches, since the module_put() fix
> is trivial once that is in place.

Can this help with DEASSIGN as well? We need it for migration.

> Thanks Michael (and Rusty),
> -Greg
> 
> 



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-18 12:22                       ` Michael S. Tsirkin
@ 2009-06-18 14:03                         ` Gregory Haskins
  2009-06-18 14:35                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2009-06-18 14:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, kvm, linux-kernel, avi, davidel, paulmck, akpm

[-- Attachment #1: Type: text/plain, Size: 2632 bytes --]

Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>
>>> BTW, Gregory, this can be used to fix the race in the design: create a
>>> thread and let it drop the module reference with module_put_and_exit.
>>>   
>>>       
>> I had thought of doing something like this initially too, but I think
>> its racy as well.  Ultimately, you need to make sure the eventfd
>> callback is completely out before its safe to run, and deferring to a
>> thread would not change this race.  The only sane way I can see to do
>> that is to have the caller infrastructure annotate the event somehow
>> (either directly with a module_put(), or indirectly with some kind of
>> state transition that can be tracked with something like
>> synchronize_sched().
>>     
>
> Here's what one could do: create a thread for each irqfd, and increment
> module ref count, put that thread to sleep.  When done with
> irqfd, don't delete it and don't decrement module refcount, wake thread
> instead.  thread kills irqfd and calls module_put_and_exit.
>
> I don't think it's racy


I believe it is. How would you prevent the thread from doing the
module_put_and_exit() before the eventfd callback thread is known to
have exited the relevant .text section?

All this talk does give me an idea, tho.  Ill make a patch.

 

>   
>>> Which will work, but I guess at this point we should ask ourselves
>>> whether all the hearburn with srcu, threads and module references is
>>> better than just asking the user to call and ioctl.
>>>   
>>>       
>> I am starting to agree with you, here. :)
>>
>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
>> conversation re: the module_put() races.  I only tied it into the
>> current thread because the eventfd_notifier_register() thread gave me a
>> convenient way to hook some other context to do the module_put().  In
>> the long term, the srcu changes are for the can_sleep() stuff.  So on
>> that note, lets see if I can convince Davide that the srcu stuff is not
>> so evil before we revert the POLLHUP patches, since the module_put() fix
>> is trivial once that is in place.
>>     
>
> Can this help with DEASSIGN as well? We need it for migration.
>   

No, but afaict you do not need this for migration anyway.  Migrate the
GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
relevant across a migration anyway?  I would think not, but admittedly I
know little about how qemu/kvm migration actually works.

Regards,
-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-18 14:03                         ` Gregory Haskins
@ 2009-06-18 14:35                           ` Michael S. Tsirkin
  2009-06-18 16:29                             ` Gregory Haskins
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-18 14:35 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Rusty Russell, kvm, linux-kernel, avi, davidel, paulmck, akpm

On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >>     
> >>
> >>> BTW, Gregory, this can be used to fix the race in the design: create a
> >>> thread and let it drop the module reference with module_put_and_exit.
> >>>   
> >>>       
> >> I had thought of doing something like this initially too, but I think
> >> its racy as well.  Ultimately, you need to make sure the eventfd
> >> callback is completely out before its safe to run, and deferring to a
> >> thread would not change this race.  The only sane way I can see to do
> >> that is to have the caller infrastructure annotate the event somehow
> >> (either directly with a module_put(), or indirectly with some kind of
> >> state transition that can be tracked with something like
> >> synchronize_sched().
> >>     
> >
> > Here's what one could do: create a thread for each irqfd, and increment
> > module ref count, put that thread to sleep.  When done with
> > irqfd, don't delete it and don't decrement module refcount, wake thread
> > instead.  thread kills irqfd and calls module_put_and_exit.
> >
> > I don't think it's racy
> 
> 
> I believe it is. How would you prevent the thread from doing the
> module_put_and_exit() before the eventfd callback thread is known to
> have exited the relevant .text section?

Right.

> All this talk does give me an idea, tho.  Ill make a patch.

OK, but ask yourself whether this bag of tricks is worth it, and whether
we'll find another hole later. Let's reserve the trickiness for
fast-path, where it's needed, and keep at least the assign/deassign simple.

> >   
> >>> Which will work, but I guess at this point we should ask ourselves
> >>> whether all the hearburn with srcu, threads and module references is
> >>> better than just asking the user to call and ioctl.
> >>>   
> >>>       
> >> I am starting to agree with you, here. :)
> >>
> >> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
> >> conversation re: the module_put() races.  I only tied it into the
> >> current thread because the eventfd_notifier_register() thread gave me a
> >> convenient way to hook some other context to do the module_put().  In
> >> the long term, the srcu changes are for the can_sleep() stuff.  So on
> >> that note, lets see if I can convince Davide that the srcu stuff is not
> >> so evil before we revert the POLLHUP patches, since the module_put() fix
> >> is trivial once that is in place.
> >>     
> >
> > Can this help with DEASSIGN as well? We need it for migration.
> >   
> 
> No, but afaict you do not need this for migration anyway.  Migrate the
> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
> relevant across a migration anyway?  I would think not, but admittedly I
> know little about how qemu/kvm migration actually works.

Yes but that's not live migration. For live migration, the trick is that
you are running locally but send changes to remote guest.  For that, we
need to put qemu in the middle between the device and the guest, so it
can detect activity and update the remote side.

And the best way to do that is to take poll eventfd that device assigns
and push eventfd that kvm polls. To switch between this setup
and the one where kvm polls the ventfd from device directly,
you need deassign.

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-18 14:35                           ` Michael S. Tsirkin
@ 2009-06-18 16:29                             ` Gregory Haskins
  2009-06-19 15:37                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Haskins @ 2009-06-18 16:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, kvm, linux-kernel, avi, davidel, paulmck, akpm

[-- Attachment #1: Type: text/plain, Size: 4150 bytes --]

Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> Michael S. Tsirkin wrote:
>>>>     
>>>>
>>>>         
>>>>> BTW, Gregory, this can be used to fix the race in the design: create a
>>>>> thread and let it drop the module reference with module_put_and_exit.
>>>>>   
>>>>>       
>>>>>           
>>>> I had thought of doing something like this initially too, but I think
>>>> its racy as well.  Ultimately, you need to make sure the eventfd
>>>> callback is completely out before its safe to run, and deferring to a
>>>> thread would not change this race.  The only sane way I can see to do
>>>> that is to have the caller infrastructure annotate the event somehow
>>>> (either directly with a module_put(), or indirectly with some kind of
>>>> state transition that can be tracked with something like
>>>> synchronize_sched().
>>>>     
>>>>         
>>> Here's what one could do: create a thread for each irqfd, and increment
>>> module ref count, put that thread to sleep.  When done with
>>> irqfd, don't delete it and don't decrement module refcount, wake thread
>>> instead.  thread kills irqfd and calls module_put_and_exit.
>>>
>>> I don't think it's racy
>>>       
>> I believe it is. How would you prevent the thread from doing the
>> module_put_and_exit() before the eventfd callback thread is known to
>> have exited the relevant .text section?
>>     
>
> Right.
>
>   
>> All this talk does give me an idea, tho.  Ill make a patch.
>>     
>
> OK, but ask yourself whether this bag of tricks is worth it, and whether
> we'll find another hole later. Let's reserve the trickiness for
> fast-path, where it's needed, and keep at least the assign/deassign simple.
>   

Understood.  OTOH, going back to the model where two steps are needed
for close() is ugly too, so I don't want to just give up and revert that
fix too easily.  At some point we will call it one way or the other, but
I am not there quite yet.
>   
>>>   
>>>       
>>>>> Which will work, but I guess at this point we should ask ourselves
>>>>> whether all the hearburn with srcu, threads and module references is
>>>>> better than just asking the user to call and ioctl.
>>>>>   
>>>>>       
>>>>>           
>>>> I am starting to agree with you, here. :)
>>>>
>>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
>>>> conversation re: the module_put() races.  I only tied it into the
>>>> current thread because the eventfd_notifier_register() thread gave me a
>>>> convenient way to hook some other context to do the module_put().  In
>>>> the long term, the srcu changes are for the can_sleep() stuff.  So on
>>>> that note, lets see if I can convince Davide that the srcu stuff is not
>>>> so evil before we revert the POLLHUP patches, since the module_put() fix
>>>> is trivial once that is in place.
>>>>     
>>>>         
>>> Can this help with DEASSIGN as well? We need it for migration.
>>>   
>>>       
>> No, but afaict you do not need this for migration anyway.  Migrate the
>> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
>> relevant across a migration anyway?  I would think not, but admittedly I
>> know little about how qemu/kvm migration actually works.
>>     
>
> Yes but that's not live migration. For live migration, the trick is that
> you are running locally but send changes to remote guest.  For that, we
> need to put qemu in the middle between the device and the guest, so it
> can detect activity and update the remote side.
>
> And the best way to do that is to take poll eventfd that device assigns
> and push eventfd that kvm polls. To switch between this setup
> and the one where kvm polls the ventfd from device directly,
> you need deassign.
>   

So its still not clear why the distinction between
deassign-the-gsi-but-leave-the-fd-valid is needed over a simple
close().  Can you elaborate?

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-18 16:29                             ` Gregory Haskins
@ 2009-06-19 15:37                               ` Michael S. Tsirkin
  2009-06-19 16:07                                 ` Gregory Haskins
  0 siblings, 1 reply; 25+ messages in thread
From: Michael S. Tsirkin @ 2009-06-19 15:37 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Rusty Russell, kvm, linux-kernel, avi, davidel, paulmck, akpm

On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >>     
> >>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
> >>>   
> >>>       
> >>>> Michael S. Tsirkin wrote:
> >>>>     
> >>>>
> >>>>         
> >>>>> BTW, Gregory, this can be used to fix the race in the design: create a
> >>>>> thread and let it drop the module reference with module_put_and_exit.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> I had thought of doing something like this initially too, but I think
> >>>> its racy as well.  Ultimately, you need to make sure the eventfd
> >>>> callback is completely out before its safe to run, and deferring to a
> >>>> thread would not change this race.  The only sane way I can see to do
> >>>> that is to have the caller infrastructure annotate the event somehow
> >>>> (either directly with a module_put(), or indirectly with some kind of
> >>>> state transition that can be tracked with something like
> >>>> synchronize_sched().
> >>>>     
> >>>>         
> >>> Here's what one could do: create a thread for each irqfd, and increment
> >>> module ref count, put that thread to sleep.  When done with
> >>> irqfd, don't delete it and don't decrement module refcount, wake thread
> >>> instead.  thread kills irqfd and calls module_put_and_exit.
> >>>
> >>> I don't think it's racy
> >>>       
> >> I believe it is. How would you prevent the thread from doing the
> >> module_put_and_exit() before the eventfd callback thread is known to
> >> have exited the relevant .text section?
> >>     
> >
> > Right.
> >
> >   
> >> All this talk does give me an idea, tho.  Ill make a patch.
> >>     
> >
> > OK, but ask yourself whether this bag of tricks is worth it, and whether
> > we'll find another hole later. Let's reserve the trickiness for
> > fast-path, where it's needed, and keep at least the assign/deassign simple.
> >   
> 
> Understood.  OTOH, going back to the model where two steps are needed
> for close() is ugly too, so I don't want to just give up and revert that
> fix too easily.  At some point we will call it one way or the other, but
> I am not there quite yet.
> >   
> >>>   
> >>>       
> >>>>> Which will work, but I guess at this point we should ask ourselves
> >>>>> whether all the hearburn with srcu, threads and module references is
> >>>>> better than just asking the user to call and ioctl.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> I am starting to agree with you, here. :)
> >>>>
> >>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
> >>>> conversation re: the module_put() races.  I only tied it into the
> >>>> current thread because the eventfd_notifier_register() thread gave me a
> >>>> convenient way to hook some other context to do the module_put().  In
> >>>> the long term, the srcu changes are for the can_sleep() stuff.  So on
> >>>> that note, lets see if I can convince Davide that the srcu stuff is not
> >>>> so evil before we revert the POLLHUP patches, since the module_put() fix
> >>>> is trivial once that is in place.
> >>>>     
> >>>>         
> >>> Can this help with DEASSIGN as well? We need it for migration.
> >>>   
> >>>       
> >> No, but afaict you do not need this for migration anyway.  Migrate the
> >> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
> >> relevant across a migration anyway?  I would think not, but admittedly I
> >> know little about how qemu/kvm migration actually works.
> >>     
> >
> > Yes but that's not live migration. For live migration, the trick is that
> > you are running locally but send changes to remote guest.  For that, we
> > need to put qemu in the middle between the device and the guest, so it
> > can detect activity and update the remote side.
> >
> > And the best way to do that is to take poll eventfd that device assigns
> > and push eventfd that kvm polls. To switch between this setup
> > and the one where kvm polls the ventfd from device directly,
> > you need deassign.
> >   
> 
> So its still not clear why the distinction between
> deassign-the-gsi-but-leave-the-fd-valid is needed over a simple
> close().  Can you elaborate?


The fd needs to be left assigned to the device, so that we can poll
the fd and get events, then forward them to kvm.

-- 
MST

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
  2009-06-19 15:37                               ` Michael S. Tsirkin
@ 2009-06-19 16:07                                 ` Gregory Haskins
  0 siblings, 0 replies; 25+ messages in thread
From: Gregory Haskins @ 2009-06-19 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Rusty Russell, kvm, linux-kernel, avi, davidel, paulmck, akpm

[-- Attachment #1: Type: text/plain, Size: 5106 bytes --]

Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
>>>   
>>>       
>>>> Michael S. Tsirkin wrote:
>>>>     
>>>>         
>>>>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> Michael S. Tsirkin wrote:
>>>>>>     
>>>>>>
>>>>>>         
>>>>>>             
>>>>>>> BTW, Gregory, this can be used to fix the race in the design: create a
>>>>>>> thread and let it drop the module reference with module_put_and_exit.
>>>>>>>   
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> I had thought of doing something like this initially too, but I think
>>>>>> its racy as well.  Ultimately, you need to make sure the eventfd
>>>>>> callback is completely out before its safe to run, and deferring to a
>>>>>> thread would not change this race.  The only sane way I can see to do
>>>>>> that is to have the caller infrastructure annotate the event somehow
>>>>>> (either directly with a module_put(), or indirectly with some kind of
>>>>>> state transition that can be tracked with something like
>>>>>> synchronize_sched().
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Here's what one could do: create a thread for each irqfd, and increment
>>>>> module ref count, put that thread to sleep.  When done with
>>>>> irqfd, don't delete it and don't decrement module refcount, wake thread
>>>>> instead.  thread kills irqfd and calls module_put_and_exit.
>>>>>
>>>>> I don't think it's racy
>>>>>       
>>>>>           
>>>> I believe it is. How would you prevent the thread from doing the
>>>> module_put_and_exit() before the eventfd callback thread is known to
>>>> have exited the relevant .text section?
>>>>     
>>>>         
>>> Right.
>>>
>>>   
>>>       
>>>> All this talk does give me an idea, tho.  Ill make a patch.
>>>>     
>>>>         
>>> OK, but ask yourself whether this bag of tricks is worth it, and whether
>>> we'll find another hole later. Let's reserve the trickiness for
>>> fast-path, where it's needed, and keep at least the assign/deassign simple.
>>>   
>>>       
>> Understood.  OTOH, going back to the model where two steps are needed
>> for close() is ugly too, so I don't want to just give up and revert that
>> fix too easily.  At some point we will call it one way or the other, but
>> I am not there quite yet.
>>     
>>>   
>>>       
>>>>>   
>>>>>       
>>>>>           
>>>>>>> Which will work, but I guess at this point we should ask ourselves
>>>>>>> whether all the hearburn with srcu, threads and module references is
>>>>>>> better than just asking the user to call and ioctl.
>>>>>>>   
>>>>>>>       
>>>>>>>           
>>>>>>>               
>>>>>> I am starting to agree with you, here. :)
>>>>>>
>>>>>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
>>>>>> conversation re: the module_put() races.  I only tied it into the
>>>>>> current thread because the eventfd_notifier_register() thread gave me a
>>>>>> convenient way to hook some other context to do the module_put().  In
>>>>>> the long term, the srcu changes are for the can_sleep() stuff.  So on
>>>>>> that note, lets see if I can convince Davide that the srcu stuff is not
>>>>>> so evil before we revert the POLLHUP patches, since the module_put() fix
>>>>>> is trivial once that is in place.
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> Can this help with DEASSIGN as well? We need it for migration.
>>>>>   
>>>>>       
>>>>>           
>>>> No, but afaict you do not need this for migration anyway.  Migrate the
>>>> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
>>>> relevant across a migration anyway?  I would think not, but admittedly I
>>>> know little about how qemu/kvm migration actually works.
>>>>     
>>>>         
>>> Yes but that's not live migration. For live migration, the trick is that
>>> you are running locally but send changes to remote guest.  For that, we
>>> need to put qemu in the middle between the device and the guest, so it
>>> can detect activity and update the remote side.
>>>
>>> And the best way to do that is to take poll eventfd that device assigns
>>> and push eventfd that kvm polls. To switch between this setup
>>> and the one where kvm polls the ventfd from device directly,
>>> you need deassign.
>>>   
>>>       
>> So its still not clear why the distinction between
>> deassign-the-gsi-but-leave-the-fd-valid is needed over a simple
>> close().  Can you elaborate?
>>     
>
>
> The fd needs to be left assigned to the device, so that we can poll
> the fd and get events, then forward them to kvm.
>   

Ah, ok.  Now I get what you are trying to do.

Well, per the PM I sent you this morning, I figured out the magic to
resolve the locking issues.  So we should be able to add DEASSIGN logic
soon, I hope.

-Greg



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 266 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2009-06-19 16:07 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-04 12:48 [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Gregory Haskins
2009-06-04 12:48 ` [KVM PATCH v2 1/2] Allow waiters to be notified about the eventfd file* going away, and give Gregory Haskins
2009-06-04 12:48 ` [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl Gregory Haskins
2009-06-14 11:49   ` Michael S. Tsirkin
2009-06-14 12:53     ` Gregory Haskins
2009-06-14 13:28       ` Michael S. Tsirkin
2009-06-15  3:39         ` Gregory Haskins
2009-06-15  9:46           ` Michael S. Tsirkin
2009-06-15 12:08             ` Gregory Haskins
2009-06-15 12:54               ` Michael S. Tsirkin
2009-06-18  5:16                 ` Rusty Russell
2009-06-18  6:49                   ` Michael S. Tsirkin
2009-06-18 12:00                     ` Gregory Haskins
2009-06-18 12:22                       ` Michael S. Tsirkin
2009-06-18 14:03                         ` Gregory Haskins
2009-06-18 14:35                           ` Michael S. Tsirkin
2009-06-18 16:29                             ` Gregory Haskins
2009-06-19 15:37                               ` Michael S. Tsirkin
2009-06-19 16:07                                 ` Gregory Haskins
2009-06-15  3:48         ` Gregory Haskins
2009-06-04 14:02 ` [KVM PATCH v2 0/2] irqfd: use POLLHUP notification for close() Avi Kivity
2009-06-12  3:58 ` Michael S. Tsirkin
2009-06-12  4:08 ` Michael S. Tsirkin
2009-06-14 12:38   ` Gregory Haskins
2009-06-14 12:51     ` Avi Kivity

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.