* [RFC PATCH 1/2] sched/wait: Add add_wait_queue_priority()
@ 2020-10-26 17:53 David Woodhouse
  2020-10-26 17:53 ` [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
  2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse
  0 siblings, 2 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-26 17:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, linux-kernel, kvm

From: David Woodhouse <dwmw@amazon.co.uk>

This allows an exclusive wait_queue_entry to be added at the head of the
queue, instead of the tail as normal. Thus, it gets to consume events
first.

The problem I'm trying to solve here is interrupt remapping invalidation
vs. MSI interrupts from VFIO. I'd really like KVM IRQFD to be able to
consume events before (and indeed instead of) userspace.

When the remapped MSI target in the KVM routing table is invalidated,
the VMM needs to *deassociate* the IRQFD and fall back to handling the
next IRQ in userspace, so it can be retranslated and a fault reported
if appropriate.

It's possible to do that by constantly registering and deregistering the
fd in the userspace poll loop, but it gets ugly, especially because the
fallback handler isn't really local to the core MSI handling.

It's much nicer if the userspace handler can just remain registered all
the time, and it just doesn't get any events when KVM steals them first.
Which is precisely what happens with posted interrupts, and this makes
it consistent. (Unless I'm missing something that prevents posted
interrupts from working when there's another listener on the eventfd?)
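
For illustration, a minimal sketch of an in-kernel consumer (the names
are hypothetical; the real KVM wiring is in patch 2/2). Because the
priority entry sits at the head of the queue and is marked
WQ_FLAG_EXCLUSIVE, its wake function runs first, and returning nonzero
from it ends the walk for the common nr_exclusive == 1 wakeup, so the
non-exclusive (e.g. epoll) waiters are never woken:

	static int my_wakeup(wait_queue_entry_t *wait, unsigned mode,
			     int sync, void *key)
	{
		/* ... consume the event here ... */

		/* Nonzero from an exclusive entry stops the wakeup
		 * walk before the non-exclusive waiters run. */
		return 1;
	}

	static void my_register(struct wait_queue_head *wqh,
				struct wait_queue_entry *wait)
	{
		init_waitqueue_func_entry(wait, my_wakeup);
		add_wait_queue_priority(wqh, wait);
	}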

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/wait.h | 12 +++++++++++-
 kernel/sched/wait.c  | 11 +++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..fe10e8570a52 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 #define WQ_FLAG_BOOKMARK	0x04
 #define WQ_FLAG_CUSTOM		0x08
 #define WQ_FLAG_DONE		0x10
+#define WQ_FLAG_PRIORITY	0x20
 
 /*
  * A single wait-queue entry structure:
@@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
 
 extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 
 static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_add(&wq_entry->entry, &wq_head->head);
+	struct list_head *head = &wq_head->head;
+	struct wait_queue_entry *wq;
+
+	list_for_each_entry(wq, &wq_head->head, entry) {
+		if (!(wq->flags & WQ_FLAG_PRIORITY))
+			break;
+		head = &wq->entry;
+	}
+	list_add(&wq_entry->entry, head);
 }
 
 /*
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 01f5d3020589..d2a84c8e88bf 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);
 
+void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+{
+	unsigned long flags;
+
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__add_wait_queue(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_wait_queue_priority);
+
 void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
-- 
2.26.2



* [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace
  2020-10-26 17:53 [RFC PATCH 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
@ 2020-10-26 17:53 ` David Woodhouse
  2020-10-27  8:01   ` Paolo Bonzini
  2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse
  1 sibling, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2020-10-26 17:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, linux-kernel, kvm

From: David Woodhouse <dwmw@amazon.co.uk>

As far as I can tell, when we use posted interrupts we silently cut off
the events from userspace, if it's listening on the same eventfd that
feeds the irqfd.

I like that behaviour. Let's do it all the time, even without posted
interrupts. It makes it much easier to handle IRQ remapping invalidation
without having to constantly add/remove the fd from the userspace poll
set. We can just leave userspace polling on it, and the bypass will...
well... bypass it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/eventfd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index d6408bb497dc..39443e2f72bf 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -191,6 +191,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	struct kvm *kvm = irqfd->kvm;
 	unsigned seq;
 	int idx;
+	int ret = 0;
 
 	if (flags & EPOLLIN) {
 		idx = srcu_read_lock(&kvm->irq_srcu);
@@ -204,6 +205,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 					      false) == -EWOULDBLOCK)
 			schedule_work(&irqfd->inject);
 		srcu_read_unlock(&kvm->irq_srcu, idx);
+		ret = 1;
 	}
 
 	if (flags & EPOLLHUP) {
@@ -227,7 +229,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
 	}
 
-	return 0;
+	return ret;
 }
 
 static void
@@ -236,7 +238,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 {
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(pt, struct kvm_kernel_irqfd, pt);
-	add_wait_queue(wqh, &irqfd->wait);
+	add_wait_queue_priority(wqh, &irqfd->wait);
 }
 
 /* Must be called under irqfds.lock */
-- 
2.26.2



* Re: [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace
  2020-10-26 17:53 ` [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
@ 2020-10-27  8:01   ` Paolo Bonzini
  2020-10-27 10:15     ` David Woodhouse
  2020-10-27 13:55     ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
  0 siblings, 2 replies; 29+ messages in thread
From: Paolo Bonzini @ 2020-10-27  8:01 UTC (permalink / raw)
  To: David Woodhouse, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel, kvm

On 26/10/20 18:53, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> As far as I can tell, when we use posted interrupts we silently cut off
> the events from userspace, if it's listening on the same eventfd that
> feeds the irqfd.
> 
> I like that behaviour. Let's do it all the time, even without posted
> interrupts. It makes it much easier to handle IRQ remapping invalidation
> without having to constantly add/remove the fd from the userspace poll
> set. We can just leave userspace polling on it, and the bypass will...
> well... bypass it.

This looks good, though of course it depends on the somewhat hackish
patch 1. However don't you need to read the eventfd as well, since
userspace will never be able to do so?

Paolo

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  virt/kvm/eventfd.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index d6408bb497dc..39443e2f72bf 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -191,6 +191,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  	struct kvm *kvm = irqfd->kvm;
>  	unsigned seq;
>  	int idx;
> +	int ret = 0;
>  
>  	if (flags & EPOLLIN) {
>  		idx = srcu_read_lock(&kvm->irq_srcu);
> @@ -204,6 +205,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  					      false) == -EWOULDBLOCK)
>  			schedule_work(&irqfd->inject);
>  		srcu_read_unlock(&kvm->irq_srcu, idx);
> +		ret = 1;
>  	}
>  
>  	if (flags & EPOLLHUP) {
> @@ -227,7 +229,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
>  		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
>  	}
>  
> -	return 0;
> +	return ret;
>  }
>  
>  static void
> @@ -236,7 +238,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
>  {
>  	struct kvm_kernel_irqfd *irqfd =
>  		container_of(pt, struct kvm_kernel_irqfd, pt);
> -	add_wait_queue(wqh, &irqfd->wait);
> +	add_wait_queue_priority(wqh, &irqfd->wait);
>  }
>  
>  /* Must be called under irqfds.lock */
> 



* Re: [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace
  2020-10-27  8:01   ` Paolo Bonzini
@ 2020-10-27 10:15     ` David Woodhouse
  2020-10-27 13:55     ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
  1 sibling, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 10:15 UTC (permalink / raw)
  To: Paolo Bonzini, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, linux-kernel, kvm


On Tue, 2020-10-27 at 09:01 +0100, Paolo Bonzini wrote:
> On 26/10/20 18:53, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > As far as I can tell, when we use posted interrupts we silently cut off
> > the events from userspace, if it's listening on the same eventfd that
> > feeds the irqfd.
> > 
> > I like that behaviour. Let's do it all the time, even without posted
> > interrupts. It makes it much easier to handle IRQ remapping invalidation
> > without having to constantly add/remove the fd from the userspace poll
> > set. We can just leave userspace polling on it, and the bypass will...
> > well... bypass it.
> 
> This looks good, though of course it depends on the somewhat hackish
> patch 1.

I thought it was quite neat :)

>  However don't you need to read the eventfd as well, since
> userspace will never be able to do so?

Yes. Although that's a separate cleanup as it was already true before
my patch. Right now, userspace needs to explicitly stop polling on the
VFIO eventfd while it's assigned as KVM IRQFD (to avoid injecting
duplicate interrupts when the kernel isn't using PI and allows events
to leak). So it isn't going to consume the events in that case either.
Nothing's really changed.

The VFIO virqfd is just the same. The count just builds up when the
kernel handles the events, and is eventually cleared by
eventfd_ctx_remove_wait_queue().

In both cases, that actually works fine because in practice the events
are raised by eventfd_signal() in the kernel, and that works even if
the count reaches ULLONG_MAX. It's just that sending further events
from *userspace* would block in that case.
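
(For reference, eventfd_signal() saturates rather than blocking; this
is a sketch of the core logic in fs/eventfd.c from memory, so treat it
as approximate:

	if (ULLONG_MAX - ctx->count < n)
		n = ULLONG_MAX - ctx->count;
	ctx->count += n;

whereas a userspace write() has to sleep until the value being written
fits without overflowing the counter.)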

Both of them theoretically want fixing — regardless of the priority
patch.

Since the wq lock is held while the wakeup functions (virqfd_wakeup or
irqfd_wakeup for VFIO/KVM respectively) run, all they really need to do
is call eventfd_ctx_do_read() to consume the events. I'll look at
whether I can find a nicer option than just exporting that.




* [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd
  2020-10-27  8:01   ` Paolo Bonzini
  2020-10-27 10:15     ` David Woodhouse
@ 2020-10-27 13:55     ` David Woodhouse
  2020-10-27 13:55       ` [PATCH 1/3] eventfd: Export eventfd_ctx_do_read() David Woodhouse
                         ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 13:55 UTC (permalink / raw)
  To: bonzini
  Cc: Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm,
	linux-kernel, linux-fsdevel

Paolo pointed out that the KVM eventfd doesn't drain the events from the
irqfd as it handles them, and just lets them accumulate. This is also
true for the VFIO virqfd used for handling acks for level-triggered IRQs.

Export eventfd_ctx_do_read() and make the wakeup functions call it as they
handle their respective events.

David Woodhouse (3):
      eventfd: Export eventfd_ctx_do_read()
      vfio/virqfd: Drain events from eventfd in virqfd_wakeup()
      kvm/eventfd: Drain events from eventfd in irqfd_wakeup()

 drivers/vfio/virqfd.c   | 3 +++
 fs/eventfd.c            | 5 ++++-
 include/linux/eventfd.h | 6 ++++++
 virt/kvm/eventfd.c      | 3 +++
 4 files changed, 16 insertions(+), 1 deletion(-)




* [PATCH 1/3] eventfd: Export eventfd_ctx_do_read()
  2020-10-27 13:55     ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
@ 2020-10-27 13:55       ` David Woodhouse
  2020-10-27 13:55       ` [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup() David Woodhouse
  2020-10-27 13:55       ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
  2 siblings, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 13:55 UTC (permalink / raw)
  To: bonzini
  Cc: Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm,
	linux-kernel, linux-fsdevel

From: David Woodhouse <dwmw@amazon.co.uk>

Where events are consumed in the kernel, for example by KVM's
irqfd_wakeup() and VFIO's virqfd_wakeup(), they currently lack a
mechanism to drain the eventfd's counter.

Since the wait queue is already locked while the wakeup functions are
invoked, all they really need to do is call eventfd_ctx_do_read().

Add a check for the lock, and export it for them.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 fs/eventfd.c            | 5 ++++-
 include/linux/eventfd.h | 6 ++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index df466ef81ddd..e265b6dd4f34 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -182,11 +182,14 @@ static __poll_t eventfd_poll(struct file *file, poll_table *wait)
 	return events;
 }
 
-static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
 {
+	lockdep_assert_held(&ctx->wqh.lock);
+
 	*cnt = (ctx->flags & EFD_SEMAPHORE) ? 1 : ctx->count;
 	ctx->count -= *cnt;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_do_read);
 
 /**
  * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue.
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index dc4fd8a6644d..fa0a524baed0 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -41,6 +41,7 @@ struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
 				  __u64 *cnt);
+void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
 
 DECLARE_PER_CPU(int, eventfd_wake_count);
 
@@ -82,6 +83,11 @@ static inline bool eventfd_signal_count(void)
 	return false;
 }
 
+static inline void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+{
+
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */
-- 
2.26.2



* [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup()
  2020-10-27 13:55     ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
  2020-10-27 13:55       ` [PATCH 1/3] eventfd: Export eventfd_ctx_do_read() David Woodhouse
@ 2020-10-27 13:55       ` David Woodhouse
  2020-11-06 23:29         ` Alex Williamson
  2020-10-27 13:55       ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
  2 siblings, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 13:55 UTC (permalink / raw)
  To: bonzini
  Cc: Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm,
	linux-kernel, linux-fsdevel

From: David Woodhouse <dwmw@amazon.co.uk>

Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 drivers/vfio/virqfd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vfio/virqfd.c b/drivers/vfio/virqfd.c
index 997cb5d0a657..414e98d82b02 100644
--- a/drivers/vfio/virqfd.c
+++ b/drivers/vfio/virqfd.c
@@ -46,6 +46,9 @@ static int virqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void
 	__poll_t flags = key_to_poll(key);
 
 	if (flags & EPOLLIN) {
+		u64 cnt;
+		eventfd_ctx_do_read(virqfd->eventfd, &cnt);
+
 		/* An event has been signaled, call function */
 		if ((!virqfd->handler ||
 		     virqfd->handler(virqfd->opaque, virqfd->data)) &&
-- 
2.26.2



* [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup()
  2020-10-27 13:55     ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
  2020-10-27 13:55       ` [PATCH 1/3] eventfd: Export eventfd_ctx_do_read() David Woodhouse
  2020-10-27 13:55       ` [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup() David Woodhouse
@ 2020-10-27 13:55       ` David Woodhouse
  2020-10-27 18:41         ` kernel test robot
                           ` (2 more replies)
  2 siblings, 3 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 13:55 UTC (permalink / raw)
  To: bonzini
  Cc: Alex Williamson, Cornelia Huck, Alexander Viro, Jens Axboe, kvm,
	linux-kernel, linux-fsdevel

From: David Woodhouse <dwmw@amazon.co.uk>

Don't allow the events to accumulate in the eventfd counter, drain them
as they are handled.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/eventfd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index d6408bb497dc..98b5cfa1d69f 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -193,6 +193,9 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	int idx;
 
 	if (flags & EPOLLIN) {
+		u64 cnt;
+		eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
+
 		idx = srcu_read_lock(&kvm->irq_srcu);
 		do {
 			seq = read_seqcount_begin(&irqfd->irq_entry_sc);
-- 
2.26.2



* [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events
  2020-10-26 17:53 [RFC PATCH 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
  2020-10-26 17:53 ` [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
@ 2020-10-27 14:39 ` David Woodhouse
  2020-10-27 14:39   ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
  2020-10-27 14:39   ` [PATCH v2 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
  1 sibling, 2 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 14:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm

When posted interrupts are in use, KVM fully bypasses the eventfd and
delivers events directly to the appropriate vCPU. Without posted
interrupts, it still uses the eventfd but it doesn't actually stop
userspace from receiving the events too. This leaves userspace having
to carefully avoid seeing the same events and injecting duplicate
interrupts to the guest.

Fix it by adding a 'priority' mode for exclusive waiters which puts them 
at the head of the list, where they can consume events before the 
non-exclusive waiters are woken.
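
For the userspace side this means the VMM can register the eventfd in
its poll set once and simply leave it there; a hypothetical sketch,
not taken from any real VMM:

	/* Registered permanently. While the irqfd is assigned, the
	 * priority waiter in KVM consumes the events and this never
	 * fires; once deassociated, userspace sees events again. */
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = evfd };
	epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev);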

v2: 
 • Drop [RFC]. This seems to be working nicely, and userspace is a lot
   cleaner without having to mess around with adding/removing the eventfd
   to its poll set. And nobody yelled at me. Yet.
 • Reword commit comments, update comment above __wake_up_common()
 • Rebase to be applied after the (only vaguely related) fix to make
   irqfd actually consume the eventfd counter too.

David Woodhouse (2):
      sched/wait: Add add_wait_queue_priority()
      kvm/eventfd: Use priority waitqueue to catch events before userspace

 include/linux/wait.h | 12 +++++++++++-
 kernel/sched/wait.c  | 17 ++++++++++++++++-
 virt/kvm/eventfd.c   |  6 ++++--
 3 files changed, 31 insertions(+), 4 deletions(-)





* [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse
@ 2020-10-27 14:39   ` David Woodhouse
  2020-10-27 19:09     ` Peter Zijlstra
  2020-10-28 14:35     ` Peter Zijlstra
  2020-10-27 14:39   ` [PATCH v2 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
  1 sibling, 2 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 14:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm

From: David Woodhouse <dwmw@amazon.co.uk>

This allows an exclusive wait_queue_entry to be added at the head of the
queue, instead of the tail as normal. Thus, it gets to consume events
first without allowing non-exclusive waiters to be woken at all.

The (first) intended use is for KVM IRQFD, which currently has
inconsistent behaviour depending on whether posted interrupts are
available or not. If they are, KVM will bypass the eventfd completely
and deliver interrupts directly to the appropriate vCPU. If not, events
are delivered through the eventfd and userspace will receive them when
polling on the eventfd.

By using add_wait_queue_priority(), KVM will be able to consistently
consume events within the kernel without accidentally exposing them
to userspace when they're supposed to be bypassed. This, in turn, means
that userspace doesn't have to jump through hoops to avoid listening
on the erroneously noisy eventfd and injecting duplicate interrupts.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/linux/wait.h | 12 +++++++++++-
 kernel/sched/wait.c  | 17 ++++++++++++++++-
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..fe10e8570a52 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
 #define WQ_FLAG_BOOKMARK	0x04
 #define WQ_FLAG_CUSTOM		0x08
 #define WQ_FLAG_DONE		0x10
+#define WQ_FLAG_PRIORITY	0x20
 
 /*
  * A single wait-queue entry structure:
@@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
 
 extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
+extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
 
 static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
-	list_add(&wq_entry->entry, &wq_head->head);
+	struct list_head *head = &wq_head->head;
+	struct wait_queue_entry *wq;
+
+	list_for_each_entry(wq, &wq_head->head, entry) {
+		if (!(wq->flags & WQ_FLAG_PRIORITY))
+			break;
+		head = &wq->entry;
+	}
+	list_add(&wq_entry->entry, head);
 }
 
 /*
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 01f5d3020589..183cc6ae68a6 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
 }
 EXPORT_SYMBOL(add_wait_queue_exclusive);
 
+void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+{
+	unsigned long flags;
+
+	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
+	spin_lock_irqsave(&wq_head->lock, flags);
+	__add_wait_queue(wq_head, wq_entry);
+	spin_unlock_irqrestore(&wq_head->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_wait_queue_priority);
+
 void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	unsigned long flags;
@@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
 /*
  * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
- * number) then we wake all the non-exclusive tasks and one exclusive task.
+ * number) then we wake that number of exclusive tasks, and potentially all
+ * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
+ * the list and any non-exclusive tasks will be woken first. A priority task
+ * may be at the head of the list, and can consume the event without any other
+ * tasks being woken.
  *
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
-- 
2.26.2



* [PATCH v2 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace
  2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse
  2020-10-27 14:39   ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
@ 2020-10-27 14:39   ` David Woodhouse
  1 sibling, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 14:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm

From: David Woodhouse <dwmw@amazon.co.uk>

When posted interrupts are available, the IRTE is modified to deliver
interrupts directly to the vCPU and nothing ever reaches userspace, if
it's listening on the same eventfd that feeds the irqfd.

I like that behaviour. Let's do it all the time, even without posted
interrupts. It makes it much easier to handle IRQ remapping invalidation
without having to constantly add/remove the fd from the userspace poll
set. We can just leave userspace polling on it, and the bypass will...
well... bypass it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/eventfd.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 87fe94355350..09cbdf2ded70 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -191,6 +191,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 	struct kvm *kvm = irqfd->kvm;
 	unsigned seq;
 	int idx;
+	int ret = 0;
 
 	if (flags & EPOLLIN) {
 		u64 cnt;
@@ -207,6 +208,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 					      false) == -EWOULDBLOCK)
 			schedule_work(&irqfd->inject);
 		srcu_read_unlock(&kvm->irq_srcu, idx);
+		ret = 1;
 	}
 
 	if (flags & EPOLLHUP) {
@@ -230,7 +232,7 @@ irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
 		spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
 	}
 
-	return 0;
+	return ret;
 }
 
 static void
@@ -239,7 +241,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 {
 	struct kvm_kernel_irqfd *irqfd =
 		container_of(pt, struct kvm_kernel_irqfd, pt);
-	add_wait_queue(wqh, &irqfd->wait);
+	add_wait_queue_priority(wqh, &irqfd->wait);
 }
 
 /* Must be called under irqfds.lock */
-- 
2.26.2



* Re: [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup()
  2020-10-27 13:55       ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
@ 2020-10-27 18:41         ` kernel test robot
  2020-10-27 21:42         ` kernel test robot
  2020-10-27 23:13         ` kernel test robot
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-10-27 18:41 UTC (permalink / raw)
  To: David Woodhouse, bonzini
  Cc: kbuild-all, Alex Williamson, Cornelia Huck, Alexander Viro,
	Jens Axboe, kvm, linux-kernel, linux-fsdevel


Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on vhost/linux-next linus/master kvm/linux-next v5.10-rc1 next-20201027]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-s021-20201027 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-56-gc09e8239-dirty
        # https://github.com/0day-ci/linux/commit/dc45dd9af28fede8f8dd29b705b90f78cf87538c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
        git checkout dc45dd9af28fede8f8dd29b705b90f78cf87538c
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/x86/kvm/../../../virt/kvm/eventfd.c: In function 'irqfd_wakeup':
>> arch/x86/kvm/../../../virt/kvm/eventfd.c:197:23: error: passing argument 1 of 'eventfd_ctx_do_read' from incompatible pointer type [-Werror=incompatible-pointer-types]
     197 |   eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
         |                       ^~~~~~~~~~~~~~~
         |                       |
         |                       struct eventfd_ctx **
   In file included from arch/x86/kvm/../../../virt/kvm/eventfd.c:21:
   include/linux/eventfd.h:44:46: note: expected 'struct eventfd_ctx *' but argument is of type 'struct eventfd_ctx **'
      44 | void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
         |                          ~~~~~~~~~~~~~~~~~~~~^~~
   cc1: some warnings being treated as errors

vim +/eventfd_ctx_do_read +197 arch/x86/kvm/../../../virt/kvm/eventfd.c

   180	
   181	/*
   182	 * Called with wqh->lock held and interrupts disabled
   183	 */
   184	static int
   185	irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
   186	{
   187		struct kvm_kernel_irqfd *irqfd =
   188			container_of(wait, struct kvm_kernel_irqfd, wait);
   189		__poll_t flags = key_to_poll(key);
   190		struct kvm_kernel_irq_routing_entry irq;
   191		struct kvm *kvm = irqfd->kvm;
   192		unsigned seq;
   193		int idx;
   194	
   195		if (flags & EPOLLIN) {
   196			u64 cnt;
 > 197			eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
   198	
   199			idx = srcu_read_lock(&kvm->irq_srcu);
   200			do {
   201				seq = read_seqcount_begin(&irqfd->irq_entry_sc);
   202				irq = irqfd->irq_entry;
   203			} while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));
   204			/* An event has been signaled, inject an interrupt */
   205			if (kvm_arch_set_irq_inatomic(&irq, kvm,
   206						      KVM_USERSPACE_IRQ_SOURCE_ID, 1,
   207						      false) == -EWOULDBLOCK)
   208				schedule_work(&irqfd->inject);
   209			srcu_read_unlock(&kvm->irq_srcu, idx);
   210		}
   211	
   212		if (flags & EPOLLHUP) {
   213			/* The eventfd is closing, detach from KVM */
   214			unsigned long iflags;
   215	
   216			spin_lock_irqsave(&kvm->irqfds.lock, iflags);
   217	
   218			/*
   219			 * We must check if someone deactivated the irqfd before
   220			 * we could acquire the irqfds.lock since the item is
   221			 * deactivated from the KVM side before it is unhooked from
   222			 * the wait-queue.  If it is already deactivated, we can
   223			 * simply return knowing the other side will cleanup for us.
   224			 * We cannot race against the irqfd going away since the
   225			 * other side is required to acquire wqh->lock, which we hold
   226			 */
   227			if (irqfd_is_active(irqfd))
   228				irqfd_deactivate(irqfd);
   229	
   230			spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
   231		}
   232	
   233		return 0;
   234	}
   235	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org



* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 14:39   ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
@ 2020-10-27 19:09     ` Peter Zijlstra
  2020-10-27 19:27       ` David Woodhouse
  2020-10-28 14:35     ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2020-10-27 19:09 UTC (permalink / raw)
  To: David Woodhouse
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov

On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> This allows an exclusive wait_queue_entry to be added at the head of the
> queue, instead of the tail as normal. Thus, it gets to consume events
> first without allowing non-exclusive waiters to be woken at all.
> 
> The (first) intended use is for KVM IRQFD, which currently has

Do you have more? You could easily special case this inside the KVM
code.

I don't _think_ the other users of __add_wait_queue() will mind the
extra branch, but what do I know.

> inconsistent behaviour depending on whether posted interrupts are
> available or not. If they are, KVM will bypass the eventfd completely
> and deliver interrupts directly to the appropriate vCPU. If not, events
> are delivered through the eventfd and userspace will receive them when
> polling on the eventfd.
> 
> By using add_wait_queue_priority(), KVM will be able to consistently
> consume events within the kernel without accidentally exposing them
> to userspace when they're supposed to be bypassed. This, in turn, means
> that userspace doesn't have to jump through hoops to avoid listening
> on the erroneously noisy eventfd and injecting duplicate interrupts.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  include/linux/wait.h | 12 +++++++++++-
>  kernel/sched/wait.c  | 17 ++++++++++++++++-
>  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 27fb99cfeb02..fe10e8570a52 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
>  #define WQ_FLAG_BOOKMARK	0x04
>  #define WQ_FLAG_CUSTOM		0x08
>  #define WQ_FLAG_DONE		0x10
> +#define WQ_FLAG_PRIORITY	0x20
>  
>  /*
>   * A single wait-queue entry structure:
> @@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
>  
>  extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
> +extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  
>  static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
> -	list_add(&wq_entry->entry, &wq_head->head);
> +	struct list_head *head = &wq_head->head;
> +	struct wait_queue_entry *wq;
> +
> +	list_for_each_entry(wq, &wq_head->head, entry) {
> +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> +			break;
> +		head = &wq->entry;
> +	}
> +	list_add(&wq_entry->entry, head);
>  }

So you're adding the PRIORITY things to the head of the list and need
the PRIORITY flag to keep them in FIFO order there, right?

While looking at this I found that weird __add_wait_queue_exclusive()
which is used by fs/eventpoll.c and does something similar, except it
doesn't keep the FIFO order.

The Changelog doesn't state how important this property is to you.

>  /*
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 01f5d3020589..183cc6ae68a6 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
>  }
>  EXPORT_SYMBOL(add_wait_queue_exclusive);
>  
> +void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> +{
> +	unsigned long flags;
> +
> +	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
> +	spin_lock_irqsave(&wq_head->lock, flags);
> +	__add_wait_queue(wq_head, wq_entry);
> +	spin_unlock_irqrestore(&wq_head->lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(add_wait_queue_priority);
> +
>  void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
>  	unsigned long flags;
> @@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
>  /*
>   * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
>   * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
> - * number) then we wake all the non-exclusive tasks and one exclusive task.
> + * number) then we wake that number of exclusive tasks, and potentially all
> + * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
> + * the list and any non-exclusive tasks will be woken first. A priority task
> + * may be at the head of the list, and can consume the event without any other
> + * tasks being woken.
>   *
>   * There are circumstances in which we can try to wake a task which has already
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
> -- 
> 2.26.2
> 


* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 19:09     ` Peter Zijlstra
@ 2020-10-27 19:27       ` David Woodhouse
  2020-10-27 20:30         ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov


On Tue, 2020-10-27 at 20:09 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This allows an exclusive wait_queue_entry to be added at the head of the
> > queue, instead of the tail as normal. Thus, it gets to consume events
> > first without allowing non-exclusive waiters to be woken at all.
> > 
> > The (first) intended use is for KVM IRQFD, which currently has
> 
> Do you have more? You could easily special case this inside the KVM
> code.

I don't have more right now. What is the easy special case that you
see?

> I don't _think_ the other users of __add_wait_queue() will mind the
> extra branch, but what do I know.

I suppose we could add an unlikely() in there. It seemed like premature
optimisation.

> >  static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> >  {
> > -	list_add(&wq_entry->entry, &wq_head->head);
> > +	struct list_head *head = &wq_head->head;
> > +	struct wait_queue_entry *wq;
> > +
> > +	list_for_each_entry(wq, &wq_head->head, entry) {
> > +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> > +			break;
> > +		head = &wq->entry;
> > +	}
> > +	list_add(&wq_entry->entry, head);
> >  }
> 
> So you're adding the PRIORITY things to the head of the list and need
> the PRIORITY flag to keep them in FIFO order there, right?

No, I don't care about the order of priority entries; there will
typically be only one of them; that's the point. (I'd have used the
word 'exclusive' if that wasn't already in use for something that...
well... isn't.)

I only care that the priority entries come *before* the bog-standard
non-exclusive entries (like ep_poll_callback).

The priority items end up getting added in FIFO order purely by chance,
because it was simpler to use the same insertion flow for both priority
and normal non-exclusive entries instead of making a new case. So they
all get inserted behind any existing priority entries.

> While looking at this I found that weird __add_wait_queue_exclusive()
> which is used by fs/eventpoll.c and does something similar, except it
> doesn't keep the FIFO order.

It does, doesn't it? Except those so-called "exclusive" entries end up
in FIFO order amongst themselves at the *tail* of the queue, to be
woken up only after all the other entries before them *haven't* been
excluded.

> The Changelog doesn't state how important this property is to you.

Because it isn't :)

The ordering is:

 { PRIORITY }*  { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*

I care that PRIORITY comes before the others, because I want to
actually exclude the others. Especially the "non-exclusive" ones, which
the 'exclusive' ones don't actually exclude.

I absolutely don't care about ordering *within* the set of PRIORITY
entries, since as I said I expect there to be only one.
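
To spell out the intended wakeup order (a hypothetical queue, woken
with a plain wake_up(), i.e. nr_exclusive == 1):

	head -> [P priority] -> [N1 poll] -> [N2 poll] -> [E exclusive]

P's wake function runs first; if it returns nonzero the walk stops and
N1, N2 and E are never woken. If it returns zero, N1 and N2 are woken
as usual and E then terminates the walk.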



* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 19:27       ` David Woodhouse
@ 2020-10-27 20:30         ` Peter Zijlstra
  2020-10-27 20:49           ` David Woodhouse
  2020-10-27 21:32           ` David Woodhouse
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2020-10-27 20:30 UTC (permalink / raw)
  To: David Woodhouse
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov

On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:

> > While looking at this I found that weird __add_wait_queue_exclusive()
> > which is used by fs/eventpoll.c and does something similar, except it
> > doesn't keep the FIFO order.
> 
> It does, doesn't it? Except those so-called "exclusive" entries end up
> in FIFO order amongst themselves at the *tail* of the queue, to be
> woken up only after all the other entries before them *haven't* been
> excluded.

__add_wait_queue_exclusive() uses __add_wait_queue() which does
list_add(). It does _not_ add at the tail like normal exclusive users,
and there is exactly _1_ user in tree that does this.

I'm not exactly sure how this happened, but:

  add_wait_queue_exclusive()

and

  __add_wait_queue_exclusive()

are not related :-(

> > The Changelog doesn't state how important this property is to you.
> 
> Because it isn't :)
> 
> The ordering is:
> 
>  { PRIORITY }*  { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*
> 
> I care that PRIORITY comes before the others, because I want to
> actually exclude the others. Especially the "non-exclusive" ones, which
> the 'exclusive' ones don't actually exclude.
> 
> I absolutely don't care about ordering *within* the set of PRIORITY
> entries, since as I said I expect there to be only one.

Then you could arguably do something like:

	spin_lock_irqsave(&wq_head->lock, flags);
	__add_wait_queue_exclusive(wq_head, wq_entry);
	spin_unlock_irqrestore(&wq_head->lock, flags);

and leave it at that.

But now I'm itching to fix that horrible naming... tomorrow perhaps.



* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 20:30         ` Peter Zijlstra
@ 2020-10-27 20:49           ` David Woodhouse
  2020-10-27 21:32           ` David Woodhouse
  1 sibling, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov


On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> 
> > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > which is used by fs/eventpoll.c and does something similar, except it
> > > doesn't keep the FIFO order.
> > 
> > It does, doesn't it? Except those so-called "exclusive" entries end up
> > in FIFO order amongst themselves at the *tail* of the queue, to be
> > woken up only after all the other entries before them *haven't* been
> > excluded.
> 
> __add_wait_queue_exclusive() uses __add_wait_queue() which does
> list_add(). It does _not_ add at the tail like normal exclusive users,
> and there is exactly _1_ user in tree that does this.
> 
> I'm not exactly sure how this happened, but:
> 
>   add_wait_queue_exclusive()
> 
> and
> 
>   __add_wait_queue_exclusive()
> 
> are not related :-(

Oh, that is *particularly* special.

It sounds like the __add_wait_queue_exclusive() version is a half-baked 
attempt at doing what I'm doing here, except....

> > > The Changelog doesn't state how important this property is to you.
> > 
> > Because it isn't :)
> > 
> > The ordering is:
> > 
> >  { PRIORITY }*  { NON-EXCLUSIVE }* { EXCLUSIVE(sic) }*
> > 
> > I care that PRIORITY comes before the others, because I want to
> > actually exclude the others. Especially the "non-exclusive" ones, which
> > the 'exclusive' ones don't actually exclude.
> > 
> > I absolutely don't care about ordering *within* the set of PRIORITY
> > entries, since as I said I expect there to be only one.
> 
> Then you could arguably do something like:
> 
> 	spin_lock_irqsave(&wq_head->lock, flags);
> 	__add_wait_queue_exclusive(wq_head, wq_entry);
> 	spin_unlock_irqrestore(&wq_head->lock, flags);
> 
> and leave it at that.

.. the problem with that is that other waiters *can* end up on the
queue before it, if they are added later. I don't know if the existing
user (ep_poll) cares, but I do.
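
Concretely (a hypothetical sequence): with a bare list_add() at the
head, a later non-exclusive add, e.g. from epoll's
ep_ptable_queue_proc(), also lands at the head and ends up in front:

	add irqfd:    head -> [irqfd]
	add epoll A:  head -> [A] -> [irqfd]

With WQ_FLAG_PRIORITY, the modified __add_wait_queue() skips past the
priority entries, so the irqfd entry stays first regardless of the
order of addition.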

> But now I'm itching to fix that horrible naming... tomorrow perhaps.

:)



* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 20:30         ` Peter Zijlstra
  2020-10-27 20:49           ` David Woodhouse
@ 2020-10-27 21:32           ` David Woodhouse
  2020-10-28 14:20             ` Peter Zijlstra
  1 sibling, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2020-10-27 21:32 UTC (permalink / raw)
  To: Peter Zijlstra, Davide Libenzi, Davi E. M. Arnaut, davi
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov


On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> 
> > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > which is used by fs/eventpoll.c and does something similar, except it
> > > doesn't keep the FIFO order.
> > 
> > It does, doesn't it? Except those so-called "exclusive" entries end up
> > in FIFO order amongst themselves at the *tail* of the queue, to be
> > woken up only after all the other entries before them *haven't* been
> > excluded.
> 
> __add_wait_queue_exclusive() uses __add_wait_queue() which does
> list_add(). It does _not_ add at the tail like normal exclusive users,
> and there is exactly _1_ user in tree that does this.
> 
> I'm not exactly sure how this happened, but:
> 
>   add_wait_queue_exclusive()
> 
> and
> 
>   __add_wait_queue_exclusive()
> 
> are not related :-(

I think that goes all the way back to here:

https://lkml.org/lkml/2007/5/4/530

It was rounded up in commit d47de16c72 and subsequently "cleaned up"
into an inline in wait.h, but I don't think there was ever a reason for
it to be added to the head of the list instead of the tail.

So I think we can reasonably make __add_wait_queue_exclusive() do
precisely the same thing as add_wait_queue_exclusive() does (modulo
locking).

And then potentially rename them both to something that isn't quite
such a lie. And give me the one I want that *does* actually exclude
other waiters :)



* Re: [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup()
  2020-10-27 13:55       ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
  2020-10-27 18:41         ` kernel test robot
@ 2020-10-27 21:42         ` kernel test robot
  2020-10-27 23:13         ` kernel test robot
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-10-27 21:42 UTC (permalink / raw)
  To: David Woodhouse, bonzini
  Cc: kbuild-all, clang-built-linux, Alex Williamson, Cornelia Huck,
	Alexander Viro, Jens Axboe, kvm, linux-kernel, linux-fsdevel


Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on vhost/linux-next linus/master kvm/linux-next v5.10-rc1 next-20201027]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
base:   https://github.com/awilliam/linux-vfio.git next
config: s390-randconfig-r023-20201027 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project f2c25c70791de95d2466e09b5b58fc37f6ccd7a4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install s390 cross compiling tool for clang build
        # apt-get install binutils-s390x-linux-gnu
        # https://github.com/0day-ci/linux/commit/dc45dd9af28fede8f8dd29b705b90f78cf87538c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
        git checkout dc45dd9af28fede8f8dd29b705b90f78cf87538c
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/s390/include/asm/kvm_para.h:25:
   In file included from arch/s390/include/asm/diag.h:12:
   In file included from include/linux/if_ether.h:19:
   In file included from include/linux/skbuff.h:31:
   In file included from include/linux/dma-mapping.h:11:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:72:
   include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
                                                             ^
   include/uapi/linux/swab.h:119:21: note: expanded from macro '__swab32'
           ___constant_swab32(x) :                 \
                              ^
   include/uapi/linux/swab.h:21:12: note: expanded from macro '___constant_swab32'
           (((__u32)(x) & (__u32)0x00ff0000UL) >>  8) |            \
                     ^
   In file included from arch/s390/kvm/../../../virt/kvm/eventfd.c:12:
   In file included from include/linux/kvm_host.h:32:
   In file included from include/linux/kvm_para.h:5:
   In file included from include/uapi/linux/kvm_para.h:36:
   In file included from arch/s390/include/asm/kvm_para.h:25:
   In file included from arch/s390/include/asm/diag.h:12:
   In file included from include/linux/if_ether.h:19:
   In file included from include/linux/skbuff.h:31:
   In file included from include/linux/dma-mapping.h:11:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:72:
   include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
                                                             ^
   include/uapi/linux/swab.h:119:21: note: expanded from macro '__swab32'
           ___constant_swab32(x) :                 \
                              ^
   include/uapi/linux/swab.h:22:12: note: expanded from macro '___constant_swab32'
           (((__u32)(x) & (__u32)0xff000000UL) >> 24)))
                     ^
   In file included from arch/s390/kvm/../../../virt/kvm/eventfd.c:12:
   In file included from include/linux/kvm_host.h:32:
   In file included from include/linux/kvm_para.h:5:
   In file included from include/uapi/linux/kvm_para.h:36:
   In file included from arch/s390/include/asm/kvm_para.h:25:
   In file included from arch/s390/include/asm/diag.h:12:
   In file included from include/linux/if_ether.h:19:
   In file included from include/linux/skbuff.h:31:
   In file included from include/linux/dma-mapping.h:11:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:72:
   include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
                                                             ^
   include/uapi/linux/swab.h:120:12: note: expanded from macro '__swab32'
           __fswab32(x))
                     ^
   In file included from arch/s390/kvm/../../../virt/kvm/eventfd.c:12:
   In file included from include/linux/kvm_host.h:32:
   In file included from include/linux/kvm_para.h:5:
   In file included from include/uapi/linux/kvm_para.h:36:
   In file included from arch/s390/include/asm/kvm_para.h:25:
   In file included from arch/s390/include/asm/diag.h:12:
   In file included from include/linux/if_ether.h:19:
   In file included from include/linux/skbuff.h:31:
   In file included from include/linux/dma-mapping.h:11:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:72:
   include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:609:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsb(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:617:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsw(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:625:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsl(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:634:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesb(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:643:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesw(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:652:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesl(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
>> arch/s390/kvm/../../../virt/kvm/eventfd.c:197:23: error: incompatible pointer types passing 'struct eventfd_ctx **' to parameter of type 'struct eventfd_ctx *'; remove & [-Werror,-Wincompatible-pointer-types]
                   eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
                                       ^~~~~~~~~~~~~~~
   include/linux/eventfd.h:44:46: note: passing argument to parameter 'ctx' here
   void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
                                                ^
   20 warnings and 1 error generated.

vim +197 arch/s390/kvm/../../../virt/kvm/eventfd.c

   180	
   181	/*
   182	 * Called with wqh->lock held and interrupts disabled
   183	 */
   184	static int
   185	irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
   186	{
   187		struct kvm_kernel_irqfd *irqfd =
   188			container_of(wait, struct kvm_kernel_irqfd, wait);
   189		__poll_t flags = key_to_poll(key);
   190		struct kvm_kernel_irq_routing_entry irq;
   191		struct kvm *kvm = irqfd->kvm;
   192		unsigned seq;
   193		int idx;
   194	
   195		if (flags & EPOLLIN) {
   196			u64 cnt;
 > 197			eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
   198	
   199			idx = srcu_read_lock(&kvm->irq_srcu);
   200			do {
   201				seq = read_seqcount_begin(&irqfd->irq_entry_sc);
   202				irq = irqfd->irq_entry;
   203			} while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));
   204			/* An event has been signaled, inject an interrupt */
   205			if (kvm_arch_set_irq_inatomic(&irq, kvm,
   206						      KVM_USERSPACE_IRQ_SOURCE_ID, 1,
   207						      false) == -EWOULDBLOCK)
   208				schedule_work(&irqfd->inject);
   209			srcu_read_unlock(&kvm->irq_srcu, idx);
   210		}
   211	
   212		if (flags & EPOLLHUP) {
   213			/* The eventfd is closing, detach from KVM */
   214			unsigned long iflags;
   215	
   216			spin_lock_irqsave(&kvm->irqfds.lock, iflags);
   217	
   218			/*
   219			 * We must check if someone deactivated the irqfd before
   220			 * we could acquire the irqfds.lock since the item is
   221			 * deactivated from the KVM side before it is unhooked from
   222			 * the wait-queue.  If it is already deactivated, we can
   223			 * simply return knowing the other side will cleanup for us.
   224			 * We cannot race against the irqfd going away since the
   225			 * other side is required to acquire wqh->lock, which we hold
   226			 */
   227			if (irqfd_is_active(irqfd))
   228				irqfd_deactivate(irqfd);
   229	
   230			spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
   231		}
   232	
   233		return 0;
   234	}
   235	
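
The error itself is mechanical: irqfd->eventfd is already a
struct eventfd_ctx *, so taking its address yields a
struct eventfd_ctx ** and trips -Wincompatible-pointer-types. A minimal
sketch of what the call presumably wants to be (passing the context
pointer itself, not its address):

	u64 cnt;

	/* irqfd->eventfd is the eventfd_ctx pointer; pass it directly */
	eventfd_ctx_do_read(irqfd->eventfd, &cnt);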

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26571 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup()
  2020-10-27 13:55       ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
  2020-10-27 18:41         ` kernel test robot
  2020-10-27 21:42         ` kernel test robot
@ 2020-10-27 23:13         ` kernel test robot
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-10-27 23:13 UTC (permalink / raw)
  To: David Woodhouse, bonzini
  Cc: kbuild-all, clang-built-linux, Alex Williamson, Cornelia Huck,
	Alexander Viro, Jens Axboe, kvm, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 4377 bytes --]

Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on vhost/linux-next linus/master kvm/linux-next v5.10-rc1 next-20201027]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-a004-20201026 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project f2c25c70791de95d2466e09b5b58fc37f6ccd7a4)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # https://github.com/0day-ci/linux/commit/dc45dd9af28fede8f8dd29b705b90f78cf87538c
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review David-Woodhouse/Allow-in-kernel-consumers-to-drain-events-from-eventfd/20201027-215658
        git checkout dc45dd9af28fede8f8dd29b705b90f78cf87538c
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/x86/kvm/../../../virt/kvm/eventfd.c:197:23: error: incompatible pointer types passing 'struct eventfd_ctx **' to parameter of type 'struct eventfd_ctx *'; remove & [-Werror,-Wincompatible-pointer-types]
                   eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
                                       ^~~~~~~~~~~~~~~
   include/linux/eventfd.h:44:46: note: passing argument to parameter 'ctx' here
   void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
                                                ^
   1 error generated.

vim +197 arch/x86/kvm/../../../virt/kvm/eventfd.c

   180	
   181	/*
   182	 * Called with wqh->lock held and interrupts disabled
   183	 */
   184	static int
   185	irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
   186	{
   187		struct kvm_kernel_irqfd *irqfd =
   188			container_of(wait, struct kvm_kernel_irqfd, wait);
   189		__poll_t flags = key_to_poll(key);
   190		struct kvm_kernel_irq_routing_entry irq;
   191		struct kvm *kvm = irqfd->kvm;
   192		unsigned seq;
   193		int idx;
   194	
   195		if (flags & EPOLLIN) {
   196			u64 cnt;
 > 197			eventfd_ctx_do_read(&irqfd->eventfd, &cnt);
   198	
   199			idx = srcu_read_lock(&kvm->irq_srcu);
   200			do {
   201				seq = read_seqcount_begin(&irqfd->irq_entry_sc);
   202				irq = irqfd->irq_entry;
   203			} while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));
   204			/* An event has been signaled, inject an interrupt */
   205			if (kvm_arch_set_irq_inatomic(&irq, kvm,
   206						      KVM_USERSPACE_IRQ_SOURCE_ID, 1,
   207						      false) == -EWOULDBLOCK)
   208				schedule_work(&irqfd->inject);
   209			srcu_read_unlock(&kvm->irq_srcu, idx);
   210		}
   211	
   212		if (flags & EPOLLHUP) {
   213			/* The eventfd is closing, detach from KVM */
   214			unsigned long iflags;
   215	
   216			spin_lock_irqsave(&kvm->irqfds.lock, iflags);
   217	
   218			/*
   219			 * We must check if someone deactivated the irqfd before
   220			 * we could acquire the irqfds.lock since the item is
   221			 * deactivated from the KVM side before it is unhooked from
   222			 * the wait-queue.  If it is already deactivated, we can
   223			 * simply return knowing the other side will cleanup for us.
   224			 * We cannot race against the irqfd going away since the
   225			 * other side is required to acquire wqh->lock, which we hold
   226			 */
   227			if (irqfd_is_active(irqfd))
   228				irqfd_deactivate(irqfd);
   229	
   230			spin_unlock_irqrestore(&kvm->irqfds.lock, iflags);
   231		}
   232	
   233		return 0;
   234	}
   235	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 40494 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 21:32           ` David Woodhouse
@ 2020-10-28 14:20             ` Peter Zijlstra
  2020-10-28 14:44               ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2020-10-28 14:20 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Davide Libenzi, Davi E. M. Arnaut, davi, linux-kernel,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm, Oleg Nesterov

On Tue, Oct 27, 2020 at 09:32:11PM +0000, David Woodhouse wrote:
> On Tue, 2020-10-27 at 21:30 +0100, Peter Zijlstra wrote:
> > On Tue, Oct 27, 2020 at 07:27:59PM +0000, David Woodhouse wrote:
> > 
> > > > While looking at this I found that weird __add_wait_queue_exclusive()
> > > > which is used by fs/eventpoll.c and does something similar, except it
> > > > doesn't keep the FIFO order.
> > > 
> > > It does, doesn't it? Except those so-called "exclusive" entries end up
> > > in FIFO order amongst themselves at the *tail* of the queue, to be
> > > woken up only after all the other entries before them *haven't* been
> > > excluded.
> > 
> > __add_wait_queue_exclusive() uses __add_wait_queue() which does
> > list_add(). It does _not_ add at the tail like normal exclusive users,
> > and there is exactly _1_ user in tree that does this.
> > 
> > I'm not exactly sure how this happened, but:
> > 
> >   add_wait_queue_exclusive()
> > 
> > and
> > 
> >   __add_wait_queue_exclusive()
> > 
> > are not related :-(
> 
> I think that goes all the way back to here:
> 
> https://lkml.org/lkml/2007/5/4/530
> 
> It was rounded up in commit d47de16c72 and subsequently "cleaned up"
> into an inline in wait.h, but I don't think there was ever a reason for
> it to be added to the head of the list instead of the tail.

Maybe; I'm not sure I can tell in a hurry. I've opted to undo the above
'cleanups'.

> So I think we can reasonably make __add_wait_queue_exclusive() do
> precisely the same thing as add_wait_queue_exclusive() does (modulo
> locking).

Aye, see below.

> And then potentially rename them both to something that isn't quite
> such a lie. And give me the one I want that *does* actually exclude
> other waiters :)

I don't think we want to do that; people are very much used to the
current semantics.
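
For reference, stripped of locking, the two helpers reduce to this
(sketch only):

	/* add_wait_queue_exclusive(): tail insertion; exclusive waiters
	 * queue FIFO behind every non-exclusive entry. */
	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
	list_add_tail(&wq_entry->entry, &wq_head->head);

	/* __add_wait_queue_exclusive() as it stands: head insertion,
	 * LIFO; the patch below makes it do tail insertion too. */
	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
	list_add(&wq_entry->entry, &wq_head->head);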

I also very much want to do:
s/__add_wait_queue_entry_tail/__add_wait_queue_tail/ on top of all this.

Anyway, I'll agree to your patch. How do we route this? Shall I take the
waitqueue thing and stick it in a topic branch for Paolo so he can then
merge that and the kvm bits on top into the KVM tree?

---
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4df61129566d..a2a7e1e339f6 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1895,10 +1895,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		 */
 		eavail = ep_events_available(ep);
 		if (!eavail) {
-			if (signal_pending(current))
+			if (signal_pending(current)) {
 				res = -EINTR;
-			else
-				__add_wait_queue_exclusive(&ep->wq, &wait);
+			} else {
+				wait.flags |= WQ_FLAG_EXCLUSIVE;
+				__add_wait_queue(&ep->wq, &wait);
+			}
 		}
 		write_unlock_irq(&ep->lock);
 
diff --git a/fs/orangefs/orangefs-bufmap.c b/fs/orangefs/orangefs-bufmap.c
index 538e839590ef..8cac3589f365 100644
--- a/fs/orangefs/orangefs-bufmap.c
+++ b/fs/orangefs/orangefs-bufmap.c
@@ -86,7 +86,7 @@ static int wait_for_free(struct slot_map *m)
 	do {
 		long n = left, t;
 		if (likely(list_empty(&wait.entry)))
-			__add_wait_queue_entry_tail_exclusive(&m->q, &wait);
+			__add_wait_queue_exclusive(&m->q, &wait);
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		if (m->c > 0)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 27fb99cfeb02..4b8c4ece13f7 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -171,23 +171,13 @@ static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait
 	list_add(&wq_entry->entry, &wq_head->head);
 }
 
-/*
- * Used for wake-one threads:
- */
-static inline void
-__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
-{
-	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
-	__add_wait_queue(wq_head, wq_entry);
-}
-
 static inline void __add_wait_queue_entry_tail(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	list_add_tail(&wq_entry->entry, &wq_head->head);
 }
 
 static inline void
-__add_wait_queue_entry_tail_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
+__add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
 {
 	wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
 	__add_wait_queue_entry_tail(wq_head, wq_entry);


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-27 14:39   ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
  2020-10-27 19:09     ` Peter Zijlstra
@ 2020-10-28 14:35     ` Peter Zijlstra
  2020-11-04  9:35       ` David Woodhouse
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2020-10-28 14:35 UTC (permalink / raw)
  To: David Woodhouse
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm

On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> This allows an exclusive wait_queue_entry to be added at the head of the
> queue, instead of the tail as normal. Thus, it gets to consume events
> first without allowing non-exclusive waiters to be woken at all.
> 
> The (first) intended use is for KVM IRQFD, which currently has
> inconsistent behaviour depending on whether posted interrupts are
> available or not. If they are, KVM will bypass the eventfd completely
> and deliver interrupts directly to the appropriate vCPU. If not, events
> are delivered through the eventfd and userspace will receive them when
> polling on the eventfd.
> 
> By using add_wait_queue_priority(), KVM will be able to consistently
> consume events within the kernel without accidentally exposing them
> to userspace when they're supposed to be bypassed. This, in turn, means
> that userspace doesn't have to jump through hoops to avoid listening
> on the erroneously noisy eventfd and injecting duplicate interrupts.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---
>  include/linux/wait.h | 12 +++++++++++-
>  kernel/sched/wait.c  | 17 ++++++++++++++++-
>  2 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 27fb99cfeb02..fe10e8570a52 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -22,6 +22,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int
>  #define WQ_FLAG_BOOKMARK	0x04
>  #define WQ_FLAG_CUSTOM		0x08
>  #define WQ_FLAG_DONE		0x10
> +#define WQ_FLAG_PRIORITY	0x20
>  
>  /*
>   * A single wait-queue entry structure:
> @@ -164,11 +165,20 @@ static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
>  
>  extern void add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
> +extern void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  extern void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
>  
>  static inline void __add_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
> -	list_add(&wq_entry->entry, &wq_head->head);
> +	struct list_head *head = &wq_head->head;
> +	struct wait_queue_entry *wq;
> +
> +	list_for_each_entry(wq, &wq_head->head, entry) {
> +		if (!(wq->flags & WQ_FLAG_PRIORITY))
> +			break;
> +		head = &wq->entry;
> +	}
> +	list_add(&wq_entry->entry, head);
>  }
>  
>  /*
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 01f5d3020589..183cc6ae68a6 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -37,6 +37,17 @@ void add_wait_queue_exclusive(struct wait_queue_head *wq_head, struct wait_queue
>  }
>  EXPORT_SYMBOL(add_wait_queue_exclusive);
>  
> +void add_wait_queue_priority(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
> +{
> +	unsigned long flags;
> +
> +	wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;
> +	spin_lock_irqsave(&wq_head->lock, flags);
> +	__add_wait_queue(wq_head, wq_entry);
> +	spin_unlock_irqrestore(&wq_head->lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(add_wait_queue_priority);
> +
>  void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
>  {
>  	unsigned long flags;
> @@ -57,7 +68,11 @@ EXPORT_SYMBOL(remove_wait_queue);
>  /*
>   * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
>   * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
> - * number) then we wake all the non-exclusive tasks and one exclusive task.
> + * number) then we wake that number of exclusive tasks, and potentially all
> + * the non-exclusive tasks. Normally, exclusive tasks will be at the end of
> + * the list and any non-exclusive tasks will be woken first. A priority task
> + * may be at the head of the list, and can consume the event without any other
> + * tasks being woken.
>   *
>   * There are circumstances in which we can try to wake a task which has already
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
> -- 
> 2.26.2
> 
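
As a usage sketch (hypothetical names; KVM's irqfd actually registers
its entry via a poll_table callback rather than calling this directly):

	static wait_queue_entry_t my_wait;

	static int my_wakeup(wait_queue_entry_t *wait, unsigned mode,
			     int sync, void *key)
	{
		/* Runs ahead of all non-priority waiters. Returning
		 * nonzero from an entry with WQ_FLAG_EXCLUSIVE set
		 * consumes the exclusive wakeup, so nothing behind
		 * this entry on the queue is woken at all. */
		return 1;
	}

	/* wqh is the target wait_queue_head, e.g. the eventfd's */
	init_waitqueue_func_entry(&my_wait, my_wakeup);
	add_wait_queue_priority(wqh, &my_wait);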

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-28 14:20             ` Peter Zijlstra
@ 2020-10-28 14:44               ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2020-10-28 14:44 UTC (permalink / raw)
  To: Peter Zijlstra, David Woodhouse
  Cc: Davide Libenzi, Davi E. M. Arnaut, davi, linux-kernel,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, kvm, Oleg Nesterov

On 28/10/20 15:20, Peter Zijlstra wrote:
> Shall I take the waitqueue thing and stick it in a topic branch for
> Paolo so he can then merge that and the kvm bits on top into the KVM
> tree?

Topic branches are always the best solution. :)

Paolo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-10-28 14:35     ` Peter Zijlstra
@ 2020-11-04  9:35       ` David Woodhouse
  2020-11-04 11:25         ` Paolo Bonzini
  2020-11-06 10:17         ` Paolo Bonzini
  0 siblings, 2 replies; 29+ messages in thread
From: David Woodhouse @ 2020-11-04  9:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Paolo Bonzini, kvm

[-- Attachment #1: Type: text/plain, Size: 1417 bytes --]

On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > This allows an exclusive wait_queue_entry to be added at the head of the
> > queue, instead of the tail as normal. Thus, it gets to consume events
> > first without allowing non-exclusive waiters to be woken at all.
> > 
> > The (first) intended use is for KVM IRQFD, which currently has
> > inconsistent behaviour depending on whether posted interrupts are
> > available or not. If they are, KVM will bypass the eventfd completely
> > and deliver interrupts directly to the appropriate vCPU. If not, events
> > are delivered through the eventfd and userspace will receive them when
> > polling on the eventfd.
> > 
> > By using add_wait_queue_priority(), KVM will be able to consistently
> > consume events within the kernel without accidentally exposing them
> > to userspace when they're supposed to be bypassed. This, in turn, means
> > that userspace doesn't have to jump through hoops to avoid listening
> > on the erroneously noisy eventfd and injecting duplicate interrupts.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks. Paolo, the conclusion was that you were going to take this set
through the KVM tree, wasn't it?


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-11-04  9:35       ` David Woodhouse
@ 2020-11-04 11:25         ` Paolo Bonzini
  2020-11-06 10:17         ` Paolo Bonzini
  1 sibling, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2020-11-04 11:25 UTC (permalink / raw)
  To: David Woodhouse, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, kvm

On 04/11/20 10:35, David Woodhouse wrote:
> On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> This allows an exclusive wait_queue_entry to be added at the head of the
>>> queue, instead of the tail as normal. Thus, it gets to consume events
>>> first without allowing non-exclusive waiters to be woken at all.
>>>
>>> The (first) intended use is for KVM IRQFD, which currently has
>>> inconsistent behaviour depending on whether posted interrupts are
>>> available or not. If they are, KVM will bypass the eventfd completely
>>> and deliver interrupts directly to the appropriate vCPU. If not, events
>>> are delivered through the eventfd and userspace will receive them when
>>> polling on the eventfd.
>>>
>>> By using add_wait_queue_priority(), KVM will be able to consistently
>>> consume events within the kernel without accidentally exposing them
>>> to userspace when they're supposed to be bypassed. This, in turn, means
>>> that userspace doesn't have to jump through hoops to avoid listening
>>> on the erroneously noisy eventfd and injecting duplicate interrupts.
>>>
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Thanks. Paolo, the conclusion was that you were going to take this set
> through the KVM tree, wasn't it?
> 

Yes.

Paolo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-11-04  9:35       ` David Woodhouse
  2020-11-04 11:25         ` Paolo Bonzini
@ 2020-11-06 10:17         ` Paolo Bonzini
  2020-11-06 16:32           ` Alex Williamson
  1 sibling, 1 reply; 29+ messages in thread
From: Paolo Bonzini @ 2020-11-06 10:17 UTC (permalink / raw)
  To: David Woodhouse, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, kvm, Alex Williamson

On 04/11/20 10:35, David Woodhouse wrote:
> On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:
>> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> This allows an exclusive wait_queue_entry to be added at the head of the
>>> queue, instead of the tail as normal. Thus, it gets to consume events
>>> first without allowing non-exclusive waiters to be woken at all.
>>>
>>> The (first) intended use is for KVM IRQFD, which currently has
>>> inconsistent behaviour depending on whether posted interrupts are
>>> available or not. If they are, KVM will bypass the eventfd completely
>>> and deliver interrupts directly to the appropriate vCPU. If not, events
>>> are delivered through the eventfd and userspace will receive them when
>>> polling on the eventfd.
>>>
>>> By using add_wait_queue_priority(), KVM will be able to consistently
>>> consume events within the kernel without accidentally exposing them
>>> to userspace when they're supposed to be bypassed. This, in turn, means
>>> that userspace doesn't have to jump through hoops to avoid listening
>>> on the erroneously noisy eventfd and injecting duplicate interrupts.
>>>
>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> 
> Thanks. Paolo, the conclusion was that you were going to take this set
> through the KVM tree, wasn't it?
> 

Queued, except for patch 2/3 in the eventfd series which Alex hasn't 
reviewed/acked yet.

Paolo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-11-06 10:17         ` Paolo Bonzini
@ 2020-11-06 16:32           ` Alex Williamson
  2020-11-06 17:18             ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2020-11-06 16:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Woodhouse, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, kvm

On Fri, 6 Nov 2020 11:17:21 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 04/11/20 10:35, David Woodhouse wrote:
> > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:  
> >> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:  
> >>> From: David Woodhouse <dwmw@amazon.co.uk>
> >>>
> >>> This allows an exclusive wait_queue_entry to be added at the head of the
> >>> queue, instead of the tail as normal. Thus, it gets to consume events
> >>> first without allowing non-exclusive waiters to be woken at all.
> >>>
> >>> The (first) intended use is for KVM IRQFD, which currently has
> >>> inconsistent behaviour depending on whether posted interrupts are
> >>> available or not. If they are, KVM will bypass the eventfd completely
> >>> and deliver interrupts directly to the appropriate vCPU. If not, events
> >>> are delivered through the eventfd and userspace will receive them when
> >>> polling on the eventfd.
> >>>
> >>> By using add_wait_queue_priority(), KVM will be able to consistently
> >>> consume events within the kernel without accidentally exposing them
> >>> to userspace when they're supposed to be bypassed. This, in turn, means
> >>> that userspace doesn't have to jump through hoops to avoid listening
> >>> on the erroneously noisy eventfd and injecting duplicate interrupts.
> >>>
> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>  
> >>
> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>  
> > 
> > Thanks. Paolo, the conclusion was that you were going to take this set
> > through the KVM tree, wasn't it?
> >   
> 
> Queued, except for patch 2/3 in the eventfd series which Alex hasn't 
> reviewed/acked yet.

There was no vfio patch here, nor mention why it got dropped in v2
afaict.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority()
  2020-11-06 16:32           ` Alex Williamson
@ 2020-11-06 17:18             ` David Woodhouse
  0 siblings, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2020-11-06 17:18 UTC (permalink / raw)
  To: Alex Williamson, Paolo Bonzini
  Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, kvm



On 6 November 2020 16:32:00 GMT, Alex Williamson <alex.williamson@redhat.com> wrote:
>On Fri, 6 Nov 2020 11:17:21 +0100
>Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 04/11/20 10:35, David Woodhouse wrote:
>> > On Wed, 2020-10-28 at 15:35 +0100, Peter Zijlstra wrote:  
>> >> On Tue, Oct 27, 2020 at 02:39:43PM +0000, David Woodhouse wrote:  
>> >>> From: David Woodhouse <dwmw@amazon.co.uk>
>> >>>
>> >>> This allows an exclusive wait_queue_entry to be added at the
>> >>> head of the queue, instead of the tail as normal. Thus, it gets
>> >>> to consume events first without allowing non-exclusive waiters
>> >>> to be woken at all.
>> >>>
>> >>> The (first) intended use is for KVM IRQFD, which currently has
>> >>> inconsistent behaviour depending on whether posted interrupts are
>> >>> available or not. If they are, KVM will bypass the eventfd
>> >>> completely and deliver interrupts directly to the appropriate
>> >>> vCPU. If not, events are delivered through the eventfd and
>> >>> userspace will receive them when polling on the eventfd.
>> >>>
>> >>> By using add_wait_queue_priority(), KVM will be able to
>> >>> consistently consume events within the kernel without
>> >>> accidentally exposing them to userspace when they're supposed to
>> >>> be bypassed. This, in turn, means that userspace doesn't have to
>> >>> jump through hoops to avoid listening on the erroneously noisy
>> >>> eventfd and injecting duplicate interrupts.
>> >>>
>> >>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>> >>
>> >> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> >
>> > Thanks. Paolo, the conclusion was that you were going to take this
>> > set through the KVM tree, wasn't it?
>> >   
>> 
>> Queued, except for patch 2/3 in the eventfd series which Alex hasn't 
>> reviewed/acked yet.
>
>There was no vfio patch here, nor mention why it got dropped in v2
>afaict.  Thanks,

That was a different (but related) series. The VFIO one is https://patchwork.kernel.org/project/kvm/patch/20201027135523.646811-3-dwmw2@infradead.org/

Thanks.

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup()
  2020-10-27 13:55       ` [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup() David Woodhouse
@ 2020-11-06 23:29         ` Alex Williamson
  2020-11-08  9:17           ` Paolo Bonzini
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2020-11-06 23:29 UTC (permalink / raw)
  To: David Woodhouse, Bonzini, Paolo
  Cc: Cornelia Huck, Alexander Viro, Jens Axboe, kvm, linux-kernel,
	linux-fsdevel

On Tue, 27 Oct 2020 13:55:22 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Don't allow the events to accumulate in the eventfd counter, drain them
> as they are handled.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---

Acked-by: Alex Williamson <alex.williamson@redhat.com>

Paolo, I assume you'll add this to your queue.  Thanks,

Alex

>  drivers/vfio/virqfd.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/vfio/virqfd.c b/drivers/vfio/virqfd.c
> index 997cb5d0a657..414e98d82b02 100644
> --- a/drivers/vfio/virqfd.c
> +++ b/drivers/vfio/virqfd.c
> @@ -46,6 +46,9 @@ static int virqfd_wakeup(wait_queue_entry_t *wait, unsigned mode, int sync, void
>  	__poll_t flags = key_to_poll(key);
>  
>  	if (flags & EPOLLIN) {
> +		u64 cnt;
> +		eventfd_ctx_do_read(virqfd->eventfd, &cnt);
> +
>  		/* An event has been signaled, call function */
>  		if ((!virqfd->handler ||
>  		     virqfd->handler(virqfd->opaque, virqfd->data)) &&


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup()
  2020-11-06 23:29         ` Alex Williamson
@ 2020-11-08  9:17           ` Paolo Bonzini
  0 siblings, 0 replies; 29+ messages in thread
From: Paolo Bonzini @ 2020-11-08  9:17 UTC (permalink / raw)
  To: Alex Williamson, David Woodhouse
  Cc: Cornelia Huck, Alexander Viro, Jens Axboe, kvm, linux-kernel,
	linux-fsdevel

On 07/11/20 00:29, Alex Williamson wrote:
>> From: David Woodhouse<dwmw@amazon.co.uk>
>>
>> Don't allow the events to accumulate in the eventfd counter, drain them
>> as they are handled.
>>
>> Signed-off-by: David Woodhouse<dwmw@amazon.co.uk>
>> ---
> Acked-by: Alex Williamson<alex.williamson@redhat.com>
> 
> Paolo, I assume you'll add this to your queue.  Thanks,
> 
> Alex
> 

Yes, thanks.

Paolo


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-11-08  9:17 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-26 17:53 [RFC PATCH 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
2020-10-26 17:53 ` [RFC PATCH 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
2020-10-27  8:01   ` Paolo Bonzini
2020-10-27 10:15     ` David Woodhouse
2020-10-27 13:55     ` [PATCH 0/3] Allow in-kernel consumers to drain events from eventfd David Woodhouse
2020-10-27 13:55       ` [PATCH 1/3] eventfd: Export eventfd_ctx_do_read() David Woodhouse
2020-10-27 13:55       ` [PATCH 2/3] vfio/virqfd: Drain events from eventfd in virqfd_wakeup() David Woodhouse
2020-11-06 23:29         ` Alex Williamson
2020-11-08  9:17           ` Paolo Bonzini
2020-10-27 13:55       ` [PATCH 3/3] kvm/eventfd: Drain events from eventfd in irqfd_wakeup() David Woodhouse
2020-10-27 18:41         ` kernel test robot
2020-10-27 21:42         ` kernel test robot
2020-10-27 23:13         ` kernel test robot
2020-10-27 14:39 ` [PATCH v2 0/2] Allow KVM IRQFD to consistently intercept events David Woodhouse
2020-10-27 14:39   ` [PATCH v2 1/2] sched/wait: Add add_wait_queue_priority() David Woodhouse
2020-10-27 19:09     ` Peter Zijlstra
2020-10-27 19:27       ` David Woodhouse
2020-10-27 20:30         ` Peter Zijlstra
2020-10-27 20:49           ` David Woodhouse
2020-10-27 21:32           ` David Woodhouse
2020-10-28 14:20             ` Peter Zijlstra
2020-10-28 14:44               ` Paolo Bonzini
2020-10-28 14:35     ` Peter Zijlstra
2020-11-04  9:35       ` David Woodhouse
2020-11-04 11:25         ` Paolo Bonzini
2020-11-06 10:17         ` Paolo Bonzini
2020-11-06 16:32           ` Alex Williamson
2020-11-06 17:18             ` David Woodhouse
2020-10-27 14:39   ` [PATCH v2 2/2] kvm/eventfd: Use priority waitqueue to catch events before userspace David Woodhouse
